Re: Code of Conduct plan
On Mon, Jun 4, 2018 at 6:59 PM, Joshua D. Drake wrote: > On 06/03/2018 04:08 PM, Gavin Flower wrote: > > My comments: >>> >>> 1) Reiterate my contention that this is a solution in search of a problem. >>> Still, it looks like it is going forward, so see below. >>> >>> 2) "... engaging in behavior that may bring the PostgreSQL project into >>> disrepute, ..." >>> This to me is overly broad and pulls in actions that may happen outside >>> the community. Those, if they are actually an issue, should be handled where >>> they occur, not here. >>> >> > This is a good point. There are those who would think that one has performed > an action that brings the project into disrepute, and a similarly sized group > that suggests that in fact that isn't the case. Based on the CoC, this would > be judged by the CoC committee. > > It is my hope that PostgreSQL.Org -Core chooses members for that committee > that are exceedingly diverse; otherwise it is just an echo chamber for a > single ideology, and that will destroy this community. If I may suggest: the committee should also be international and include people from around the world. The last thing we want is for it to be dominated by people from one particular cultural viewpoint. > > > >>> 3) "... members must be sensitive to conduct that may be considered >>> offensive by fellow members and must refrain from engaging in such conduct. >>> " >>> >> > Again overly broad, especially given the hypersensitivity of people these >>> days. I have found that it is enough to disagree with someone to have it >>> called offensive. This section should be removed, as proscribed behavior is >>> called out in detail in the paragraphs above it. >>> >> > "considered offensive by fellow members" > > Is definitely too broad. The problem comes in here: > > I might possibly say that "I'm the master in this area" when talking to >> someone on a technical subject. In the sense that I'm better at that >> particular skill, but some hypersensitive American could get their knickers >> in a twist (notice that, in this context, no gender is implied -- also, >> using that expression "get their knickers in a twist" could offend >> some snowflake) claiming that I'm suggesting that whoever >> > > "snowflake": I find that term hilarious; others find it highly offensive. > Which is correct? I agree with both concerns in the above exchange. This is a common economic project. The goal should be for people to come together and act civilly. Waging culture war using the code of conduct should itself be a violation of the code of conduct, and this applies to *all* (not just one or two) sides. > > > I'm talking to is my slave! I heard of an American university that >> doesn't want people to use the term master, like in an MSc, because of the >> history of slavery. >> > > The PostgreSQL project has already run into this problem; note that we don't use the > terms Master and Slave in reference to replication anymore. > > >> I've used the expressions "sacrifice a willing virgin" and "offering my >> first born to the gods" as ways to ensure success in resolving a technical >> issue. The people I say that to know what I mean -- and they implicitly >> know that I'm not seriously suggesting such conduct. Yet, if I wrote that >> publicly, it is conceivable that someone might object! >> > > Yes, and that is a problem. We need to have some simple baseline of > acceptance that we are all adults here (or should act like adults). Knowing > your audience is important. 
I would also point out that the PostgreSQL community is nice and mature. At PGConf US I saw what appeared to be two individuals with red MAGA hats. And yet everyone managed to be civil. We manage to do better than the US does on the whole in this regard, and we should be proud of ourselves. > > > Consider a past advertising campaign in Australia to sell government >> Bonds. They used two very common hand gestures that are very Australian. >> Bond sales dropped. On investigation, they found the bonds were mainly >> bought by old Greek people, who found the gestures obscene. The gestures? >> Thumbs up, and the okay gesture formed by touching the thumb with the next >> finger -- nothing sexually suggestive to most Australians, but traditional >> Greeks found them offensive. >> > > Using Australia as an example, my understanding is that the word c**t is > part of the everyday nomenclature, but in the States the word is taboo and highly frowned > upon. Again, the key point is that a CoC committee needs to be international and used to addressing these sorts of issues. > > > Be very careful in attempting to codify 'correct' behaviour! >> >> > Correct. I think one way to look at all of this is, "if you wouldn't say > it to your boss or a client, don't say it here". That too has problems, but > generally speaking I think it keeps the restrictions rational. > > I will post a more specific set of thoughts here but in general I think the presumption ought to be that people are trying to work t
Re: commitfest 2018-07
On 2018-06-05 10:20:55 -0400, Tom Lane wrote: > Andres Freund writes: > > I'd rather create a new 2018-07, and just manually move old patches to > > it. Otherwise we'll not really focus on the glut of old things, but > > everyone just restarts working on their own new thing. > > I thought the idea was to clear out the underbrush of small, ready-to-go > patches. How old they are doesn't enter into that. > > There's a separate issue about what to do to prioritize old patches so > they don't linger indefinitely. We had a discussion about that at the > dev meeting, but I don't think any specific mechanism was agreed to? I think we've not fully agreed on that. I'd argue we should manually filter things into the next CF. And both small patches and older things should qualify. Greetings, Andres Freund
Re: Concurrency bug in UPDATE of partition-key
Attached is a rebased patch version. Also included it in the upcoming commitfest : https://commitfest.postgresql.org/18/1660/ In the rebased version, the new test cases are added in the existing isolation/specs/partition-key-update-1.spec test. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company fix_concurrency_bug_rebased.patch Description: Binary data
automating perl compile time checking
Here is a follow-up patch to last week's commit allowing all perl files to be checked clean for compile time errors and warnings. The patch contains a simple script to run the checks. The code that finds perl files is put in a function in a single file that is sourced by the three locations that need it. The directory pgperlcritic is renamed to perlcheck, as it now contains the new script as well as pgperlcritic. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 6247cd1fbba16426bb1745521d7b444d60ebbd0a Mon Sep 17 00:00:00 2001 From: Andrew Dunstan Date: Tue, 5 Jun 2018 10:25:55 -0400 Subject: [PATCH] Extend and slightly refactor perl checking A script is added to check for perl compile time errors and warnings. The code to detect all the perl files is put in a central place and included in the three places it is now needed. The directory pgperlcritic is renamed to perlcheck, and contains the new check script as well as pgperlcritic. --- src/tools/perlcheck/find_perl_files| 15 src/tools/{pgperlcritic => perlcheck}/perlcriticrc | 0 src/tools/perlcheck/pgperlcritic | 20 src/tools/perlcheck/pgperlsyncheck | 17 + src/tools/pgindent/pgperltidy | 14 +++ src/tools/pgperlcritic/pgperlcritic| 28 -- 6 files changed, 55 insertions(+), 39 deletions(-) create mode 100644 src/tools/perlcheck/find_perl_files rename src/tools/{pgperlcritic => perlcheck}/perlcriticrc (100%) create mode 100755 src/tools/perlcheck/pgperlcritic create mode 100755 src/tools/perlcheck/pgperlsyncheck delete mode 100755 src/tools/pgperlcritic/pgperlcritic diff --git a/src/tools/perlcheck/find_perl_files b/src/tools/perlcheck/find_perl_files new file mode 100644 index 000..e10466a --- /dev/null +++ b/src/tools/perlcheck/find_perl_files @@ -0,0 +1,15 @@ + +# src/tools/perlcheck/find_perl_files + +# shell function to find all perl files in the source tree + +find_perl_files () { +{ + # take all .pl and .pm files + find . -type f -name '*.p[lm]' -print + # take executable files that file(1) thinks are perl files + find . -type f -perm -100 -exec file {} \; -print | + egrep -i ':.*perl[0-9]*\>' | + cut -d: -f1 + } | sort -u +} diff --git a/src/tools/pgperlcritic/perlcriticrc b/src/tools/perlcheck/perlcriticrc similarity index 100% rename from src/tools/pgperlcritic/perlcriticrc rename to src/tools/perlcheck/perlcriticrc diff --git a/src/tools/perlcheck/pgperlcritic b/src/tools/perlcheck/pgperlcritic new file mode 100755 index 000..6a31f67 --- /dev/null +++ b/src/tools/perlcheck/pgperlcritic @@ -0,0 +1,20 @@ +#!/bin/sh + +# src/tools/pgperlcritic/pgperlcritic + +test -f src/tools/perlcheck/perlcriticrc || { + echo could not find src/tools/perlcheck/perlcriticrc + exit 1 + } + +set -e + +# set this to override default perlcritic program: +PERLCRITIC=${PERLCRITIC:-perlcritic} + +. src/tools/perlcheck/find_perl_files + +find_perl_files | xargs $PERLCRITIC \ + --quiet \ + --program-extensions .pl \ + --profile=src/tools/perlcheck/perlcriticrc diff --git a/src/tools/perlcheck/pgperlsyncheck b/src/tools/perlcheck/pgperlsyncheck new file mode 100755 index 000..4595d16 --- /dev/null +++ b/src/tools/perlcheck/pgperlsyncheck @@ -0,0 +1,17 @@ +#!/bin/sh + +# script to detect compile time errors and warnings in all perl files + +INCLUDES="-I src/tools/msvc -I src/tools/msvc/dummylib -I src/backend/catalog" +INCLUDES="-I src/test/perl -I src/backend/utils/mb/Unicode $INCLUDES" +INCLUDES="-I src/bin/pg_rewind -I src/test/ssl $INCLUDES" + +set -e + +. 
src/tools/perlcheck/find_perl_files + +# for zsh +setopt shwordsplit 2>/dev/null || true + +find_perl_files | xargs -L 1 perl $INCLUDES -cw 2>&1 | grep -v OK + diff --git a/src/tools/pgindent/pgperltidy b/src/tools/pgindent/pgperltidy index 5d9aa7c..5e70411 100755 --- a/src/tools/pgindent/pgperltidy +++ b/src/tools/pgindent/pgperltidy @@ -7,14 +7,6 @@ set -e # set this to override default perltidy program: PERLTIDY=${PERLTIDY:-perltidy} -# locate all Perl files in the tree -( - # take all .pl and .pm files - find . -type f -a \( -name '*.pl' -o -name '*.pm' \) - # take executable files that file(1) thinks are perl files - find . -type f -perm -100 -exec file {} \; | - egrep -i ':.*perl[0-9]*\>' | - cut -d: -f1 -) | -sort -u | -xargs $PERLTIDY --profile=src/tools/pgindent/perltidyrc +. src/tools/perlcheck/find_perl_files + +find_perl_files | xargs $PERLTIDY --profile=src/tools/pgindent/perltidyrc diff --git a/src/tools/pgperlcritic/pgperlcritic b/src/tools/pgperlcritic/pgperlcritic deleted file mode 100755 index 28264b1..000 --- a/src/tools/pgperlcritic/pgperlcritic +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/sh - -# src/tools/pgperlcritic/pgperlcritic - -test -f src/tools/pgperlcritic/perlcriticrc || {
Re: plans for PostgreSQL 12
2018-06-05 15:00 GMT+02:00 Andres Freund : > Hi, > > On 2018-06-05 06:32:31 +0200, Pavel Stehule wrote: > > ./configure --with-libxml --enable-tap-tests --enable-debug --with-perl > > CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" > > > > [pavel@nemesis postgresql]$ gcc --version > > gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1) > > > > I executed a simple script > > > > do $$ declare i bigint = 1; s bigint = 0; begin while i <= 1 > loop > > s := s + i; i := i + 1; end loop; raise notice '%', s; end $$; > > > >7,68% postmaster postgres [.] > > GetSnapshotData >7,53% postmaster plpgsql.so [.] > > exec_eval_simple_expr >6,49% postmaster postgres [.] > > It seems to me the right fix here isn't a new class of functions, but > rather support for delaying the computation of the snapshot to the point > it's needed. That'll be far more generically applicable and doesn't > require user interaction. > Good idea. That could be a quick fix. > > > > ExecInterpExpr >4,13% postmaster postgres [.] > > So we're going to need to optimize this further as well, I've a pending > patch for that, luckily ;) > nice :) > > LWLockRelease >4,12% postmaster postgres [.] > > That's also GetSnapshotData()... > there is still about 10% spent locking and unlocking the plan cache. Regards Pavel > Greetings, > > Andres Freund >
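To illustrate the "delay the snapshot until it's needed" idea in generic terms, here is a minimal C sketch; the types and function names are invented for illustration and are not the actual snapshot-manager code.

#include <stdbool.h>

/* Hypothetical stand-in for a snapshot; real fields omitted. */
typedef struct LazySnapshot
{
    bool        computed;
    /* ... xmin, xmax, running-xact list, etc. would live here ... */
} LazySnapshot;

/* The expensive part, analogous to GetSnapshotData(): run only on demand. */
static void
compute_snapshot(LazySnapshot *snap)
{
    /* ... scan shared state and fill in the snapshot fields ... */
    snap->computed = true;
}

/* Callers obtain the snapshot lazily, paying the cost only on first use. */
static LazySnapshot *
get_snapshot_lazily(LazySnapshot *snap)
{
    if (!snap->computed)
        compute_snapshot(snap);
    return snap;
}

In such a scheme, a simple expression that never reads table data would never trigger the expensive computation at all, which is the benefit being suggested here.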
Re: adding tab completions
On Sun, Jun 03, 2018 at 10:39:22PM -0500, Justin Pryzby wrote: > Find attached an update which also supports column completion using the legacy > non-parenthesized syntax. Thank you! > BTW..should that be toast.tuple_target ?? I think it shouldn't. From the documentation: "This parameter cannot be set for TOAST tables". > > Also I think it could be good to list column names after parentheses, > > but I'm not sure if it is easy to implement. > I tried this and nearly gave up, but see attached. After some thought, I now think that this is not such a good idea. The code doesn't look good either. It isn't worth it; sorry for the distraction. Moreover, there is no such completion, for example, for the following command (it shows only the first column): CREATE INDEX ON test ( > - "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION", > "RULE", > + "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION", Is this a typo? I think it is still possible to comment on rules. > else if (HeadMatches1("EXPLAIN") && previous_words_count==2 && > prev_wd[0]=='(' && ends_with(prev_wd, ')')) I think this condition can be replaced by: else if (TailMatches2("EXPLAIN", MatchAny) && ends_with(prev_wd, ')')) It looks better to me. Such a condition is used for other commands and works the same way. The last point I've noticed: there is no VERBOSE entry after the VACUUM FULL ANALYZE command anymore. I'm not sure how this patch should be committed. Can it be committed outside the commitfest? Otherwise, please add it to the next commitfest so it isn't forgotten. Thoughts? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: commitfest 2018-07
Andres Freund writes: > I'd rather create a new 2018-07, and just manually move old patches to > it. Otherwise we'll not really focus on the glut of old things, but > everyone just restarts working on their own new thing. I thought the idea was to clear out the underbrush of small, ready-to-go patches. How old they are doesn't enter into that. There's a separate issue about what to do to prioritize old patches so they don't linger indefinitely. We had a discussion about that at the dev meeting, but I don't think any specific mechanism was agreed to? regards, tom lane
Re: install doesn't work on HEAD
Hi, On Tue, Jun 5, 2018 at 7:08 PM, Amit Kapila wrote: > > Hi, > > On my win-7, I am facing $SUBJECT. I am consistently getting below error: > Copying build output files...Could not copy postgres.exe > at install.pl line 26. > > On digging, I found that it started failing with commit 3a7cc727 (Don't fall > off the end of perl functions). It seems that we have added empty 'return' > in all of the functions which seems to break one of the case: > > lcopy("$conf\\$pf\\$pf.$ext", "$target\\$dir\\$pf.$ext") > || croak "Could not copy $pf.$ext\n"; > > I think the return from lcopy implies 0 (or undef) which cause it to fail. > By reading couple of articles on net [1][2] related to this, it seems we > can't blindly return in all cases. On changing, the lcopy as below, install > appears to be working. > > @@ -40,7 +40,7 @@ sub lcopy > copy($src, $target) > || confess "Could not copy $src to $target\n"; > - return; > + return 1; > } > > I have randomly check some of the other functions where this patch has added > 'return' and those seem to be fine. > > Is it by any chance related Perl version or some other settings? If not, > then we should do something for it. > > [1] - https://perlmaven.com/how-to-return-undef-from-a-function > [2] - > http://search.cpan.org/dist/Perl-Critic/lib/Perl/Critic/Policy/Subroutines/RequireFinalReturn.pm I am not able to reproduce this issue on HEAD. Seems like the following git-commit fixes it, commit 01deec5f8ae64b5120cc8c93d54fe0e19e477b02 Author: Andrew Dunstan Date: Mon May 28 16:44:13 2018 -0400 Return a value from Install.pm's lcopy function Commit 3a7cc727c was a little over eager about adding an explicit return to this function, whose value is checked in most call sites. This change reverses that and returns the expected value explicitly. It also adds a check to the one call site lacking one. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com
Re: Loaded footgun open_datasync on Windows
Amit Kapila wrote: > On Fri, Jun 1, 2018 at 8:18 PM, Laurenz Albe wrote: > > Amit Kapila wrote: > > > On Fri, Jun 1, 2018 at 3:13 PM, Laurenz Albe > > > wrote: > > > > I recently read our documentation about reliability on Windows: > > > > > > > > > On Windows, if wal_sync_method is open_datasync (the default), write > > > > > caching can > > > > > be disabled by unchecking > > > > > My Computer\Open\disk > > > > > drive\Properties\Hardware\Properties\Policies\Enable write caching > > > > > on the disk. Alternatively, set wal_sync_method to fsync or > > > > > fsync_writethrough, > > > > > which prevent write caching. > > > > > > > > It seems dangerous to me to initialize "wal_sync_method" to a method > > > > that is unsafe > > > > by default. Admittedly I am not a Windows man, but the fact that this > > > > has eluded me > > > > up to now leads me to believe that other people running PostgreSQL on > > > > Windows might > > > > also have missed that important piece of advice and are consequently > > > > running with > > > > an unsafe setup. > > > > > > > > Wouldn't it be smarter to set a different default value on Windows, > > > > like we do on > > > > Linux (for other reasons)? > > > > > > > > > > One thing to note is that it seems that in code we use > > > FILE_FLAG_WRITE_THROUGH for > > > open_datasync which according to MSDN [1] will bypass any intermediate > > > cache. > > > See pgwin32_open. Have you experimented with setting any other option, as we > > > have a comment > > > in code which says Win32 only has O_DSYNC? > > > > > > > > > [1] - > > > https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx > > > > After studying the code I feel somewhat safer; it looks like the code is ok. > > I have no Windows at hand, so I cannot test right now. > > > > What happened is that I ran "pg_test_fsync" at a customer site on Windows, > > and > > it returned ridiculously high rates for open_datasync. > > > > So I think that the following should be fixed: > > > > - Change pg_test_fsync to actually use FILE_FLAG_WRITE_THROUGH. > > It sounds sensible to me to make a Windows specific change in pg_test_fsync > for the open_datasync method. > That will make pg_test_fsync behave similarly to the server. The attached patch makes pg_test_fsync use pgwin32_open on Windows, which is what we use elsewhere. That should fix the problem. Is there a better way to do this? The problem is that "c.h" is only included at the very end of "postgres_fe.h", which makes front-end code use "open" rather than "pgwin32_open" on Windows. Having read it again, I think that the documentation is fine as it is: after all, this is just advice on what you can do if you are running on unsafe hardware, which doesn't flush to disk like it should. Yours, Laurenz Albe From 6ab651c48df66ec93b8d6730bebeaf60e2bb865b Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Tue, 5 Jun 2018 16:05:10 +0200 Subject: [PATCH] Make pg_test_fsync use pgwin32_open on Windows --- src/bin/pg_test_fsync/pg_test_fsync.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c index e6f7ef8557..a0139a1302 100644 --- a/src/bin/pg_test_fsync/pg_test_fsync.c +++ b/src/bin/pg_test_fsync/pg_test_fsync.c @@ -3,6 +3,8 @@ * tests all supported fsync() methods */ +/* we have to include c.h first so that pgwin32_open is used on Windows */ +#include "c.h" #include "postgres_fe.h" #include -- 2.14.3
Re: commitfest 2018-07
Michael Paquier writes: > Okay. If we tend toward this direction, I propose to do this switch in > two days my time (Thursday afternoon in Tokyo) if there are no > objections, so as anybody has hopefully time to argue back. I think we probably have to wait longer. It's not clear to me when the RMT will decide that the 07 fest is go or no-go, but surely they've not decided yet. regards, tom lane
Re: libpq compression
On 05.06.2018 09:04, Thomas Munro wrote: On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik wrote: Concerning specification of compression level: I have run many experiments with different data sets and both zlib/zstd, and in both cases using a compression level higher than the default doesn't give a noticeable increase in compression ratio, but quite significantly reduces speed. Moreover, for "pgbench -i" zstd provides a better compression ratio (63 times!) with compression level 1 than with the largest recommended compression level 22! This is why I decided not to allow the user to choose the compression level. Speaking of configuration, are you planning to support multiple compression libraries at the same time? It looks like the current patch implicitly requires client and server to use the same configure option, without any attempt to detect or negotiate. Do I guess correctly that a library mismatch would produce an incomprehensible corrupt stream message? Frankly speaking, I am not sure that support for multiple compression libraries at the same time is actually needed. If we build Postgres from sources, then both frontend and backend libraries will use the same compression library. zlib is available almost everywhere, and Postgres is using it in any case. zstd is faster and provides a better compression ratio. So in principle it may be useful to try zstd first and, if it is not available, fall back to zlib. That would require dynamic loading of these libraries. libpq stream compression is not the only place where compression is used in Postgres. So I think that the problem of choosing a compression algorithm and supporting custom compression methods should be solved at some upper level. There is a patch for custom compression methods in the commitfest. Maybe it should be combined with this one. Right now, if the client and server libpq libraries were built with different compression libraries, a decompression error will be reported. Supporting multiple compression methods will require a more sophisticated handshake protocol so that the client and server can choose a compression method which is supported by both of them. But certainly it can be done. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
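To sketch what such a handshake could look like in C (invented function and parameter names; this is not part of the patch or of the libpq protocol), each side could advertise the algorithms it was built with and the first match could win:

#include <string.h>

/*
 * Pick the first compression algorithm supported by both sides, preferring
 * the client's ordering.  Returns NULL if there is no common algorithm, in
 * which case the connection would simply proceed uncompressed.
 */
static const char *
choose_compression(const char *const *client_algs, int nclient,
                   const char *const *server_algs, int nserver)
{
    for (int i = 0; i < nclient; i++)
        for (int j = 0; j < nserver; j++)
            if (strcmp(client_algs[i], server_algs[j]) == 0)
                return client_algs[i];
    return NULL;
}

With something along these lines, a client built with both zstd and zlib could still talk to a server built with only zlib, and a complete mismatch would degrade to uncompressed traffic instead of a corrupt-stream error.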
Re: libpq compression
On 05.06.2018 08:26, Thomas Munro wrote: On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik wrote: Thank you for this notice. Updated and rebased patch is attached. Hi Konstantin, Seems very useful. +1. + rc = inflate(&zs->rx, Z_SYNC_FLUSH); + if (rc != Z_OK) + { + return ZPQ_DECOMPRESS_ERROR; + } Does this actually guarantee that zs->rx.msg is set to a string? I looked at some documentation here: https://www.zlib.net/manual.html It looks like return value Z_DATA_ERROR means that msg is set, but for the other error codes Z_STREAM_ERROR, Z_BUF_ERROR, Z_MEM_ERROR it doesn't explicitly say that. From a casual glance at https://github.com/madler/zlib/blob/master/inflate.c I think it might be set to Z_NULL and then never set to a string except in the mode = BAD paths that produce the Z_DATA_ERROR return code. That's interesting because later we do this: + if (r == ZPQ_DECOMPRESS_ERROR) + { + ereport(COMMERROR, + (errcode_for_socket_access(), + errmsg("Failed to decompress data: %s", zpq_error(PqStream; + return EOF; ... where zpq_error() returns zs->rx.msg. That might crash or show "(null)" depending on libc. Also, message style: s/F/f/ +ssize_t zpq_read(ZpqStream* zs, void* buf, size_t size, size_t* processed) Code style: We write "Type *foo", not "Type* var". We put the return type of a function definition on its own line. It looks like there is at least one place where zpq_stream.o's symbols are needed but it isn't being linked in, so the build fails in some ecpg stuff reached by make check-world: gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -pthread -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS test1.o -L../../../../../src/port -L../../../../../src/common -L../../ecpglib -lecpg -L../../pgtypeslib -lpgtypes -L../../../../../src/interfaces/libpq -lpq -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -lcrypt -ldl -lm -o test1 ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_free' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_error' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_read' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_buffered' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_create' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_write' Hi Thomas, Thank you for review. Updated version of the patch fixing all reported problems is attached. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/configure b/configure index 0aafd9c..fc5685c 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ ELF_SYS EGREP GREP with_zlib +with_zstd with_system_tzdata with_libxslt with_libxml @@ -863,6 +864,7 @@ with_libxml with_libxslt with_system_tzdata with_zlib +with_zstd with_gnu_ld enable_largefile enable_float4_byval @@ -8017,6 +8019,86 @@ fi # +# ZStd +# + + + +# Check whether --with-zstd was given. +if test "${with_zstd+set}" = set; then : + withval=$with_zstd; + case $withval in +yes) + ;; +no) + : + ;; +*) + as_fn_error $? 
"no argument expected for --with-zstd option" "$LINENO" 5 + ;; + esac + +else + with_zstd=no + +fi + + + + +if test "$with_zstd" = yes ; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5 +$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; } +if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_check_lib_save_LIBS=$LIBS +LIBS="-lzstd $LIBS" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char ZSTD_compress (); +int +main () +{ +return ZSTD_compress (); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + ac_cv_lib_zstd_ZSTD_compress=yes +else + ac_cv_lib_zstd_ZSTD_compress=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +LIBS=$ac_check_lib_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5 +$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; } +if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_LIBZSTD 1 +_ACEOF + + LIBS="-lzstd $LIBS" + +else + as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5 +fi + +fi + + + +# # Elf # diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 95d090e..b59629c 100644 --- a/src/Makefi
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund wrote: > On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote: > > It appears that buffer replacement happening inside the relation > > extension lock is affected by starvation on exclusive buffer mapping > > lwlocks and buffer content lwlocks, caused by many concurrent shared > > lockers. So, the fair lwlock patch has no direct influence on the relation > > extension lock, which is naturally not even an lwlock... > > Yea, that makes sense. I wonder how much the fix here is to "pre-clear" > a victim buffer, and how much is a saner buffer replacement > implementation (either by going away from O(NBuffers), or by having a > queue of clean victim buffers like my bgwriter replacement). The particular thing I observed in our environment is BufferAlloc() waiting for hours on a buffer partition lock. Increasing NUM_BUFFER_PARTITIONS didn't give any significant help. It appears that a very hot page (the root page of some frequently used index) resides in that partition, so this partition was continuously under shared lock. So, in order to resolve this without changing LWLock, we should probably move our buffer hash table to something lockless. > > I'll post the fair lwlock patch in a separate thread. It requires detailed > > consideration and benchmarking, because there is a risk of regression > > on specific workloads. > > I bet that doing it naively will regress massively in a number of cases. Yes, I suspect the same. However, I tend to think that something is wrong with LWLock itself. It seems to be the only one of our locks that lets some lockers starve almost indefinitely under certain workloads. In contrast, even our SpinLock gives all the waiting processes nearly the same chance to acquire it. So, I think the idea of improving LWLock in this respect deserves at least further investigation. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:41 PM, se...@rielau.com wrote: In our code base we have added a WithStats-Flavor for creating memory contexts. This api accepts a pointer to metric for accounting and it is inherited by all subcontexts unless overridden. So we only needed to change context creation API where we wanted (such as TopTansactionContext, Message Context, ..) That's quite trivial, actually. I think that's pretty much what we tried when this patch was first discussed in 2015, but it added measurable overhead (1-3%) even when the accounting was disabled. Which is not quite desirable, of course. Also we have fixed all those missing hash spills - albeit based on the 9.6 hash table design I think. I don't think this part of the code changed all that much. If there is interest by the community we are very willing to share. Absolutely! I think it'd be great to share both the code and the reasoning behind the design. Cheers Serge Salesforce cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Locking B-tree leafs immediately in exclusive mode
Hi! Currently _bt_search() always locks the leaf buffer in shared mode (aka BT_READ), while the caller can relock it later. However, I don't see what prevents _bt_search() from locking the leaf immediately in exclusive mode (aka BT_WRITE) when required. When traversing a downlink from a non-leaf page of level 1 (the lowest level of non-leaf pages, just prior to the leaf pages), we know that the next page is going to be a leaf. In this case, we can immediately lock the next page in exclusive mode if required. For sure, it might happen that we don't manage to exclusively lock the leaf this way when _bt_getroot() points us to a leaf page, but that case can be handled in _bt_search() by a relock. Please find an implementation of this approach in the attached patch. I've run the following simple test of this patch on a 72-core machine. # postgresql.conf max_connections = 300 shared_buffers = 32GB fsync = off synchronous_commit = off # DDL: CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null); CREATE UNLOGGED TABLE unordered (i integer not null, value text not null); # script_ordered.sql INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz'); # script_unordered.sql \set i random(1, 100) INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz'); # commands pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres # results ordered, master: 157473 TPS ordered, patched: 231374 TPS unordered, master: 232372 TPS unordered, patched: 232535 TPS As you can see, the difference in the unordered case is negligible: due to random inserts, concurrency on particular leaves is low. But ordered insertion is almost 50% faster on the patched version. I wonder how we could have missed such a simple optimization until now, but I also don't see this patch breaking anything. In the patched version, it might happen that we have to traverse rightlinks in exclusive mode due to splits concurrent with the downlink traversal. However, the same might happen in current master due to splits concurrent with relocks. So, I don't expect any performance regression to be caused by this patch. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company btree-search-write-lock-1.patch Description: Binary data
RE: Re: Spilling hashed SetOps and aggregates to disk
In our code base we have added a WithStats flavor for creating memory contexts. This API accepts a pointer to a metric for accounting, and it is inherited by all subcontexts unless overridden. So we only needed to change the context creation API where we wanted (such as TopTransactionContext, MessageContext, ...). That's quite trivial, actually. Also, we have fixed all those missing hash spills, albeit based on the 9.6 hash table design, I think. If there is interest from the community, we are very willing to share. Cheers Serge Salesforce
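For readers who have not seen the earlier accounting proposals, the idea can be illustrated with a stripped-down sketch in plain C; the names are invented and this is not PostgreSQL's MemoryContext API. An accounting pointer is supplied at context creation and inherited by child contexts unless overridden, so all allocations under, say, a per-query context roll up into one counter.

#include <stdlib.h>

typedef struct StatsContext
{
    struct StatsContext *parent;
    size_t     *used;           /* shared accounting target, may be NULL */
} StatsContext;

static StatsContext *
stats_context_create(StatsContext *parent, size_t *metric)
{
    StatsContext *cxt = malloc(sizeof(StatsContext));

    cxt->parent = parent;
    /* inherit the parent's metric unless the caller overrides it */
    cxt->used = metric ? metric : (parent ? parent->used : NULL);
    return cxt;
}

static void *
stats_alloc(StatsContext *cxt, size_t size)
{
    if (cxt->used)
        *cxt->used += size;     /* account before handing out the memory */
    return malloc(size);
}

A matching free path would subtract the size again; checking a memory limit then only requires reading a single counter, no matter how many child contexts exist underneath.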
Re: why partition pruning doesn't work?
On Tue, Jun 5, 2018 at 6:24 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: >> On 5 June 2018 at 12:31, Amit Langote wrote: >> >> doesn't look quite right. What says expr is really a Param? The patch >> appears to work because, by setting pinfo->execparams to *something*, it >> triggers execution-time pruning to run; its contents aren't necessarily >> used during execution pruning. In fact, it would've crashed if the >> execution-time pruning code had required execparams to contain *valid* >> param id, but currently it doesn't. >> >> What I think we'd need to do to make this work is to make execution-time >> pruning be invoked even if there aren't any Params involved. IOW, let's >> try to teach make_partition_pruneinfo that it can go ahead also in the >> cases where there are expressions being compared with the partition key >> that contain (only) stable functions. Then, go and fix the >> execution-pruning code to not *always* expect there to be Params to prune >> with. > Yeah, I agree - I copied this approach mindlessly from the original hacky > patch. So, looks like it's necessary to have something like got_stable_expr > together with gotparam. I think the current code relies heavily on Params being present for partition pruning, which need not be the case. Runtime partition pruning is possible when there are comparison conditions with partition key expressions on one side and "execution time constant" expressions on the other side. By "execution time constant" expression, I mean any expression that evaluates to a constant at the time of execution, like stable expressions (not just functions) or a Param expression. I can think of only these two at this time, but there can be more. So, gotparam should be renamed to "gotprunable_cond" to be generic. pull_partkey_params() should be renamed to "pull_partkey_conds" or something generic. That function would return true if there exists an expression in the steps which can be evaluated to a constant at runtime; otherwise it returns false. My guess is there will be false-positives which need to be dealt with later, but there will be no false-negatives. > And after that the only place where I see Params > are in use is partkey_datum_from_expr where all the stuff is actually > evaluated. So apparently this part about "fix the execution-pruning code to > not > *always* expect there to be Params to prune with" will be only about this > function - am I correct or there is something else that I missed? Yes. But I think trying to evaluate parameters in this function is not good. The approach of folding constant expressions before or immediately after execution starts doesn't require the expressions to be evaluated in partkey_datum_from_expr, and might benefit other places where stable expressions or params can appear. Another problem with partkey_datum_from_expr() seems to be that it evaluates only Param nodes, but not expressions involving parameters which could be folded into constants at runtime. Take, for example, the following queries on table t1 with two partitions (0, 100) and (100, 200), populated using "insert into t1 select i, i from generate_series(0, 199) i;". There's an index on t1(a). 
explain analyze select * from t1 x left join t1 y on x.a = y.b where y.a = 5; QUERY PLAN Nested Loop (cost=0.00..6.78 rows=1 width=16) (actual time=0.033..0.066 rows=1 loops=1) -> Append (cost=0.00..2.25 rows=1 width=8) (actual time=0.019..0.035 rows=1 loops=1) -> Seq Scan on t1p1 y (cost=0.00..2.25 rows=1 width=8) (actual time=0.018..0.035 rows=1 loops=1) Filter: (a = 5) Rows Removed by Filter: 99 -> Append (cost=0.00..4.51 rows=2 width=8) (actual time=0.011..0.027 rows=1 loops=1) -> Seq Scan on t1p1 x (cost=0.00..2.25 rows=1 width=8) (actual time=0.006..0.022 rows=1 loops=1) Filter: (y.b = a) Rows Removed by Filter: 99 -> Seq Scan on t1p2 x_1 (cost=0.00..2.25 rows=1 width=8) (never executed) Filter: (y.b = a) Planning Time: 0.644 ms Execution Time: 0.115 ms (13 rows) t1p2 x_1 is never scanned indicating that run time partition pruning happened. But then see the following query postgres:17889=#explain analyze select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5; QUERY PLAN -- Nested Loop (cost=0.00..7.28 rows=1 width=16) (actual time=0.055..0.093 rows=1 loops=1) -> Append (cost=0.00..2.25 rows=1 width=8) (actual time=0.017..0.034 rows=1 loops=1) -> Seq Scan on t1p1 y (cost=0.00..2.25 rows=1 width=8) (actual time=0.016..0.033 rows=1 loops=1) Filter: (a = 5) Rows Removed by Filter: 99
install doesn't work on HEAD
Hi, On my win-7, I am facing $SUBJECT. I am consistently getting the below error: Copying build output files...Could not copy postgres.exe at install.pl line 26. On digging, I found that it started failing with commit 3a7cc727 (Don't fall off the end of perl functions). It seems that we have added an empty 'return' in all of the functions, which breaks one of the cases: lcopy("$conf\\$pf\\$pf.$ext", "$target\\$dir\\$pf.$ext") || croak "Could not copy $pf.$ext\n"; I think the return from lcopy implies 0 (or undef), which causes it to fail. From reading a couple of articles on the net [1][2] related to this, it seems we can't blindly return in all cases. On changing lcopy as below, install appears to be working. @@ -40,7 +40,7 @@ sub lcopy copy($src, $target) || confess "Could not copy $src to $target\n"; - return; + return 1; } I have randomly checked some of the other functions where this patch added 'return', and those seem to be fine. Is it by any chance related to the Perl version or some other setting? If not, then we should do something about it. [1] - https://perlmaven.com/how-to-return-undef-from-a-function [2] - http://search.cpan.org/dist/Perl-Critic/lib/Perl/Critic/Policy/Subroutines/RequireFinalReturn.pm -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:17 PM, David Rowley wrote: On 6 June 2018 at 01:09, Andres Freund wrote: On 2018-06-06 01:06:39 +1200, David Rowley wrote: My concern is that only accounting memory for the group and not the state is only solving half the problem. It might be fine for aggregates that don't stray far from their aggtransspace, but for the other ones, we could still see OOM. If solving the problem completely is too hard, then a half fix (maybe 3/4) is better than nothing, but if we can get a design for a full fix before too much work is done, then isn't that better? I don't think we actually disagree. I was really primarily talking about the case where we can't really do better because we don't have serialization support. I mean we could just rescan from scratch, using a groupagg, but that obviously sucks. I don't think we do. To take yours to the 100% solution might just take adding the memory accounting to palloc that Jeff proposed a few years ago and use that accounting to decide when we should switch method. However, I don't quite fully recall how the patch accounted for memory consumed by sub-contexts and if getting the entire consumption required recursively looking at subcontexts. If that's the case then checking the consumption would likely cost too much if it was done after each tuple was aggregated. Yeah, a simple wrapper would not really fix the issue, because allocating memory in the subcontext would not update the accounting information. So we'd be oblivious of possibly large amounts of memory. I don't quite see how is this related to (not) having serialization support, as it's more about not knowing we already hit the memory limit. That being said, I don't think array_agg actually has the issue, at least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave differently in aggregate and non-aggregate contexts, IIRC). I don't know how common this issue is, though. I don't think any built-in aggregates create additional contexts, but I may be wrong. But we have this damn extensibility thing, where users can write their own aggregates ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 01:09, Andres Freund wrote: > On 2018-06-06 01:06:39 +1200, David Rowley wrote: >> My concern is that only accounting memory for the group and not the >> state is only solving half the problem. It might be fine for >> aggregates that don't stray far from their aggtransspace, but for the >> other ones, we could still see OOM. > >> If solving the problem completely is too hard, then a half fix (maybe >> 3/4) is better than nothing, but if we can get a design for a full fix >> before too much work is done, then isn't that better? > > I don't think we actually disagree. I was really primarily talking > about the case where we can't really do better because we don't have > serialization support. I mean we could just rescan from scratch, using > a groupagg, but that obviously sucks. I don't think we do. To take yours to the 100% solution might just take adding the memory accounting to palloc that Jeff proposed a few years ago and use that accounting to decide when we should switch method. However, I don't quite fully recall how the patch accounted for memory consumed by sub-contexts and if getting the entire consumption required recursively looking at subcontexts. If that's the case then checking the consumption would likely cost too much if it was done after each tuple was aggregated. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 02:49 PM, Andres Freund wrote: Hi, On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote: My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. Yea, that's the case I was thinking of where going to sorting will very likely have better performance. I think it'd even be sensible to have a "skew tuple" like optimization. When we detect that we're getting close to memory exhaustion, move new groups to the tuplesort, but already hashed tuples stay in the hashtable. That'd still need tuples being moved to the sort in the cases where the transition values get too big (say array_agg), but that should be comparatively rare. I'm sure we could do better in selecting the hash-tabled values than just taking the first incoming ones, but that shouldn't be too bad. Not sure. I'd imagine the "compression" due to aggregating many tuples into a much smaller amount of memory would be a clear win, so I don't find the "let's just dump all input tuples into a file" approach very attractive. I think evicting a fraction of the aggregate state (say, ~25%) would work better - choosing the "oldest" items seems OK, as those likely represent many input tuples (thus having a high "compaction" ratio). Or we could evict items representing rare groups, to make space for the common ones. Perhaps a combined eviction strategy (10% of each type) would work well. I'm conveniently ignoring the fact that we don't have the information to determine this at the moment, of course. I'm sure it's possible to make better decisions based on some cost estimates, both at plan time and then during execution. That being said, I don't want to over-think / over-engineer this. Getting something that reduces the risk of OOM in the first step is a good enough improvement. If we can do something better next, great. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
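To make the eviction idea concrete, here is a deliberately simplified sketch of evicting the oldest quarter of the groups; the structures are toys invented for illustration, not anything from the executor.

#include <stdlib.h>

typedef struct Group
{
    unsigned long last_touched; /* advance-time stamp; higher means more recent */
    int         evicted;        /* set once the state has been spilled */
    /* ... the aggregate transition state would live here ... */
} Group;

static int
cmp_by_age(const void *a, const void *b)
{
    const Group *ga = *(const Group *const *) a;
    const Group *gb = *(const Group *const *) b;

    if (ga->last_touched < gb->last_touched)
        return -1;
    if (ga->last_touched > gb->last_touched)
        return 1;
    return 0;
}

/*
 * Evict the oldest quarter of the groups.  A real implementation would
 * serialize each evicted group's transition state to a spill file and
 * remove it from the hash table; here we only mark it.
 */
static void
evict_oldest_quarter(Group **groups, size_t ngroups)
{
    size_t      to_evict = ngroups / 4;

    qsort(groups, ngroups, sizeof(Group *), cmp_by_age);
    for (size_t i = 0; i < to_evict; i++)
        groups[i]->evicted = 1;
}

The "oldest first" heuristic rests on the assumption spelled out above: groups that have been around the longest have typically absorbed the most input rows, so spilling them wastes the least of the compaction already achieved.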
Re: commitfest 2018-07
On 2018-06-04 23:32:18 -0400, Tom Lane wrote: > Michael Paquier writes: > > On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: > >> There were some discussions about renaming the existing 2018-09 entry > >> versus inserting a new one at -07 and requiring patches to be moved back > >> explicitly. > > > I would do that to reduce unnecessary log noise, but I was unsure of the > > actual status we are at. I am pretty sure that nobody is going to > > complain if what they submitted gets looked up two months earlier than > > what was previously planned, so I would vote to rename the existing > > 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to > > create three new CF entries. > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > up postponing some entries back to -09, but that seems like less churn > than the other way. I'd rather create a new 2018-07, and just manually move old patches to it. Otherwise we'll not really focus on the glut of old things, but everyone just restarts working on their own new thing. Greetings, Andres Freund
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-06 01:06:39 +1200, David Rowley wrote: > On 6 June 2018 at 00:57, Andres Freund wrote: > > I think it's ok to only handle this gracefully if serialization is > > supported. > > > > But I think my proposal to continue use a hashtable for the already > > known groups, and sorting for additional groups would largely address > > that largely, right? We couldn't deal with groups becoming too large, > > but easily with the number of groups becoming too large. > > My concern is that only accounting memory for the group and not the > state is only solving half the problem. It might be fine for > aggregates that don't stray far from their aggtransspace, but for the > other ones, we could still see OOM. > If solving the problem completely is too hard, then a half fix (maybe > 3/4) is better than nothing, but if we can get a design for a full fix > before too much work is done, then isn't that better? I don't think we actually disagree. I was really primarily talking about the case where we can't really do better because we don't have serialization support. I mean we could just rescan from scratch, using a groupagg, but that obviously sucks. Greetings, Andres Freund
Re: commitfest 2018-07
Greetings, * Ashutosh Bapat (ashutosh.ba...@enterprisedb.com) wrote: > On Tue, Jun 5, 2018 at 9:02 AM, Tom Lane wrote: > > Michael Paquier writes: > >> On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: > >>> There were some discussions about renaming the existing 2018-09 entry > >>> versus inserting a new one at -07 and requiring patches to be moved back > >>> explicitly. > > > >> I would do that to reduce unnecessary log noise, but I was unsure of the > >> actual status we are at. I am pretty sure that nobody is going to > >> complain if what they submitted gets looked up two months earlier than > >> what was previously planned, so I would vote to rename the existing > >> 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to > >> create three new CF entries. > > > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > > up postponing some entries back to -09, but that seems like less churn > > than the other way. > > Notes at [1] about keeping this commitfest for small patches. Just > renaming the commitfest would mean all the patches, big and small, can > be reviewed and committed. "Yes and no." While there were concerns raised that larger patches committed earlier might cause back-patching issues, the general consensus from my read of it was that it wasn't likely to be that big of an issue. While I suspect we'll still generally focus on getting smaller changes in during the 2018-07 cycle, to try and "clear the way" for the larger patches, I'm no longer concerned about larger patches which are ready being committed during that cycle. As such, I'm also +1 on the proposal to rename 2018-09 to 2018-07, and make the other renames and add a new one to the end. > [1] https://wiki.postgresql.org/wiki/PgCon_2018_Developer_Meeting Thanks! Stephen signature.asc Description: PGP signature
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 00:57, Andres Freund wrote: > On 2018-06-06 00:53:42 +1200, David Rowley wrote: >> On 6 June 2018 at 00:45, Andres Freund wrote: >> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: >> >> I wonder if an aggregate might use a custom context >> >> internally (I don't recall anything like that). The accounting capability >> >> seems potentially useful for other places, and those might not use >> >> AllocSet >> >> (or at least not directly). >> > >> > Yea, that seems like a big issue. >> >> Unfortunately, at least one of the built-in ones do. See initArrayResultArr. > > I think it's ok to only handle this gracefully if serialization is > supported. > > But I think my proposal to continue use a hashtable for the already > known groups, and sorting for additional groups would largely address > that largely, right? We couldn't deal with groups becoming too large, > but easily with the number of groups becoming too large. My concern is that only accounting memory for the group and not the state is only solving half the problem. It might be fine for aggregates that don't stray far from their aggtransspace, but for the other ones, we could still see OOM. If solving the problem completely is too hard, then a half fix (maybe 3/4) is better than nothing, but if we can get a design for a full fix before too much work is done, then isn't that better? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote: > On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik > wrote: > > Workload is combination of inserts and selects. > > Looks like shared locks obtained by select cause starvation of inserts, > > trying to get exclusive relation extension lock. > > The problem is fixed by fair lwlock patch, implemented by Alexander > > Korotkov. This patch prevents granting of shared lock if wait queue is not > > empty. > > May be we should use this patch or find some other way to prevent > > starvation of writers on relation extension locks for such workloads. > > Fair lwlock patch really fixed starvation of exclusive lwlock waiters. > But that starvation happens not on relation extension lock – selects > don't get shared relation extension lock. The real issue there was > not relation extension lock itself, but the time spent inside this > lock. Yea, that makes a lot more sense to me. > It appears that buffer replacement happening inside relation > extension lock is affected by starvation on exclusive buffer mapping > lwlocks and buffer content lwlocks, caused by many concurrent shared > lockers. So, fair lwlock patch have no direct influence to relation > extension lock, which is naturally not even lwlock... Yea, that makes sense. I wonder how much the fix here is to "pre-clear" a victim buffer, and how much is a saner buffer replacement implementation (either by going away from O(NBuffers), or by having a queue of clean victim buffers like my bgwriter replacement). > I'll post fair lwlock path in a separate thread. It requires detailed > consideration and benchmarking, because there is a risk of regression > on specific workloads. I bet that doing it naively will regress massively in a number of cases. Greetings, Andres Freund
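To make the fairness policy described above concrete, here is a self-contained pthread sketch of a shared/exclusive lock that refuses new shared acquisitions while an exclusive waiter is queued. It only illustrates the policy; it is not the proposed LWLock change.

#include <pthread.h>

typedef struct FairRWLock
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int         shared_holders;
    int         exclusive_waiters;
    int         exclusive_held;
} FairRWLock;

#define FAIR_RWLOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

void
fair_lock_shared(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    /* Do not admit new shared holders while an exclusive waiter is queued,
     * so a continuous stream of readers cannot starve a writer. */
    while (l->exclusive_held || l->exclusive_waiters > 0)
        pthread_cond_wait(&l->cond, &l->mutex);
    l->shared_holders++;
    pthread_mutex_unlock(&l->mutex);
}

void
fair_unlock_shared(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    if (--l->shared_holders == 0)
        pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

void
fair_lock_exclusive(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->exclusive_waiters++;
    while (l->exclusive_held || l->shared_holders > 0)
        pthread_cond_wait(&l->cond, &l->mutex);
    l->exclusive_waiters--;
    l->exclusive_held = 1;
    pthread_mutex_unlock(&l->mutex);
}

void
fair_unlock_exclusive(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->exclusive_held = 0;
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

The obvious trade-off is that read-mostly workloads lose some shared-lock throughput, which is exactly the regression risk on specific workloads mentioned above.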
Re: plans for PostgreSQL 12
Hi, On 2018-06-05 06:32:31 +0200, Pavel Stehule wrote: > ./configure --with-libxml --enable-tap-tests --enable-debug --with-perl > CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" > > [pavel@nemesis postgresql]$ gcc --version > gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1) > > I executed a simple script > > do $$ declare i bigint = 1; s bigint = 0; begin while i <= 1 loop > s := s + i; i := i + 1; end loop; raise notice '%', s; end $$; > >7,68% postmaster postgres [.] > GetSnapshotData >7,53% postmaster plpgsql.so [.] > exec_eval_simple_expr >6,49% postmaster postgres [.] It seems to me the right fix here isn't a new class of functions, but rather support for delaying the computation of the snapshot to the point it's needed. That'll be far more generically applicable and doesn't require user interaction. > ExecInterpExpr >4,13% postmaster postgres [.] So we're going to need to optimize this further as well, I've a pending patch for that, luckily ;) > LWLockRelease >4,12% postmaster postgres [.] That's also GetSnapshotData()... Greetings, Andres Freund
Re: Spilling hashed SetOps and aggregates to disk
On 2018-06-06 00:53:42 +1200, David Rowley wrote: > On 6 June 2018 at 00:45, Andres Freund wrote: > > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: > >> I wonder if an aggregate might use a custom context > >> internally (I don't recall anything like that). The accounting capability > >> seems potentially useful for other places, and those might not use AllocSet > >> (or at least not directly). > > > > Yea, that seems like a big issue. > > Unfortunately, at least one of the built-in ones does. See initArrayResultArr. I think it's ok to only handle this gracefully if serialization is supported. But I think my proposal to continue using a hashtable for the already known groups, and sorting for additional groups, would largely address that, right? We couldn't deal with groups becoming too large, but easily with the number of groups becoming too large. Greetings, Andres Freund
Re: why partition pruning doesn't work?
> On 5 June 2018 at 12:31, Amit Langote wrote: > > doesn't look quite right. What says expr is really a Param? The patch > appears to work because, by setting pinfo->execparams to *something*, it > triggers execution-time pruning to run; its contents aren't necessarily > used during execution pruning. In fact, it would've crashed if the > execution-time pruning code had required execparams to contain *valid* > param id, but currently it doesn't. > > What I think we'd need to do to make this work is to make execution-time > pruning be invoked even if there aren't any Params involved. IOW, let's > try to teach make_partition_pruneinfo that it can go ahead also in the > cases where there are expressions being compared with the partition key > that contain (only) stable functions. Then, go and fix the > execution-pruning code to not *always* expect there to be Params to prune > with. Yeah, I agree - I copied this approach mindlessly from the original hacky patch. So, it looks like it's necessary to have something like got_stable_expr together with gotparam. And after that, the only place where I see Params in use is partkey_datum_from_expr, where all the stuff is actually evaluated. So apparently this part about "fix the execution-pruning code to not *always* expect there to be Params to prune with" will only be about this function - am I correct, or is there something else that I missed?
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 00:45, Andres Freund wrote: > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: >> I wonder if an aggregate might use a custom context >> internally (I don't recall anything like that). The accounting capability >> seems potentially useful for other places, and those might not use AllocSet >> (or at least not directly). > > Yea, that seems like a big issue. Unfortunately, at least one of the built-in ones does. See initArrayResultArr. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote: > My concern is more about what happens when the input tuple ordering is > inherently incompatible with the eviction strategy, greatly increasing the > amount of data written to disk during evictions. > > Say for example that we can fit 1000 groups into work_mem, and that there > are 2000 groups in the input data set. If the input is correlated with the > groups, everything is peachy because we'll evict the first batch, and then > group the remaining 1000 groups (or something like that). But if the input > data is random (which can easily happen, e.g. for IP addresses, UUIDs and > such) we'll hit the limit repeatedly and will evict much sooner. > I know you suggested simply dumping the whole hash table and starting from > scratch while we talked about this at pgcon, but ISTM it has exactly this > issue. Yea, that's the case I was thinking of where going to sorting will very likely have better performance. I think it'd even be sensible to have a "skew tuple"-like optimization. When we detect that we're getting close to memory exhaustion, move new groups to the tuplesort, but already hashed tuples stay in the hashtable. That'd still need tuples being moved to the sort in the cases where the transition values get too big (say array_agg), but that should be comparatively rare. I'm sure we could do better in selecting the hash-tabled values than just taking the first incoming ones, but that shouldn't be too bad. Greetings, Andres Freund
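As a standalone illustration of the strategy sketched above: already-seen groups stay in a bounded in-memory table, while tuples for groups that no longer fit are spilled, sorted, and aggregated as runs afterwards. Integer keys and count(*) stand in for real grouping columns and transition values; none of this is nodeAgg.c, and all names are invented.

#include <stdio.h>
#include <stdlib.h>

#define MAX_HASH_GROUPS 4        /* stand-in for the work_mem limit */

typedef struct Group { int key; long count; } Group;

/* trivial fixed-capacity table standing in for the aggregate hash table */
static Group hash_groups[MAX_HASH_GROUPS];
static int  nhash_groups = 0;

static int *spill = NULL;
static int  nspill = 0, spill_cap = 0;

static Group *
lookup_group(int key)
{
    for (int i = 0; i < nhash_groups; i++)
        if (hash_groups[i].key == key)
            return &hash_groups[i];
    if (nhash_groups < MAX_HASH_GROUPS)
    {
        Group *g = &hash_groups[nhash_groups++];

        g->key = key;
        g->count = 0;
        return g;
    }
    return NULL;                 /* no room for a new group: caller must spill */
}

static void
spill_tuple(int key)
{
    if (nspill == spill_cap)
    {
        spill_cap = spill_cap ? spill_cap * 2 : 64;
        spill = realloc(spill, spill_cap * sizeof(int));
    }
    spill[nspill++] = key;
}

static int
cmp_int(const void *a, const void *b)
{
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

int
main(void)
{
    int input[] = {7, 3, 7, 9, 1, 5, 3, 8, 8, 5, 9, 2, 7, 2};
    int ninput = sizeof(input) / sizeof(input[0]);

    for (int i = 0; i < ninput; i++)
    {
        Group *g = lookup_group(input[i]);

        if (g)
            g->count++;             /* known (or still-fitting) group: stay hashed */
        else
            spill_tuple(input[i]);  /* new group past the limit: spill it */
    }

    for (int i = 0; i < nhash_groups; i++)
        printf("group %d: %ld (hashed)\n", hash_groups[i].key, hash_groups[i].count);

    /* sort the spilled tuples and aggregate equal keys as runs */
    qsort(spill, nspill, sizeof(int), cmp_int);
    for (int i = 0; i < nspill;)
    {
        int  key = spill[i];
        long count = 0;

        while (i < nspill && spill[i] == key)
        {
            count++;
            i++;
        }
        printf("group %d: %ld (spilled)\n", key, count);
    }
    free(spill);
    return 0;
}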
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: > But I don't think we've considered copying the whole AllocSet. Maybe that > would work, not sure. Don't think you'd need to fully copy it - you can just have a "wrapper" memory context implementation that does the accounting and then hands the actual work to AllocSet. That limits some things it can account for, but I don't see those being really relevant. > I wonder if an aggregate might use a custom context > internally (I don't recall anything like that). The accounting capability > seems potentially useful for other places, and those might not use AllocSet > (or at least not directly). Yea, that seems like a big issue. Greetings, Andres Freund
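The wrapper idea reduces to "account, then delegate". A minimal standalone sketch, with malloc standing in for AllocSet and invented names throughout:

#include <stdio.h>
#include <stdlib.h>

typedef struct AccountingContext
{
    size_t  total_allocated;        /* bytes currently accounted for */
} AccountingContext;

/* Each chunk remembers its own size so the free path can subtract it again. */
static void *
acct_alloc(AccountingContext *ctx, size_t size)
{
    size_t *chunk = malloc(sizeof(size_t) + size);

    if (chunk == NULL)
        return NULL;
    *chunk = size;
    ctx->total_allocated += size;   /* the accounting part */
    return chunk + 1;               /* the delegation part: malloc did the real work */
}

static void
acct_free(AccountingContext *ctx, void *ptr)
{
    size_t *chunk = (size_t *) ptr - 1;

    ctx->total_allocated -= *chunk;
    free(chunk);
}

int
main(void)
{
    AccountingContext ctx = {0};
    void   *a = acct_alloc(&ctx, 128);
    void   *b = acct_alloc(&ctx, 1024);

    printf("accounted: %zu bytes\n", ctx.total_allocated);  /* 1152 */
    acct_free(&ctx, a);
    printf("accounted: %zu bytes\n", ctx.total_allocated);  /* 1024 */
    acct_free(&ctx, b);
    return 0;
}

A real implementation would presumably keep the counter in the memory context itself and forward the remaining callbacks to AllocSet, which is also the limitation mentioned above: the wrapper can only account for what passes through it.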
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-04 22:18:56 -0700, Jeff Davis wrote: > On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote: > > I wonder whether, at least for aggregates, the better fix wouldn't be > > to > > switch to feeding the tuples into tuplesort upon memory exhaustion > > and > > doing a sort based aggregate. We have most of the infrastructure to > > do > > That's an interesting idea, but it seems simpler to stick to hashing > rather than using a combination strategy. It also seems like it would > take less CPU effort. Isn't the locality of access going to considerably better with the sort based approach? > What advantages do you have in mind? My patch partitions the spilled > data, so it should have similar disk costs as a sort approach. I think one part of it is that I think the amount of code is going to be lower - we essentially have already all the code to handle sort based aggs, and to have both sort and hash based aggs in the same query. We'd mostly need a way to scan the hashtable and stuff it into a tuplesort, that's not hard. nodeAgg.c is already more than complex enough, I'm not sure that full blown partitioning is worth the cost. Greetings, Andres Freund
Re: commitfest 2018-07
On Tue, Jun 5, 2018 at 9:02 AM, Tom Lane wrote: > Michael Paquier writes: >> On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: >>> There were some discussions about renaming the existing 2018-09 entry >>> versus inserting a new one at -07 and requiring patches to be moved back >>> explicitly. > >> I would do that to reduce unnecessary log noise, but I was unsure of the >> actual status we are at. I am pretty sure that nobody is going to >> complain if what they submitted gets looked up two months earlier than >> what was previously planned, so I would vote to rename the existing >> 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to >> create three new CF entries. > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > up postponing some entries back to -09, but that seems like less churn > than the other way. > Notes at [1] about keeping this commitfest for small patches. Just renaming the commitfest would mean all the patches, big and small, can be reviewed and committed. [1] https://wiki.postgresql.org/wiki/PgCon_2018_Developer_Meeting -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: \d t: ERROR: XX000: cache lookup failed for relation
Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1'; Finally, I can reproduce it with the attached script. INSERT 0 99 <- first insertion ERROR: cache lookup failed for relation 1032219 ALTER TABLE ERROR: cache lookup failed for relation 1033478 ALTER TABLE ERROR: cache lookup failed for relation 1034073 ALTER TABLE ERROR: cache lookup failed for relation 1034650 ALTER TABLE ERROR: cache lookup failed for relation 1035238 ALTER TABLE ERROR: cache lookup failed for relation 1035837 I will investigate. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 05.06.2018 13:29, Masahiko Sawada wrote: On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik wrote: On 05.06.2018 07:22, Masahiko Sawada wrote: On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik wrote: On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada wrote: Never mind. There was a lot of items especially at the last CommitFest. In terms of moving forward, I'd still like to hear what Andres has to say about the comments I made on March 1st. Yeah, agreed. $ ping -n andres.freund Request timeout for icmp_seq 0 Request timeout for icmp_seq 1 Request timeout for icmp_seq 2 Request timeout for icmp_seq 3 Request timeout for icmp_seq 4 ^C --- andres.freund ping statistics --- 6 packets transmitted, 0 packets received, 100.0% packet loss Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru shows that this patch has some benefits for other cases, which is a point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () #4 0x004ba99d in RelationGetBufferForTuple () #5 0x004b3f18 in heap_insert () #6 0x006109c8 in ExecInsert () #7 0x00611a49 in ExecModifyTable () #8 0x005ef97a in standard_ExecutorRun () #9 0x0072440a in ProcessQuery () #10 0x00724631 in PortalRunMulti () #11 0x007250ec in PortalRun () #12 0x00721287 in exec_simple_query () #13 0x00722532 in PostgresMain () #14 0x0047a9eb in ServerLoop () #15 0x006b9fe9 in PostmasterMain () #16 0x0047b431 in main () Obviously there is nothing surprising here: if a lot of processes try to acquire the same exclusive lock, then high contention is expected. I just want to notice that this patch is not able to completely eliminate the problem with large number of concurrent inserts to the same table. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. Thank you for reporting. 
Regarding the second problem, I tried to reproduce that bug with latest version patch (v13) but could not. When transaction aborts, we call ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() and clear either relext lock bits we are holding or waiting. If we raise an error after we added a relext lock bit but before we increment its holding count then the relext lock is remained, but I couldn't see the code raises an error between them. Could you please share the concrete reproduction Sorry, my original guess that LW-locks are not released in case of transaction abort is not correct. There was really situation when all backends were blocked in relation extension lock and looks like it happens after disk write error, You're saying that it is not correct that LWlock are not released but it's correct that all backends were blocked in relext lock, but in your other mail you're saying the opposite. Which is correct? I am sorry for the confusion. I have not investigated the core files myself and am just sharing information received from our engineer. It looks like this problem may not be related to relation extension locks at all. Sorry for the false alarm.
AtEOXact_ApplyLauncher() and subtransactions
Hi, When a SUBSCRIPTION is altered, then the currently running table-synchronization workers that are no longer needed for the altered subscription, are terminated. This is done by the function AtEOXact_ApplyLauncher() inside CommitTransaction(). So during each ALTER-SUBSCRIPTION command, the on_commit_stop_workers list is appended with new set of workers to be stopped. And then at commit, AtEOXact_ApplyLauncher() stops the workers in the list. But there is no handling for sub-transaction abort. Consider this : -- On secondary, create a subscription that will initiate the table sync CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres host=localhost user=rep password=Password port=5432' PUBLICATION mypub; -- While the sync is going on, change the publication of the -- subscription in a subtransaction, so that it adds up the synchronization -- workers for the earlier publication into the 'on_commit_stop_workers' list select pg_sleep(1); begin; savepoint a; ALTER SUBSCRIPTION mysub SET PUBLICATION mypub_2; rollback to a; -- Commit will stop the above sync. commit; So the ALTER-SUBSCRIPTION has been rolled back. But the on_commit_stop_workers is not reverted back to its earlier state. And then at commit, the workers will be killed unnecessarily. I have observed that when the workers are killed, they are restarted. But consider the case where the original subscription was synchronizing hundreds of tables. And just when it is about to finish, someone changes the subscription inside a subtransaction and rolls it back, and then commits the transaction. The whole synchronization gets re-started from scratch. Below log snippets show this behaviour : < CREATE-SUBSCRIPTION command spawns workers > 2018-06-05 15:04:07.841 IST [39951] LOG: logical replication apply worker for subscription "mysub" has started 2018-06-05 15:04:07.848 IST [39953] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < After some time, ALTER-SUBSCRIPTION rollbacks inside subtransaction, and commits transaction. Workers are killed > 2018-06-05 15:04:32.903 IST [39953] FATAL: terminating logical replication worker due to administrator command 2018-06-05 15:04:32.903 IST [39953] CONTEXT: COPY article, line 37116293 2018-06-05 15:04:32.904 IST [39666] LOG: background worker "logical replication worker" (PID 39953) exited with exit code 1 < Workers are restarted > 2018-06-05 15:04:32.909 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < Synchronization done after some time > 2018-06-05 15:05:10.042 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has finished --- To reproduce the issue : 1. On master server, create the tables and publications using attached setup_master.sql. 2. On secondary server, run the attached run_on_secondary.sql. This will reproduce the issue as can be seen from the log output. --- Fix : I haven't written a patch for it, but I think we should have a separate on_commit_stop_workers for each subtransaction. At subtransaction commit, we replace the on_commit_stop_workers list of the parent subtransaction with the one from the committed subtransaction; and on abort, discard the list of the current subtransaction. So have a stack of the lists. 
--- Furthermore, before fixing this, we may have to fix another issue which, I suspect, would surface even without subtransactions, as explained below : Suppose publication pubx is set for tables tx1 and and tx2. And publication puby is for tables ty1 and ty2. Subscription mysub is set to synchronise tables tx1 and tx2 : CREATE SUBSCRIPTION mysub ... PUBLICATION pubx; Now suppose the subscription is altered to synchronise ty1 and ty2, and then again altered back to synchronise tx1 and tx2 in the same transaction. begin; ALTER SUBSCRIPTION mysub set publication puby; ALTER SUBSCRIPTION mysub set publication pubx; commit; Here, workers for tx1 and tx2 are added to on_commit_stop_workers after the publication is set to puby. And then workers for ty1 and ty2 are further added to that list after the 2nd ALTER command where publication is set to pubx, because the earlier ALTER command has already changed the catalogs to denote that ty1 and ty2 are being synchronised. Effectively, at commit, all the workers are targetted to be stopped, when actually at commit time there is no final change in the tables to be synchronised. What actually we should do is : for each ALTER command we should compare the very first set of tables being synchronised since the last committed ALTER command, rather than checking the catalogs. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company run_on_secondary.sql Description: Binary data setup_master.sql Description: Binary data
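The proposed fix (one pending-stop list per subtransaction level, merged into the parent on subcommit and discarded on abort) can be modeled with a toy stack. Workers are plain ints here and every name is invented; this is not the launcher code.

#include <stdio.h>

#define MAX_NEST    16
#define MAX_WORKERS 32

typedef struct WorkerList
{
    int workers[MAX_WORKERS];
    int n;
} WorkerList;

static WorkerList stop_workers[MAX_NEST];   /* one list per nesting level */
static int  nest_level = 0;                 /* 0 = top-level transaction */

static void
remember_worker_to_stop(int worker)
{
    WorkerList *l = &stop_workers[nest_level];

    l->workers[l->n++] = worker;
}

static void
at_subxact_start(void)
{
    stop_workers[++nest_level].n = 0;
}

static void
at_subxact_commit(void)
{
    WorkerList *child = &stop_workers[nest_level];
    WorkerList *parent = &stop_workers[nest_level - 1];

    /* the child's pending stops survive: push them up to the parent */
    for (int i = 0; i < child->n; i++)
        parent->workers[parent->n++] = child->workers[i];
    nest_level--;
}

static void
at_subxact_abort(void)
{
    /* a rolled-back ALTER SUBSCRIPTION: forget its pending stops */
    stop_workers[nest_level--].n = 0;
}

static void
at_xact_commit(void)
{
    WorkerList *l = &stop_workers[0];

    for (int i = 0; i < l->n; i++)
        printf("stopping worker %d\n", l->workers[i]);
    l->n = 0;
}

int
main(void)
{
    at_subxact_start();             /* savepoint a */
    remember_worker_to_stop(101);   /* ALTER SUBSCRIPTION ... SET PUBLICATION */
    at_subxact_abort();             /* rollback to a */
    at_xact_commit();               /* commit: nothing gets stopped */
    return 0;
}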
Re: pg_replication_slot_advance to return NULL instead of 0/0 if slot not advanced
On 05/06/18 06:28, Michael Paquier wrote: > On Mon, Jun 04, 2018 at 11:51:35AM +0200, Petr Jelinek wrote: >> On 01/06/18 21:13, Michael Paquier wrote: >>> -startlsn = MyReplicationSlot->data.confirmed_flush; >>> +if (OidIsValid(MyReplicationSlot->data.database)) >>> +startlsn = MyReplicationSlot->data.confirmed_flush; >>> +else >>> +startlsn = MyReplicationSlot->data.restart_lsn; >>> + >>> if (moveto < startlsn) >>> { >>> ReplicationSlotRelease(); >> >> This part looks correct for checking that we are not moving >> backwards. However, there is another existing issue with this code >> which >> is that we are later using the confirmed_flush (via startlsn) as start >> point of logical decoding (XLogReadRecord parameter in >> pg_logical_replication_slot_advance) which is not correct. The >> restart_lsn should be used for that. I think it would make sense to >> fix >> that as part of this patch as well. > > I am not sure I understand what you are coming at here. Could you > explain your point a bit more please as 9c7d06d is yours? As I said, it's an existing issue. I just had a chance to reread the code when looking at your patch. > When creating > the decoding context, all other code paths use the confirmed LSN as a > start point if not explicitly defined by the caller of > CreateDecodingContext, as it points to the last LSN where a transaction > has been committed and decoded. I didn't say anything about CreateDecodingContext though. I am talking about the fact that we then use the same variable as input to XLogReadRecord later in the logical slot code path. The whole point of having restart_lsn for a logical slot is to have a correct start point for XLogReadRecord, so using the confirmed_flush there is wrong (and has been wrong since the original commit). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: why partition pruning doesn't work?
Hi Dmitry, Thanks for creating the patch. I looked at it and have some comments. On 2018/06/04 22:30, Dmitry Dolgov wrote: >> On 3 June 2018 at 19:11, Tom Lane wrote: >> Dmitry Dolgov <9erthali...@gmail.com> writes: >>> Just to clarify for myself, for evaluating any stable function here would >>> it be >>> enough to handle all function-like expressions (FuncExpr / OpExpr / >>> DistinctExpr / NullIfExpr) and check a corresponding function for >>> provolatile, >>> like in the attached patch? >> >> I think the entire approach is wrong here. Rather than concerning >> yourself with Params, or any other specific expression type, you >> should be using !contain_volatile_functions() to decide whether >> an expression is run-time-constant. If it is, use the regular >> expression evaluation machinery to extract the value. > > Yes, it makes sense. Then, to my understanding, the attached code is close to > what was described above (although I'm not sure about the Const part). This: @@ -2727,6 +2730,13 @@ pull_partkey_params(PartitionPruneInfo *pinfo, List *steps) } gotone = true; } +else if (!IsA(expr, Const)) +{ +Param *param = (Param *) expr; +pinfo->execparams = bms_add_member(pinfo->execparams, + param->paramid); +gotone = true; +} doesn't look quite right. What says expr is really a Param? The patch appears to work because, by setting pinfo->execparams to *something*, it triggers execution-time pruning to run; its contents aren't necessarily used during execution pruning. In fact, it would've crashed if the execution-time pruning code had required execparams to contain *valid* param id, but currently it doesn't. What I think we'd need to do to make this work is to make execution-time pruning be invoked even if there aren't any Params involved. IOW, let's try to teach make_partition_pruneinfo that it can go ahead also in the cases where there are expressions being compared with the partition key that contain (only) stable functions. Then, go and fix the execution-pruning code to not *always* expect there to be Params to prune with. Maybe, David (added to cc) has something to say about all this, especially whether he considers this a feature and not a bug fix. Thanks, Amit
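The rule Tom suggests, and that the discussion converges on, is essentially "prunable if the comparison value contains no volatile functions, Param or not". A toy standalone version of that check, with a made-up expression tree in place of the real Node structures:

#include <stdbool.h>
#include <stdio.h>

typedef enum NodeKind { T_CONST, T_PARAM, T_FUNCEXPR } NodeKind;

typedef struct Expr
{
    NodeKind    kind;
    bool        funcvolatile;       /* only meaningful for T_FUNCEXPR */
    struct Expr *args[2];           /* up to two children, NULL if unused */
} Expr;

/* toy counterpart of contain_volatile_functions(): recurse over the tree */
static bool
contains_volatile(const Expr *e)
{
    if (e == NULL)
        return false;
    if (e->kind == T_FUNCEXPR && e->funcvolatile)
        return true;
    return contains_volatile(e->args[0]) || contains_volatile(e->args[1]);
}

int
main(void)
{
    Expr    c = {T_CONST, false, {NULL, NULL}};
    Expr    stable_fn = {T_FUNCEXPR, false, {&c, NULL}};    /* e.g. now(), stable */
    Expr    volatile_fn = {T_FUNCEXPR, true, {&c, NULL}};   /* e.g. random(), volatile */

    printf("stable expr prunable at execution time: %s\n",
           contains_volatile(&stable_fn) ? "no" : "yes");
    printf("volatile expr prunable at execution time: %s\n",
           contains_volatile(&volatile_fn) ? "no" : "yes");
    return 0;
}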
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik wrote: > > > On 05.06.2018 07:22, Masahiko Sawada wrote: >> >> On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik >> wrote: >>> >>> >>> On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: > > On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada > > wrote: >> >> Never mind. There was a lot of items especially at the last >> CommitFest. >> >>> In terms of moving forward, I'd still like to hear what >>> Andres has to say about the comments I made on March 1st. >> >> Yeah, agreed. > > $ ping -n andres.freund > Request timeout for icmp_seq 0 > Request timeout for icmp_seq 1 > Request timeout for icmp_seq 2 > Request timeout for icmp_seq 3 > Request timeout for icmp_seq 4 > ^C > --- andres.freund ping statistics --- > 6 packets transmitted, 0 packets received, 100.0% packet loss > > Meanwhile, > > https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru > shows that this patch has some benefits for other cases, which is a > point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center >>> We in PostgresProc were faced with lock extension contention problem at >>> two >>> more customers and tried to use this patch (v13) to address this issue. >>> Unfortunately replacing heavy lock with lwlock couldn't completely >>> eliminate >>> contention, now most of backends are blocked on conditional variable: >>> >>> 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 >>> #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 >>> #1 0x007024ee in WaitEventSetWait () >>> #2 0x00718fa6 in ConditionVariableSleep () >>> #3 0x0071954d in RelExtLockAcquire () >>> #4 0x004ba99d in RelationGetBufferForTuple () >>> #5 0x004b3f18 in heap_insert () >>> #6 0x006109c8 in ExecInsert () >>> #7 0x00611a49 in ExecModifyTable () >>> #8 0x005ef97a in standard_ExecutorRun () >>> #9 0x0072440a in ProcessQuery () >>> #10 0x00724631 in PortalRunMulti () >>> #11 0x007250ec in PortalRun () >>> #12 0x00721287 in exec_simple_query () >>> #13 0x00722532 in PostgresMain () >>> #14 0x0047a9eb in ServerLoop () >>> #15 0x006b9fe9 in PostmasterMain () >>> #16 0x0047b431 in main () >>> >>> Obviously there is nothing surprising here: if a lot of processes try to >>> acquire the same exclusive lock, then high contention is expected. >>> I just want to notice that this patch is not able to completely eliminate >>> the problem with large number of concurrent inserts to the same table. 
>>> >>> Second problem we observed was even more critical: if backed is granted >>> relation extension lock and then got some error before releasing this >>> lock, >>> then abort of the current transaction doesn't release this lock (unlike >>> heavy weight lock) and the relation is kept locked. >>> So database is actually stalled and server has to be restarted. >>> >> Thank you for reporting. >> >> Regarding the second problem, I tried to reproduce that bug with >> latest version patch (v13) but could not. When transaction aborts, we >> call >> ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() >> and clear either relext lock bits we are holding or waiting. If we >> raise an error after we added a relext lock bit but before we >> increment its holding count then the relext lock is remained, but I >> couldn't see the code raises an error between them. Could you please >> share the concrete reproduction > > > Sorry, my original guess that LW-locks are not released in case of > transaction abort is not correct. > There was really situation when all backends were blocked in relation > extension lock and looks like it happens after disk write error, You're saying that it is not correct that LWlock are not released but it's correct
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Sat, May 26, 2018 at 12:25 AM, Robert Haas wrote: > On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada > wrote: >> Regarding to API design, should we use 2PC for a distributed >> transaction if both two or more 2PC-capable foreign servers and >> 2PC-non-capable foreign server are involved with it? Or should we end >> up with an error? the 2PC-non-capable server might be either that has >> 2PC functionality but just disables it or that doesn't have it. > > It seems to me that this is functionality that many people will not > want to use. First, doing a PREPARE and then a COMMIT for each FDW > write transaction is bound to be more expensive than just doing a > COMMIT. Second, because the default value of > max_prepared_transactions is 0, this can only work at all if special > configuration has been done on the remote side. Because of the second > point in particular, it seems to me that the default for this new > feature must be "off". It would make to ship a default configuration > of PostgreSQL that doesn't work with the default configuration of > postgres_fdw, and I do not think we want to change the default value > of max_prepared_transactions. It was changed from 5 to 0 a number of > years back for good reason. I'm not sure that many people will not want to use this feature because it seems to me that there are many people who don't want to use the database that is missing transaction atomicity. But I agree that this feature should not be enabled by default as we disable 2PC by default. > > So, I think the question could be broadened a bit: how you enable this > feature if you want it, and what happens if you want it but it's not > available for your choice of FDW? One possible enabling method is a > GUC (e.g. foreign_twophase_commit). It could be true/false, with true > meaning use PREPARE for all FDW writes and fail if that's not > supported, or it could be three-valued, like require/prefer/disable, > with require throwing an error if PREPARE support is not available and > prefer using PREPARE where available but without failing when it isn't > available. Another possibility could be to make it an FDW option, > possibly capable of being set at multiple levels (e.g. server or > foreign table). If any FDW involved in the transaction demands > distributed 2PC semantics then the whole transaction must have those > semantics or it fails. I was previous leaning toward the latter > approach, but I guess now the former approach is sounding better. I'm > not totally certain I know what's best here. > I agree that the former is better. That way, we also can control that parameter at transaction level. If we allow the 'prefer' behavior we need to manage not only 2PC-capable foreign server but also 2PC-non-capable foreign server. It requires all FDW to call the registration function. So I think two-values parameter would be better. BTW, sorry for late submitting the updated patch. I'll post the updated patch in this week but I'd like to share the new APIs design beforehand. APIs that I'd like to add are 4 functions and 1 registration function: PrepareForeignTransaction, CommitForeignTransaction, RollbackForeignTransaction, IsTwoPhaseCommitEnabled and FdwXactRegisterForeignServer. All FDWs that want to support atomic commit have to support all APIs and to call the registration function when foreign transaction opens. Transaction processing sequence with atomic commit will be like follows. 1. FDW begins a transaction on a 2PC-capable foreign server. 2. 
FDW registers the foreign server with/without a foreign transaction identifier by calling FdwXactRegisterForeignServer(). * The foreign transaction identifier passed in can be NULL. If it's NULL, the core code constructs it like 'fx_<4 random chars>__'. * Providing a foreign transaction identifier at the beginning of the transaction is useful because some DBMSs, such as Oracle Database or MySQL, require a transaction identifier at the beginning of an XA transaction. * Registering the foreign transaction guarantees that it is controlled by the core and that the APIs are called at an appropriate time. 3. Perform 1 and 2 whenever the distributed transaction opens a transaction on 2PC-capable foreign servers. * When the distributed transaction modifies a foreign server, we mark it as 'modified'. * This mark is used at commit to check if it's necessary to use 2PC. * At the same time, we also check if the foreign server enables 2PC by calling IsTwoPhaseCommitEnabled(). * If an FDW disables or doesn't provide that function, we set XACT_FLAGS_FDWNONPREPARE. This is necessary because we need to remember that we wrote to a 2PC-non-capable foreign server. * When the distributed transaction modifies a temp table locally, we set XACT_FLAGS_WROTENONTEMREL. * This is used at commit to check if it's necessary to use 2PC as well. 4. During pre-commit, we prepare all foreign transactions, if 2PC is required, by calling PrepareForeignTransaction(
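The commit sequence described above, reduced to a runnable toy: every modified 2PC-capable server is PREPAREd during pre-commit, and only afterwards are the prepared transactions committed. The callback names follow the mail; the signatures and data structures are invented for illustration.

#include <stdbool.h>
#include <stdio.h>

typedef struct ForeignServer
{
    const char *name;
    bool        two_phase;      /* what IsTwoPhaseCommitEnabled() would report */
    bool        modified;       /* did this transaction write to it? */
} ForeignServer;

static void
PrepareForeignTransaction(ForeignServer *s)
{
    printf("PREPARE TRANSACTION on %s\n", s->name);
}

static void
CommitForeignTransaction(ForeignServer *s, bool prepared)
{
    printf("%s on %s\n", prepared ? "COMMIT PREPARED" : "COMMIT", s->name);
}

int
main(void)
{
    /* servers registered via FdwXactRegisterForeignServer() during the xact */
    ForeignServer servers[] = {
        {"shard1", true, true},
        {"shard2", true, true},
        {"readonly", true, false},      /* not modified: a plain commit is enough */
    };
    int nservers = 3;

    /* pre-commit phase: prepare every modified 2PC-capable server */
    for (int i = 0; i < nservers; i++)
        if (servers[i].two_phase && servers[i].modified)
            PrepareForeignTransaction(&servers[i]);

    /* after the local commit: finish the foreign transactions */
    for (int i = 0; i < nservers; i++)
        CommitForeignTransaction(&servers[i],
                                 servers[i].two_phase && servers[i].modified);
    return 0;
}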
Re: Problem while updating a foreign table pointing to a partitioned table on foreign server
Hello. At Mon, 04 Jun 2018 20:58:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20180604.205828.208262556.horiguchi.kyot...@lab.ntt.co.jp> > It fails on some join-pushdown cases since it doesn't add tid > columns to join tlist. I suppose that build_tlist_to_deparse > needs something but I'll consider further tomorrow. I made it work with a few exceptions and bumped. PARAM_EXEC doesn't work at all in a case where Sort exists between ForeignUpdate and ForeignScan. = explain (verbose, costs off) update bar set f2 = f2 + 100 from ( select f1 from foo union all select f1+3 from foo ) ss where bar.f1 = ss.f1; QUERY PLAN - Update on public.bar Update on public.bar Foreign Update on public.bar2 Remote SQL: UPDATE public.loct2 SET f2 = $3 WHERE tableoid = $1 AND ctid = $2 ... -> Merge Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, (ROW(foo.f1)) Merge Cond: (bar2.f1 = foo.f1) -> Sort Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid Sort Key: bar2.f1 -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid, tableoid FROM public.loct2 FOR UPDATE = Even if this worked fine, it cannot be back-patched. We need an extra storage moves together with tuples or prevent sorts or something like from being inserted there. At Fri, 1 Jun 2018 10:21:39 -0400, Ashutosh Bapat wrote in > I am not suggesting to commit 0003 in my patch set, but just 0001 and > 0002 which just raise an error when multiple rows get updated when > only one row is expected to be updated. So I agree to commit the two at least in order to prevent doing wrong silently. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d272719ff4..bff216f29d 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -1049,9 +1049,16 @@ deparseSelectSql(List *tlist, bool is_subquery, List **retrieved_attrs, * can use NoLock here. */ Relation rel = heap_open(rte->relid, NoLock); + Bitmapset *attrs = fpinfo->attrs_used; + + if (root->parse->commandType != CMD_UPDATE && + root->parse->commandType != CMD_DELETE) + attrs = bms_del_member(bms_copy(attrs), + TableOidAttributeNumber - + FirstLowInvalidHeapAttributeNumber); deparseTargetList(buf, rte, foreignrel->relid, rel, false, - fpinfo->attrs_used, false, retrieved_attrs); + attrs, false, retrieved_attrs); heap_close(rel, NoLock); } } @@ -1107,11 +1114,17 @@ deparseTargetList(StringInfo buf, bool qualify_col, List **retrieved_attrs) { + static int check_attrs[4]; + static char *check_attr_names[] = {"ctid", "oid", "tableoid"}; TupleDesc tupdesc = RelationGetDescr(rel); bool have_wholerow; bool first; int i; + check_attrs[0] = SelfItemPointerAttributeNumber; + check_attrs[1] = ObjectIdAttributeNumber; + check_attrs[2] = TableOidAttributeNumber; + check_attrs[3] = FirstLowInvalidHeapAttributeNumber; *retrieved_attrs = NIL; /* If there's a whole-row reference, we'll need all the columns. */ @@ -1143,13 +1156,16 @@ deparseTargetList(StringInfo buf, } } - /* - * Add ctid and oid if needed. We currently don't support retrieving any - * other system columns. - */ - if (bms_is_member(SelfItemPointerAttributeNumber - FirstLowInvalidHeapAttributeNumber, - attrs_used)) + for (i = 0 ; check_attrs[i] != FirstLowInvalidHeapAttributeNumber ; i++) { + int attr = check_attrs[i]; + char *attr_name = check_attr_names[i]; + + /* Add system columns if needed. 
*/ + if (!bms_is_member(attr - FirstLowInvalidHeapAttributeNumber, + attrs_used)) + continue; + if (!first) appendStringInfoString(buf, ", "); else if (is_returning) @@ -1158,26 +1174,9 @@ deparseTargetList(StringInfo buf, if (qualify_col) ADD_REL_QUALIFIER(buf, rtindex); - appendStringInfoString(buf, "ctid"); + appendStringInfoString(buf, attr_name); - *retrieved_attrs = lappend_int(*retrieved_attrs, - SelfItemPointerAttributeNumber); - } - if (bms_is_member(ObjectIdAttributeNumber - FirstLowInvalidHeapAttributeNumber, - attrs_used)) - { - if (!first) - appendStringInfoString(buf, ", "); - else if (is_returning) - appendStringInfoString(buf, " RETURNING "); - first = false; - - if (qualify_col) - ADD_REL_QUALIFIER(buf, rtindex); - appendStringInfoString(buf, "oid"); - - *retrieved_attrs = lappend_int(*retrieved_attrs, - ObjectIdAttributeNumber); + *retrieved_attrs = lappend_int(*retrieved_attrs, attr); } /* Don't generate bad syntax if no undropped columns */ @@ -1725,7 +1724,7 @@ deparseUpdateSql(StringInfo buf, Range
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik wrote: > Workload is combination of inserts and selects. > Looks like shared locks obtained by select cause starvation of inserts, > trying to get exclusive relation extension lock. > The problem is fixed by fair lwlock patch, implemented by Alexander Korotkov. > This patch prevents granting of shared lock if wait queue is not empty. > May be we should use this patch or find some other way to prevent starvation > of writers on relation extension locks for such workloads. The fair lwlock patch really did fix starvation of exclusive lwlock waiters. But that starvation doesn't happen on the relation extension lock – selects don't take the relation extension lock in shared mode. The real issue there was not the relation extension lock itself, but the time spent inside this lock. It appears that buffer replacement happening inside the relation extension lock is affected by starvation on exclusive buffer mapping lwlocks and buffer content lwlocks, caused by many concurrent shared lockers. So the fair lwlock patch has no direct influence on the relation extension lock, which naturally isn't even an lwlock... I'll post the fair lwlock patch in a separate thread. It requires detailed consideration and benchmarking, because there is a risk of regression on specific workloads. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 04.06.2018 21:42, Andres Freund wrote: Hi, On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote: We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () That doesn't necessarily mean that the postgres code is to fault here. It's entirely possible that the filesystem or storage is the bottleneck. Could you briefly describe workload & hardware? Workload is combination of inserts and selects. Looks like shared locks obtained by select cause starvation of inserts, trying to get exclusive relation extension lock. The problem is fixed by fair lwlock patch, implemented by Alexander Korotkov. This patch prevents granting of shared lock if wait queue is not empty. May be we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. That obvioulsy needs to be fixed... Sorry, looks like the problem is more obscure than I expected. What we have observed is that all backends are blocked in lwlock (sorry stack trace is not complete): #0 0x7ff5a9c566d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff3c57b9b38) at ../sysdeps/unix/sysv/lin ux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:111 #2 0x7ff5a9c567c8 in __new_sem_wait_slow (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:181 #3 0x7ff5a9c56839 in __new_sem_wait (sem=sem@entry=0x7ff3c57b9b38) at sem_wait.c:42 #4 0x56290c901582 in PGSemaphoreLock (sema=0x7ff3c57b9b38) at pg_sema.c:310 #5 0x56290c97923c in LWLockAcquire (lock=0x7ff3c7038c64, mode=LW_SHARED) at ./build/../src/backend/storage/lmgr/lwlock.c:1233 I happen after error in disk write operation. Unfortunately we do not have core files and not able to reproduce the problem. All LW locks should be cleared by LWLockReleaseAll but ... for some reasons it doesn't happen. We will continue investigation and try to reproduce the problem. I will let you know if we find the reason of the problem. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 05.06.2018 07:22, Masahiko Sawada wrote: On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik wrote: On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada wrote: Never mind. There was a lot of items especially at the last CommitFest. In terms of moving forward, I'd still like to hear what Andres has to say about the comments I made on March 1st. Yeah, agreed. $ ping -n andres.freund Request timeout for icmp_seq 0 Request timeout for icmp_seq 1 Request timeout for icmp_seq 2 Request timeout for icmp_seq 3 Request timeout for icmp_seq 4 ^C --- andres.freund ping statistics --- 6 packets transmitted, 0 packets received, 100.0% packet loss Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru shows that this patch has some benefits for other cases, which is a point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () #4 0x004ba99d in RelationGetBufferForTuple () #5 0x004b3f18 in heap_insert () #6 0x006109c8 in ExecInsert () #7 0x00611a49 in ExecModifyTable () #8 0x005ef97a in standard_ExecutorRun () #9 0x0072440a in ProcessQuery () #10 0x00724631 in PortalRunMulti () #11 0x007250ec in PortalRun () #12 0x00721287 in exec_simple_query () #13 0x00722532 in PostgresMain () #14 0x0047a9eb in ServerLoop () #15 0x006b9fe9 in PostmasterMain () #16 0x0047b431 in main () Obviously there is nothing surprising here: if a lot of processes try to acquire the same exclusive lock, then high contention is expected. I just want to notice that this patch is not able to completely eliminate the problem with large number of concurrent inserts to the same table. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. Thank you for reporting. Regarding the second problem, I tried to reproduce that bug with latest version patch (v13) but could not. 
When transaction aborts, we call ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() and clear either relext lock bits we are holding or waiting. If we raise an error after we added a relext lock bit but before we increment its holding count then the relext lock is remained, but I couldn't see the code raises an error between them. Could you please share the concrete reproduction Sorry, my original guess that LW-locks are not released in case of transaction abort is not correct. There was really situation when all backends were blocked in relation extension lock and looks like it happens after disk write error, but as far as it happens at customer's site, we have no time for investigation and not able to reproduce this problem locally. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On 2018/06/05 16:41, Ashutosh Bapat wrote: > On Tue, Jun 5, 2018 at 1:07 PM, Amit Langote > wrote: >> On 2018/06/05 1:25, Alvaro Herrera wrote: >>> In the meantime, my inclination is to fix the documentation to point out >>> that AFTER triggers are allowed but BEFORE triggers are not. >> >> Wasn't that already fixed by bcded2609ade6? >> >> We don't say anything about AFTER triggers per se, but the following under >> the limitations of partitioned tables: >> >> BEFORE ROW triggers, if necessary, must be defined on individual >> partitions, not the partitioned table. > > Thought correct that suggestion is problematic for upgrades. Probably > users will create triggers using same function on all the partitions. > After the upgrade when we start supporting BEFORE TRIGGER on > partitioned table, the user will have to drop these triggers and > create one trigger on the partitioned table to have the trigger to be > applicable to the new partitions created. Hmm yes, maybe there is scope for rewording then. Thanks, Amit
Re: Performance regression with PostgreSQL 11 and partitioning
On 2018/06/05 13:44, Ashutosh Bapat wrote: > On Tue, Jun 5, 2018 at 5:50 AM, David Rowley wrote: >> On 5 June 2018 at 01:35, Ashutosh Bapat >> wrote: >>> I was wondering if we can get rid of append_rel_list altogether. In >>> your patch, you have saved AppendRelInfos by child_relid. So all the >>> slots indexed by parent relid are empty. We could place AppendRelInfos >>> by parent relid. Thus a given AppendRelInfo would be places at two >>> places, one at the index of child relid and second at the index >>> pointed by parent relid. That's fine even in case of multilevel >>> inheritance since an intermediate parent has two relids one as a >>> parent and other as a child. >>> >>> One problem with that we do not know how long the array would be to >>> start with. But we could keep on repallocing the array to increase its >>> size. >> >> To be honest, the patch was meant as a discussion topic for ways to >> improve the reported regression in PG11. The patch was put together >> fairly quickly. > > I think the idea is brilliant. I do not have objections for trying > something in that direction. I am suggesting that we take this a bit > forward and try to eliminate append_rel_list altogether. Imho, we should try to refrain from implementing something that needs repalloc'ing the array, as long as we're doing it as a fix for PG 11. Also, because getting rid of append_rel_list altogether seems a bit involved. Let the AppendRelInfo's be added to append_rel_list when created, as is done now (expand_inherited_tables), and transfer them to the array when setting up various arrays (setup_simple_rel_arrays) as the David's patch does. >> I've not really considered what happens in the case where an inherited >> table has multiple parents. The patch happens to pass regression >> tests, but I've not checked to see if the existing tests create any >> tables this way. > > AFAIK, that case doesn't arise while processing a query. Examining the > way we create AppendRelInfos during expand_inherited_tables(), it's > clear that we create only one AppendRelInfo for a given child and also > for a given child and parent pair. If there are two tables from which > a child inherits, the child's RTE/RelOptInfo is associated only with > the top-most parent that appears in the query. See following comment > from find_all_inheritors() > /* > * Add to the queue only those children not already seen. This avoids > * making duplicate entries in case of multiple inheritance paths from > * the same parent. (It'll also keep us from getting into an infinite > * loop, though theoretically there can't be any cycles in the > * inheritance graph anyway.) > */ That's right. create table p1 (a int); create table p2 (b int); create table c () inherits (p1, p2); Child table 'c' has two parents but if a query specifies only p1, then just one AppendRelInfo corresponding to p1-c pair will be created. If p2 was also specified in the query, there will be an AppendRelInfo for p2-c pair too, but its child_relid will refer to another RT entry that's created for 'c', that is, the one created when expanding p2. >> Perhaps it's okay to change things this way for just partitioned >> tables and leave inheritance hierarchies alone. > > There's no point having two separate code paths when internally we are > treating them as same. +1 > The only difference between partitioning and > inheritance is that for multi-level partitioned table, we preserve the > inheritance hierarchy whereas for multi-level inheritance, we flatten > it. 
Yeah, the proposed patch doesn't change anything about how many AppendRelInfo's are created, nor about what's contained in them, so it seems safe to say that it would work unchanged for both regular inheritance and partitioning. Thanks, Amit
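The gist of the patch under discussion, in standalone form: instead of scanning append_rel_list to find the AppendRelInfo for a given child relid, index the entries once in an array keyed by child relid so each lookup is O(1). Toy types only; this is not the planner code, and the names are invented.

#include <stdio.h>
#include <stdlib.h>

typedef struct AppendRelInfoStub
{
    int parent_relid;
    int child_relid;
} AppendRelInfoStub;

int
main(void)
{
    /* what expand_inherited_tables() would have put in append_rel_list */
    AppendRelInfoStub list[] = {{1, 2}, {1, 3}, {1, 4}};
    int nrels = 5;              /* stand-in for simple_rel_array_size */

    /* build the by-child-relid index once, as setup-time work */
    AppendRelInfoStub **byrelid = calloc(nrels, sizeof(AppendRelInfoStub *));

    for (int i = 0; i < 3; i++)
        byrelid[list[i].child_relid] = &list[i];

    /* lookups are now a single array fetch instead of a list scan */
    int child = 3;

    printf("parent of rel %d is rel %d\n", child, byrelid[child]->parent_relid);

    free(byrelid);
    return 0;
}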
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 07:46 AM, Jeff Davis wrote: On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote: I expect the eviction strategy to be the primary design challenge of this patch. The other bits will be mostly determined by this one piece. Not sure I agree that this is the primary challenge. The cases that benefit from eviction are probably a minority. I see two categories that would benefit: * Natural clustering in the heap. This sounds fairly common, but a lot of the cases that come to mind are too low-cardinality to be compelling; e.g. timestamps grouped by hour/day/month. If someone has run into a high-cardinality natural grouping case, let me know. * ARRAY_AGG (or similar): individual state values can be large enough that we need to evict to avoid exceeding work_mem even if not adding any new groups. In either case, it seems like a fairly simple eviction strategy would work. For instance, we could just evict the entire hash table if work_mem is exceeded or if the hit rate on the hash table falls below a certain threshold. If there was really something important that should have stayed in the hash table, it will go back in soon anyway. So why should eviction be a major driver for the entire design? I agree it should be an area of improvement for the future, so let me know if you see a major problem, but I haven't been as focused on eviction. My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. But I don't know if there actually is a better option - maybe we simply have to accept this problem. After all, running slowly is still better than OOM (which may or may not happen now). I wonder if we can somehow detect this at run-time and maybe fall-back to groupagg. E.g. we could compare number of groups vs. number of input tuples when we first hit the limit. It's a rough heuristics, but maybe sufficient. While the primary goal of the patch is addressing the OOM risks in hash aggregate, I think it'd be a mistake to see it just that way. I see it could allow us to perform hash aggregate more often, even if we know the groups won't fit into work_mem. If we could estimate how much of the aggregate state we'll have to spill to disk, we could still prefer hashagg over groupagg. We would pay the price for eviction, but on large data sets that can be massively cheaper than having to do sort. Agreed. I ran some tests of my patch in the last round, and they strongly supported choosing HashAgg a lot more often. A lot of sort improvements have been made though, so I really need to run some new numbers. Yeah, Peter made the sort stuff a lot faster. But I think there still is a lot of cases where the hashagg would greatly reduce the amount of data that needs to be written to disk, and that's not something the sort improvements could improve. 
If you need to aggregate a 1TB table, it's still going to be ~1TB of data you need to write to disk during on-disk sort. But if there is only a reasonable number of groups, that greatly simplifies things. far away), but it would be unfortunate to make this improvement impossible/more difficult in the future. If you see anything that would make this difficult in the future, let me know. Sure. Sorry for being too hand-wavy in this thread, and possibly moving the goal posts further away :-/ I got excited by you planning to work on this again and perhaps a bit carried away ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On 2018/06/05 1:25, Alvaro Herrera wrote: > In the meantime, my inclination is to fix the documentation to point out > that AFTER triggers are allowed but BEFORE triggers are not. Wasn't that already fixed by bcded2609ade6? We don't say anything about AFTER triggers per se, but the following under the limitations of partitioned tables: BEFORE ROW triggers, if necessary, must be defined on individual partitions, not the partitioned table. Thanks, Amit
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On Tue, Jun 5, 2018 at 1:07 PM, Amit Langote wrote: > On 2018/06/05 1:25, Alvaro Herrera wrote: >> In the meantime, my inclination is to fix the documentation to point out >> that AFTER triggers are allowed but BEFORE triggers are not. > > Wasn't that already fixed by bcded2609ade6? > > We don't say anything about AFTER triggers per se, but the following under > the limitations of partitioned tables: > > BEFORE ROW triggers, if necessary, must be defined on individual > partitions, not the partitioned table. Though correct, that suggestion is problematic for upgrades. Users will probably create triggers using the same function on all the partitions. After an upgrade, once we start supporting BEFORE triggers on partitioned tables, the user will have to drop these triggers and create one trigger on the partitioned table to have it apply to newly created partitions. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Spilling hashed SetOps and aggregates to disk
On 5 June 2018 at 17:04, Tomas Vondra wrote: > On 06/05/2018 04:56 AM, David Rowley wrote: >> Isn't there still a problem determining when the memory exhaustion >> actually happens though? As far as I know, we've still little >> knowledge how much memory each aggregate state occupies. >> >> Jeff tried to solve this in [1], but from what I remember, there was >> too much concern about the overhead of the additional accounting code. >> >> [1] >> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com >> > > I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested > a couple of people who were originally worried about the overhead seem > to be accepting it now. Is there any great need to make everything pay the small price for this? Couldn't we just create a new MemoryContextMethod that duplicates aset.c, but has the require additional accounting built in at the implementation level, then make execGrouping.c use that allocator for its hash table? The code would not really need to be duplicated, we could just do things the same way as like.c does with like_match.c and include a .c file. We'd need another interface function in MemoryContextMethods to support getting the total memory allocated in the context, but that does not seem like a problem. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 09:22 AM, David Rowley wrote: On 5 June 2018 at 17:04, Tomas Vondra wrote: On 06/05/2018 04:56 AM, David Rowley wrote: Isn't there still a problem determining when the memory exhaustion actually happens though? As far as I know, we've still little knowledge how much memory each aggregate state occupies. Jeff tried to solve this in [1], but from what I remember, there was too much concern about the overhead of the additional accounting code. [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested a couple of people who were originally worried about the overhead seem to be accepting it now. Is there any great need to make everything pay the small price for this? Couldn't we just create a new MemoryContextMethod that duplicates aset.c, but has the require additional accounting built in at the implementation level, then make execGrouping.c use that allocator for its hash table? The code would not really need to be duplicated, we could just do things the same way as like.c does with like_match.c and include a .c file. We'd need another interface function in MemoryContextMethods to support getting the total memory allocated in the context, but that does not seem like a problem. There probably is not a need, but there was not a great way to enable it explicitly only for some contexts. IIRC what was originally considered back in 2015 was some sort of a flag in the memory context, and the overhead was about the same as with the int64 arithmetic alone. But I don't think we've considered copying the whole AllocSet. Maybe that would work, not sure. I wonder if an aggregate might use a custom context internally (I don't recall anything like that). The accounting capability seems potentially useful for other places, and those might not use AllocSet (or at least not directly). All of this of course assumes the overhead is still there. I sure will do some new measurements. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: libpq compression
On Tue, Jun 05, 2018 at 06:04:21PM +1200, Thomas Munro wrote: > On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik > Speaking of configuration, are you planning to support multiple > compression libraries at the same time? It looks like the current > patch implicitly requires client and server to use the same configure > option, without any attempt to detect or negotiate. Do I guess > correctly that a library mismatch would produce an incomprehensible > corrupt stream message? I just had a quick look at this patch, lured by the smell of your latest messages... And it seems to me that this patch needs a large amount of work as presented. There are a couple of things which are not really nice, like forcing the presence of the compression option in the startup packet to begin with. The hijacking around secure_read() is not nice either, as it is aimed at being a rather high-level API on top of the method used with the backend. On top of adding some documentation, I think that you could get some inspiration from the recent GSSAPI encryption patch, which has been submitted again for the v12 cycle and which has spent a large amount of time designing its set of options. -- Michael