Re: Code of Conduct plan
On Mon, Jun 4, 2018 at 6:59 PM, Joshua D. Drake wrote: > On 06/03/2018 04:08 PM, Gavin Flower wrote: > > My comments: >>> >>> 1) Reiterate my contention that this is a solution in search of a problem. >>> Still, it looks like it is going forward, so see below. >>> >>> 2) "... engaging in behavior that may bring the PostgreSQL project into >>> disrepute, ..." >>> This to me is overly broad and pulls in actions that may happen outside >>> the community. Those, if they are actually an issue, should be handled where >>> they occur, not here. >>> >> > This is a good point. There are those who would think that one has performed > an action that brings the project into disrepute, and a similarly sized group > that suggests that in fact that isn't the case. Based on the CoC, this would > be judged by the CoC committee. > > It is my hope that PostgreSQL.Org -Core chooses members for that committee > that are exceedingly diverse; otherwise it is just an echo chamber for a > single ideology, and that will destroy this community. If I may suggest: the committee should also be international and include people from around the world. The last thing we want is for it to be dominated by people from one particular cultural viewpoint. > > > >>> 3) "... members must be sensitive to conduct that may be considered >>> offensive by fellow members and must refrain from engaging in such conduct. >>> " >>> >> > Again overly broad, especially given the hypersensitivity of people these >>> days. I have found that it is enough to disagree with someone to have it >>> called offensive. This section should be removed, as proscribed behavior is >>> called out in detail in the paragraphs above it. >>> >> > "considered offensive by fellow members" > > Is definitely too broad. The problem comes in here: > > I might possibly say that "I'm the master in this area" when talking to >> someone on a technical subject. In the sense that I'm better at that >> particular skill, but some hypersensitive American could get their knickers >> in a twist (notice that, in this context, no gender is implied -- also, >> using that expression "get their knickers in a twist" could offend >> some snowflake) claiming that I'm suggesting that whoever >> > > "snowflake": I find that term hilarious; others find it highly offensive. > Which is correct? I agree with both concerns in the above exchange. This is a common economic project. The goal should be for people to come together and act civilly. Waging culture war using the code of conduct should itself be a violation of the code of conduct, and this applies to *all* (not just one or two) sides. > > > I'm talking to is my slave! I heard of an American university that >> doesn't want people to use the term master, like in an MSc, because of the >> history of slavery. >> > > The PostgreSQL project has already run into this problem; note that we don't use the > terms Master and Slave in reference to replication anymore. > > >> I've used the expressions "sacrifice a willing virgin" and "offering my >> first born to the gods" as ways to ensure success in resolving a technical >> issue. The people I say that to know what I mean -- and they implicitly >> know that I'm not seriously suggesting such conduct. Yet, if I wrote that >> publicly, it is conceivable that someone might object! >> > > Yes, and that is a problem. We need to have some simple baseline of > acceptance that we are all adults here (or should act like adults). Knowing > your audience is important. 
I would also point out that the PostgreSQL community is nice and mature. At PGConf US I saw what appeared to be two individuals with red MAGA hats. And yet everyone managed to be civil. We manage to do better than the US does on the whole in this regard, and we should be proud of ourselves. > > > Consider a past advertising campaign in Australia to sell government >> Bonds. They used two very common hand gestures that are very Australian. >> Bond sales dropped. On investigation, they found the bonds were mainly >> bought by old Greek people, who found the gestures obscene. The gestures? >> Thumbs up, and the okay gesture formed by touching the thumb with the next >> finger -- nothing sexually suggestive to most Australians, but traditional >> Greeks found them offensive. >> > > Using Australia as an example, my understanding is that the word c**t is > part of the everyday nomenclature, but in the States the word is taboo and highly frowned > upon. Again, the key point is that a CoC committee needs to be international and used to addressing these sorts of issues. > > > Be very careful in attempting to codify 'correct' behaviour! >> >> > Correct. I think one way to look at all of this is, "if you wouldn't say > it to your boss or a client, don't say it here". That too has problems, but > generally speaking I think it keeps the restrictions rational. > > I will post a more specific set of thoughts here but in general I think the presumption ought to be that people are trying to work t
Re: commitfest 2018-07
On 2018-06-05 10:20:55 -0400, Tom Lane wrote: > Andres Freund writes: > > I'd rather create a new 2018-07, and just manually move old patches to > > it. Otherwise we'll not really focus on the glut of old things, but > > everyone just restarts working on their own new thing. > > I thought the idea was to clear out the underbrush of small, ready-to-go > patches. How old they are doesn't enter into that. > > There's a separate issue about what to do to prioritize old patches so > they don't linger indefinitely. We had a discussion about that at the > dev meeting, but I don't think any specific mechanism was agreed to? I think we've not fully agreed on that. I'd argue we should manually filter things into the next CF. And both small patches and older things should qualify. Greetings, Andres Freund
Re: Concurrency bug in UPDATE of partition-key
Attached is a rebased patch version. Also included it in the upcoming commitfest : https://commitfest.postgresql.org/18/1660/ In the rebased version, the new test cases are added in the existing isolation/specs/partition-key-update-1.spec test. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company fix_concurrency_bug_rebased.patch Description: Binary data
automating perl compile time checking
Here is a follow-up patch to last week's commit allowing all perl files to be checked clean for compile time errors and warnings. The patch contains a simple script to run the checks. The code that finds perl files is put in a function in a single file that is sourced by the three locations that need it. The directory pgperlcritic is renamed to perlcheck, as it now contains the new script as well as pgperlcritic. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 6247cd1fbba16426bb1745521d7b444d60ebbd0a Mon Sep 17 00:00:00 2001 From: Andrew Dunstan Date: Tue, 5 Jun 2018 10:25:55 -0400 Subject: [PATCH] Extend and slightly refactor perl checking A script is added to check for perl compile time errors and warnings. The code to detect all the perl files is put in a central place and included in the three places it is now needed. The directory pgperlcritic is renamed to perlcheck, and contains the new check script as well as pgperlcritic. --- src/tools/perlcheck/find_perl_files| 15 src/tools/{pgperlcritic => perlcheck}/perlcriticrc | 0 src/tools/perlcheck/pgperlcritic | 20 src/tools/perlcheck/pgperlsyncheck | 17 + src/tools/pgindent/pgperltidy | 14 +++ src/tools/pgperlcritic/pgperlcritic| 28 -- 6 files changed, 55 insertions(+), 39 deletions(-) create mode 100644 src/tools/perlcheck/find_perl_files rename src/tools/{pgperlcritic => perlcheck}/perlcriticrc (100%) create mode 100755 src/tools/perlcheck/pgperlcritic create mode 100755 src/tools/perlcheck/pgperlsyncheck delete mode 100755 src/tools/pgperlcritic/pgperlcritic diff --git a/src/tools/perlcheck/find_perl_files b/src/tools/perlcheck/find_perl_files new file mode 100644 index 000..e10466a --- /dev/null +++ b/src/tools/perlcheck/find_perl_files @@ -0,0 +1,15 @@ + +# src/tools/perlcheck/find_perl_files + +# shell function to find all perl files in the source tree + +find_perl_files () { +{ + # take all .pl and .pm files + find . -type f -name '*.p[lm]' -print + # take executable files that file(1) thinks are perl files + find . -type f -perm -100 -exec file {} \; -print | + egrep -i ':.*perl[0-9]*\>' | + cut -d: -f1 + } | sort -u +} diff --git a/src/tools/pgperlcritic/perlcriticrc b/src/tools/perlcheck/perlcriticrc similarity index 100% rename from src/tools/pgperlcritic/perlcriticrc rename to src/tools/perlcheck/perlcriticrc diff --git a/src/tools/perlcheck/pgperlcritic b/src/tools/perlcheck/pgperlcritic new file mode 100755 index 000..6a31f67 --- /dev/null +++ b/src/tools/perlcheck/pgperlcritic @@ -0,0 +1,20 @@ +#!/bin/sh + +# src/tools/pgperlcritic/pgperlcritic + +test -f src/tools/perlcheck/perlcriticrc || { + echo could not find src/tools/perlcheck/perlcriticrc + exit 1 + } + +set -e + +# set this to override default perlcritic program: +PERLCRITIC=${PERLCRITIC:-perlcritic} + +. src/tools/perlcheck/find_perl_files + +find_perl_files | xargs $PERLCRITIC \ + --quiet \ + --program-extensions .pl \ + --profile=src/tools/perlcheck/perlcriticrc diff --git a/src/tools/perlcheck/pgperlsyncheck b/src/tools/perlcheck/pgperlsyncheck new file mode 100755 index 000..4595d16 --- /dev/null +++ b/src/tools/perlcheck/pgperlsyncheck @@ -0,0 +1,17 @@ +#!/bin/sh + +# script to detect compile time errors and warnings in all perl files + +INCLUDES="-I src/tools/msvc -I src/tools/msvc/dummylib -I src/backend/catalog" +INCLUDES="-I src/test/perl -I src/backend/utils/mb/Unicode $INCLUDES" +INCLUDES="-I src/bin/pg_rewind -I src/test/ssl $INCLUDES" + +set -e + +. 
src/tools/perlcheck/find_perl_files + +# for zsh +setopt shwordsplit 2>/dev/null || true + +find_perl_files | xargs -L 1 perl $INCLUDES -cw 2>&1 | grep -v OK + diff --git a/src/tools/pgindent/pgperltidy b/src/tools/pgindent/pgperltidy index 5d9aa7c..5e70411 100755 --- a/src/tools/pgindent/pgperltidy +++ b/src/tools/pgindent/pgperltidy @@ -7,14 +7,6 @@ set -e # set this to override default perltidy program: PERLTIDY=${PERLTIDY:-perltidy} -# locate all Perl files in the tree -( - # take all .pl and .pm files - find . -type f -a \( -name '*.pl' -o -name '*.pm' \) - # take executable files that file(1) thinks are perl files - find . -type f -perm -100 -exec file {} \; | - egrep -i ':.*perl[0-9]*\>' | - cut -d: -f1 -) | -sort -u | -xargs $PERLTIDY --profile=src/tools/pgindent/perltidyrc +. src/tools/perlcheck/find_perl_files + +find_perl_files | xargs $PERLTIDY --profile=src/tools/pgindent/perltidyrc diff --git a/src/tools/pgperlcritic/pgperlcritic b/src/tools/pgperlcritic/pgperlcritic deleted file mode 100755 index 28264b1..000 --- a/src/tools/pgperlcritic/pgperlcritic +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/sh - -# src/tools/pgperlcritic/pgperlcritic - -test -f src/tools/pgperlcritic/perlcriticrc || {
Re: plans for PostgreSQL 12
2018-06-05 15:00 GMT+02:00 Andres Freund : > Hi, > > On 2018-06-05 06:32:31 +0200, Pavel Stehule wrote: > > ./configure --with-libxml --enable-tap-tests --enable-debug --with-perl > > CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" > > > > [pavel@nemesis postgresql]$ gcc --version > > gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1) > > > > I executed a simple script > > > > do $$ declare i bigint = 1; s bigint = 0; begin while i <= 1 > loop > > s := s + i; i := i + 1; end loop; raise notice '%', s; end $$; > > > >7,68% postmaster postgres [.] > > GetSnapshotData >7,53% postmaster plpgsql.so [.] > > exec_eval_simple_expr >6,49% postmaster postgres [.] > > It seems to me the right fix here isn't a new class of functions, but > rather support for delaying the computation of the snapshot to the point > it's needed. That'll be far more generically applicable and doesn't > require user interaction. > Good idea. That could be a quick fix. > > > > ExecInterpExpr >4,13% postmaster postgres [.] > > So we're going to need to optimize this further as well, I've a pending > patch for that, luckily ;) > nice :) > > LWLockRelease >4,12% postmaster postgres [.] > > That's also GetSnapshotData()... > there is still about 10% spent locking and unlocking the plan cache. Regards Pavel > Greetings, > > Andres Freund >
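To illustrate the "delay the snapshot until it's needed" idea in generic terms, here is a minimal C sketch; the types and function names are invented for illustration and are not the actual snapshot-manager code.

#include <stdbool.h>

/* Hypothetical stand-in for a snapshot; real fields omitted. */
typedef struct LazySnapshot
{
    bool        computed;
    /* ... xmin, xmax, running-xact list, etc. would live here ... */
} LazySnapshot;

/* The expensive part, analogous to GetSnapshotData(): run only on demand. */
static void
compute_snapshot(LazySnapshot *snap)
{
    /* ... scan shared state and fill in the snapshot fields ... */
    snap->computed = true;
}

/* Callers obtain the snapshot lazily, paying the cost only on first use. */
static LazySnapshot *
get_snapshot_lazily(LazySnapshot *snap)
{
    if (!snap->computed)
        compute_snapshot(snap);
    return snap;
}

In such a scheme, a simple expression that never reads table data would never trigger the expensive computation at all, which is the benefit being suggested here.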
Re: adding tab completions
On Sun, Jun 03, 2018 at 10:39:22PM -0500, Justin Pryzby wrote: > Find attached an update which also supports column completion using the legacy > non-parenthesized syntax. Thank you! > BTW..should that be toast.tuple_target ?? I think it shouldn't. From the documentation: "This parameter cannot be set for TOAST tables". > > Also I think it could be good to list column names after parentheses, > > but I'm not sure if it is easy to implement. > I tried this and nearly gave up, but see attached. After some thought, I now think that this is not such a good idea. The code doesn't look good either. It isn't worth it; sorry for the distraction. Moreover, there is no such completion, for example, for the following command (it shows only the first column): CREATE INDEX ON test ( > - "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION", > "RULE", > + "SERVER", "INDEX", "LANGUAGE", "POLICY", "PUBLICATION", Is this a typo? I think it is still possible to comment on rules. > else if (HeadMatches1("EXPLAIN") && previous_words_count==2 && > prev_wd[0]=='(' && ends_with(prev_wd, ')')) I think this condition can be replaced by: else if (TailMatches2("EXPLAIN", MatchAny) && ends_with(prev_wd, ')')) It looks better to me. Such a condition is used for other commands and works the same way. The last point I've noticed: there is no VERBOSE entry after the VACUUM FULL ANALYZE command anymore. I'm not sure how this patch should be committed. Can it be committed outside the commitfest? Otherwise, please add it to the next commitfest so it isn't forgotten. Thoughts? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: commitfest 2018-07
Andres Freund writes: > I'd rather create a new 2018-07, and just manually move old patches to > it. Otherwise we'll not really focus on the glut of old things, but > everyone just restarts working on their own new thing. I thought the idea was to clear out the underbrush of small, ready-to-go patches. How old they are doesn't enter into that. There's a separate issue about what to do to prioritize old patches so they don't linger indefinitely. We had a discussion about that at the dev meeting, but I don't think any specific mechanism was agreed to? regards, tom lane
Re: install doesn't work on HEAD
Hi, On Tue, Jun 5, 2018 at 7:08 PM, Amit Kapila wrote: > > Hi, > > On my win-7, I am facing $SUBJECT. I am consistently getting below error: > Copying build output files...Could not copy postgres.exe > at install.pl line 26. > > On digging, I found that it started failing with commit 3a7cc727 (Don't fall > off the end of perl functions). It seems that we have added empty 'return' > in all of the functions which seems to break one of the case: > > lcopy("$conf\\$pf\\$pf.$ext", "$target\\$dir\\$pf.$ext") > || croak "Could not copy $pf.$ext\n"; > > I think the return from lcopy implies 0 (or undef) which cause it to fail. > By reading couple of articles on net [1][2] related to this, it seems we > can't blindly return in all cases. On changing, the lcopy as below, install > appears to be working. > > @@ -40,7 +40,7 @@ sub lcopy > copy($src, $target) > || confess "Could not copy $src to $target\n"; > - return; > + return 1; > } > > I have randomly check some of the other functions where this patch has added > 'return' and those seem to be fine. > > Is it by any chance related Perl version or some other settings? If not, > then we should do something for it. > > [1] - https://perlmaven.com/how-to-return-undef-from-a-function > [2] - > http://search.cpan.org/dist/Perl-Critic/lib/Perl/Critic/Policy/Subroutines/RequireFinalReturn.pm I am not able to reproduce this issue on HEAD. Seems like the following git-commit fixes it, commit 01deec5f8ae64b5120cc8c93d54fe0e19e477b02 Author: Andrew Dunstan Date: Mon May 28 16:44:13 2018 -0400 Return a value from Install.pm's lcopy function Commit 3a7cc727c was a little over eager about adding an explicit return to this function, whose value is checked in most call sites. This change reverses that and returns the expected value explicitly. It also adds a check to the one call site lacking one. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com
Re: Loaded footgun open_datasync on Windows
Amit Kapila wrote: > On Fri, Jun 1, 2018 at 8:18 PM, Laurenz Albe wrote: > > Amit Kapila wrote: > > > On Fri, Jun 1, 2018 at 3:13 PM, Laurenz Albe > > > wrote: > > > > I recently read our documentation about reliability on Windows: > > > > > > > > > On Windows, if wal_sync_method is open_datasync (the default), write > > > > > caching can > > > > > be disabled by unchecking > > > > > My Computer\Open\disk > > > > > drive\Properties\Hardware\Properties\Policies\Enable write caching > > > > > on the disk. Alternatively, set wal_sync_method to fsync or > > > > > fsync_writethrough, > > > > > which prevent write caching. > > > > > > > > It seems dangerous to me to initialize "wal_sync_method" to a method > > > > that is unsafe > > > > by default. Admittedly I am not a Windows man, but the fact that this > > > > has eluded me > > > > up to now leads me to believe that other people running PostgreSQL on > > > > Windows might > > > > also have missed that important piece of advice and are consequently > > > > running with > > > > an unsafe setup. > > > > > > > > Wouldn't it be smarter to set a different default value on Windows, > > > > like we do on > > > > Linux (for other reasons)? > > > > > > > > > > One thing to note is that it seems that in code we use > > > FILE_FLAG_WRITE_THROUGH for > > > open_datasync which according to MSDN [1] will bypass any intermediate > > > cache. > > > See pgwin32_open. Have you experimented with setting any other option, as we > > > have a comment > > > in code which says Win32 only has O_DSYNC? > > > > > > > > > [1] - > > > https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx > > > > After studying the code I feel somewhat safer; it looks like the code is ok. > > I have no Windows at hand, so I cannot test right now. > > > > What happened is that I ran "pg_test_fsync" at a customer site on Windows, > > and > > it returned ridiculously high rates for open_datasync. > > > > So I think that the following should be fixed: > > > > - Change pg_test_fsync to actually use FILE_FLAG_WRITE_THROUGH. > > It sounds sensible to me to make a Windows specific change in pg_test_fsync > for the open_datasync method. > That will make pg_test_fsync behave similarly to the server. The attached patch makes pg_test_fsync use pgwin32_open on Windows, which is what we use elsewhere. That should fix the problem. Is there a better way to do this? The problem is that "c.h" is only included at the very end of "postgres_fe.h", which makes front-end code use "open" rather than "pgwin32_open" on Windows. Having read it again, I think that the documentation is fine as it is: after all, this is just advice on what you can do if you are running on unsafe hardware, which doesn't flush to disk like it should. Yours, Laurenz Albe From 6ab651c48df66ec93b8d6730bebeaf60e2bb865b Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Tue, 5 Jun 2018 16:05:10 +0200 Subject: [PATCH] Make pg_test_fsync use pgwin32_open on Windows --- src/bin/pg_test_fsync/pg_test_fsync.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c index e6f7ef8557..a0139a1302 100644 --- a/src/bin/pg_test_fsync/pg_test_fsync.c +++ b/src/bin/pg_test_fsync/pg_test_fsync.c @@ -3,6 +3,8 @@ * tests all supported fsync() methods */ +/* we have to include c.h first so that pgwin32_open is used on Windows */ +#include "c.h" #include "postgres_fe.h" #include -- 2.14.3
Re: commitfest 2018-07
Michael Paquier writes: > Okay. If we tend toward this direction, I propose to do this switch in > two days my time (Thursday afternoon in Tokyo) if there are no > objections, so as anybody has hopefully time to argue back. I think we probably have to wait longer. It's not clear to me when the RMT will decide that the 07 fest is go or no-go, but surely they've not decided yet. regards, tom lane
Re: libpq compression
On 05.06.2018 09:04, Thomas Munro wrote: On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik wrote: Concerning specification of compression level: I have run many experiments with different data sets and both zlib/zstd, and in both cases using a compression level higher than the default doesn't give a noticeable increase in compression ratio, but quite significantly reduces speed. Moreover, for "pgbench -i" zstd provides a better compression ratio (63 times!) with compression level 1 than with the largest recommended compression level 22! This is why I decided not to allow the user to choose the compression level. Speaking of configuration, are you planning to support multiple compression libraries at the same time? It looks like the current patch implicitly requires client and server to use the same configure option, without any attempt to detect or negotiate. Do I guess correctly that a library mismatch would produce an incomprehensible corrupt stream message? Frankly speaking, I am not sure that support for multiple compression libraries at the same time is actually needed. If we build Postgres from sources, then both frontend and backend libraries will use the same compression library. zlib is available almost everywhere, and Postgres is using it in any case. zstd is faster and provides a better compression ratio. So in principle it may be useful to try zstd first and, if it is not available, fall back to zlib. That would require dynamic loading of these libraries. libpq stream compression is not the only place where compression is used in Postgres. So I think that the problem of choosing a compression algorithm and supporting custom compression methods should be solved at some upper level. There is a patch for custom compression methods in the commitfest. Maybe it should be combined with this one. Right now, if the client and server libpq libraries were built with different compression libraries, a decompression error will be reported. Supporting multiple compression methods will require a more sophisticated handshake protocol so that the client and server can choose a compression method which is supported by both of them. But certainly it can be done. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
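To sketch what such a handshake could look like in C (invented function and parameter names; this is not part of the patch or of the libpq protocol), each side could advertise the algorithms it was built with and the first match could win:

#include <string.h>

/*
 * Pick the first compression algorithm supported by both sides, preferring
 * the client's ordering.  Returns NULL if there is no common algorithm, in
 * which case the connection would simply proceed uncompressed.
 */
static const char *
choose_compression(const char *const *client_algs, int nclient,
                   const char *const *server_algs, int nserver)
{
    for (int i = 0; i < nclient; i++)
        for (int j = 0; j < nserver; j++)
            if (strcmp(client_algs[i], server_algs[j]) == 0)
                return client_algs[i];
    return NULL;
}

With something along these lines, a client built with both zstd and zlib could still talk to a server built with only zlib, and a complete mismatch would degrade to uncompressed traffic instead of a corrupt-stream error.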
Re: libpq compression
On 05.06.2018 08:26, Thomas Munro wrote: On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik wrote: Thank you for this notice. Updated and rebased patch is attached. Hi Konstantin, Seems very useful. +1. + rc = inflate(&zs->rx, Z_SYNC_FLUSH); + if (rc != Z_OK) + { + return ZPQ_DECOMPRESS_ERROR; + } Does this actually guarantee that zs->rx.msg is set to a string? I looked at some documentation here: https://www.zlib.net/manual.html It looks like return value Z_DATA_ERROR means that msg is set, but for the other error codes Z_STREAM_ERROR, Z_BUF_ERROR, Z_MEM_ERROR it doesn't explicitly say that. From a casual glance at https://github.com/madler/zlib/blob/master/inflate.c I think it might be set to Z_NULL and then never set to a string except in the mode = BAD paths that produce the Z_DATA_ERROR return code. That's interesting because later we do this: + if (r == ZPQ_DECOMPRESS_ERROR) + { + ereport(COMMERROR, + (errcode_for_socket_access(), + errmsg("Failed to decompress data: %s", zpq_error(PqStream; + return EOF; ... where zpq_error() returns zs->rx.msg. That might crash or show "(null)" depending on libc. Also, message style: s/F/f/ +ssize_t zpq_read(ZpqStream* zs, void* buf, size_t size, size_t* processed) Code style: We write "Type *foo", not "Type* var". We put the return type of a function definition on its own line. It looks like there is at least one place where zpq_stream.o's symbols are needed but it isn't being linked in, so the build fails in some ecpg stuff reached by make check-world: gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -pthread -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS test1.o -L../../../../../src/port -L../../../../../src/common -L../../ecpglib -lecpg -L../../pgtypeslib -lpgtypes -L../../../../../src/interfaces/libpq -lpq -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql/lib',--enable-new-dtags -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -lcrypt -ldl -lm -o test1 ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_free' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_error' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_read' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_buffered' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_create' ../../../../../src/interfaces/libpq/libpq.so: undefined reference to `zpq_write' Hi Thomas, Thank you for review. Updated version of the patch fixing all reported problems is attached. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/configure b/configure index 0aafd9c..fc5685c 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ ELF_SYS EGREP GREP with_zlib +with_zstd with_system_tzdata with_libxslt with_libxml @@ -863,6 +864,7 @@ with_libxml with_libxslt with_system_tzdata with_zlib +with_zstd with_gnu_ld enable_largefile enable_float4_byval @@ -8017,6 +8019,86 @@ fi # +# ZStd +# + + + +# Check whether --with-zstd was given. +if test "${with_zstd+set}" = set; then : + withval=$with_zstd; + case $withval in +yes) + ;; +no) + : + ;; +*) + as_fn_error $? 
"no argument expected for --with-zstd option" "$LINENO" 5 + ;; + esac + +else + with_zstd=no + +fi + + + + +if test "$with_zstd" = yes ; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5 +$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; } +if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_check_lib_save_LIBS=$LIBS +LIBS="-lzstd $LIBS" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char ZSTD_compress (); +int +main () +{ +return ZSTD_compress (); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + ac_cv_lib_zstd_ZSTD_compress=yes +else + ac_cv_lib_zstd_ZSTD_compress=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +LIBS=$ac_check_lib_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5 +$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; } +if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then : + cat >>confdefs.h <<_ACEOF +#define HAVE_LIBZSTD 1 +_ACEOF + + LIBS="-lzstd $LIBS" + +else + as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5 +fi + +fi + + + +# # Elf # diff --git a/src/Makefile.global.in b/src/Makefile.global.in index 95d090e..b59629c 100644 --- a/src/Makefi
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund wrote: > On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote: > > It appears that buffer replacement happening inside the relation > > extension lock is affected by starvation on exclusive buffer mapping > > lwlocks and buffer content lwlocks, caused by many concurrent shared > > lockers. So, the fair lwlock patch has no direct influence on the relation > > extension lock, which is naturally not even an lwlock... > > Yea, that makes sense. I wonder how much the fix here is to "pre-clear" > a victim buffer, and how much is a saner buffer replacement > implementation (either by going away from O(NBuffers), or by having a > queue of clean victim buffers like my bgwriter replacement). The particular thing I observed in our environment is BufferAlloc() waiting for hours on a buffer partition lock. Increasing NUM_BUFFER_PARTITIONS didn't give any significant help. It appears that a very hot page (the root page of some frequently used index) resides in that partition, so this partition was continuously under shared lock. So, in order to resolve this without changing LWLock, we should probably move our buffer hash table to something lockless. > > I'll post the fair lwlock patch in a separate thread. It requires detailed > > consideration and benchmarking, because there is a risk of regression > > on specific workloads. > > I bet that doing it naively will regress massively in a number of cases. Yes, I suspect the same. However, I tend to think that something is wrong with LWLock itself. It seems to be the only one of our locks that lets some lockers starve almost indefinitely under certain workloads. In contrast, even our SpinLock gives all the waiting processes nearly the same chance to acquire it. So, I think the idea of improving LWLock in this respect deserves at least further investigation. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:41 PM, se...@rielau.com wrote: In our code base we have added a WithStats-Flavor for creating memory contexts. This api accepts a pointer to metric for accounting and it is inherited by all subcontexts unless overridden. So we only needed to change context creation API where we wanted (such as TopTansactionContext, Message Context, ..) That's quite trivial, actually. I think that's pretty much what we tried when this patch was first discussed in 2015, but it added measurable overhead (1-3%) even when the accounting was disabled. Which is not quite desirable, of course. Also we have fixed all those missing hash spills - albeit based on the 9.6 hash table design I think. I don't think this part of the code changed all that much. If there is interest by the community we are very willing to share. Absolutely! I think it'd be great to share both the code and the reasoning behind the design. Cheers Serge Salesforce cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Locking B-tree leafs immediately in exclusive mode
Hi! Currently _bt_search() always locks the leaf buffer in shared mode (aka BT_READ), while the caller can relock it later. However, I don't see what prevents _bt_search() from locking the leaf immediately in exclusive mode (aka BT_WRITE) when required. When traversing a downlink from a non-leaf page of level 1 (the lowest level of non-leaf pages, just prior to the leaf pages), we know that the next page is going to be a leaf. In this case, we can immediately lock the next page in exclusive mode if required. For sure, it might happen that we don't manage to exclusively lock the leaf this way when _bt_getroot() points us to a leaf page, but that case can be handled in _bt_search() by a relock. Please find an implementation of this approach in the attached patch. I've run the following simple test of this patch on a 72-core machine. # postgresql.conf max_connections = 300 shared_buffers = 32GB fsync = off synchronous_commit = off # DDL: CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null); CREATE UNLOGGED TABLE unordered (i integer not null, value text not null); # script_ordered.sql INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz'); # script_unordered.sql \set i random(1, 100) INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz'); # commands pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres # results ordered, master: 157473 TPS ordered, patched: 231374 TPS unordered, master: 232372 TPS unordered, patched: 232535 TPS As you can see, the difference in the unordered case is negligible: due to random inserts, concurrency on particular leaves is low. But ordered insertion is almost 50% faster on the patched version. I wonder how we could have missed such a simple optimization until now, but I also don't see this patch breaking anything. In the patched version, it might happen that we have to traverse rightlinks in exclusive mode due to splits concurrent with the downlink traversal. However, the same might happen in current master due to splits concurrent with relocks. So, I don't expect any performance regression to be caused by this patch. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company btree-search-write-lock-1.patch Description: Binary data
RE: Re: Spilling hashed SetOps and aggregates to disk
In our code base we have added a WithStats flavor for creating memory contexts. This API accepts a pointer to a metric for accounting, and it is inherited by all subcontexts unless overridden. So we only needed to change the context creation API where we wanted (such as TopTransactionContext, MessageContext, ...). That's quite trivial, actually. Also, we have fixed all those missing hash spills, albeit based on the 9.6 hash table design, I think. If there is interest from the community, we are very willing to share. Cheers Serge Salesforce
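For readers who have not seen the earlier accounting proposals, the idea can be illustrated with a stripped-down sketch in plain C; the names are invented and this is not PostgreSQL's MemoryContext API. An accounting pointer is supplied at context creation and inherited by child contexts unless overridden, so all allocations under, say, a per-query context roll up into one counter.

#include <stdlib.h>

typedef struct StatsContext
{
    struct StatsContext *parent;
    size_t     *used;           /* shared accounting target, may be NULL */
} StatsContext;

static StatsContext *
stats_context_create(StatsContext *parent, size_t *metric)
{
    StatsContext *cxt = malloc(sizeof(StatsContext));

    cxt->parent = parent;
    /* inherit the parent's metric unless the caller overrides it */
    cxt->used = metric ? metric : (parent ? parent->used : NULL);
    return cxt;
}

static void *
stats_alloc(StatsContext *cxt, size_t size)
{
    if (cxt->used)
        *cxt->used += size;     /* account before handing out the memory */
    return malloc(size);
}

A matching free path would subtract the size again; checking a memory limit then only requires reading a single counter, no matter how many child contexts exist underneath.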
Re: why partition pruning doesn't work?
On Tue, Jun 5, 2018 at 6:24 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: >> On 5 June 2018 at 12:31, Amit Langote wrote: >> >> doesn't look quite right. What says expr is really a Param? The patch >> appears to work because, by setting pinfo->execparams to *something*, it >> triggers execution-time pruning to run; its contents aren't necessarily >> used during execution pruning. In fact, it would've crashed if the >> execution-time pruning code had required execparams to contain *valid* >> param id, but currently it doesn't. >> >> What I think we'd need to do to make this work is to make execution-time >> pruning be invoked even if there aren't any Params involved. IOW, let's >> try to teach make_partition_pruneinfo that it can go ahead also in the >> cases where there are expressions being compared with the partition key >> that contain (only) stable functions. Then, go and fix the >> execution-pruning code to not *always* expect there to be Params to prune >> with. > Yeah, I agree - I copied this approach mindlessly from the original hacky > patch. So, looks like it's necessary to have something like got_stable_expr > together with gotparam. I think the current code relies heavily on Params being present for partition pruning, which need not be the case. Runtime partition pruning is possible when there are comparison conditions with partition key expressions on one side and "execution time constant" expressions on the other side. By "execution time constant" expression, I mean any expression that evaluates to a constant at the time of execution, like stable expressions (not just functions) or a Param expression. I can think of only these two at this time, but there can be more. So, gotparam should be renamed to "gotprunable_cond" to be generic. pull_partkey_params() should be renamed to "pull_partkey_conds" or something generic. That function would return true if there exists an expression in the steps which can be evaluated to a constant at runtime; otherwise it returns false. My guess is there will be false-positives which need to be dealt with later, but there will be no false-negatives. > And after that the only place where I see Params > are in use is partkey_datum_from_expr where all the stuff is actually > evaluated. So apparently this part about "fix the execution-pruning code to > not > *always* expect there to be Params to prune with" will be only about this > function - am I correct or there is something else that I missed? Yes. But I think trying to evaluate parameters in this function is not good. The approach of folding constant expressions before or immediately after execution starts doesn't require the expressions to be evaluated in partkey_datum_from_expr, and might benefit other places where stable expressions or params can appear. Another problem with partkey_datum_from_expr() seems to be that it evaluates only Param nodes, but not expressions involving parameters which could be folded into constants at runtime. Take, for example, the following queries on table t1 with two partitions (0, 100) and (100, 200), populated using "insert into t1 select i, i from generate_series(0, 199) i;". There's an index on t1(a). 
explain analyze select * from t1 x left join t1 y on x.a = y.b where y.a = 5; QUERY PLAN Nested Loop (cost=0.00..6.78 rows=1 width=16) (actual time=0.033..0.066 rows=1 loops=1) -> Append (cost=0.00..2.25 rows=1 width=8) (actual time=0.019..0.035 rows=1 loops=1) -> Seq Scan on t1p1 y (cost=0.00..2.25 rows=1 width=8) (actual time=0.018..0.035 rows=1 loops=1) Filter: (a = 5) Rows Removed by Filter: 99 -> Append (cost=0.00..4.51 rows=2 width=8) (actual time=0.011..0.027 rows=1 loops=1) -> Seq Scan on t1p1 x (cost=0.00..2.25 rows=1 width=8) (actual time=0.006..0.022 rows=1 loops=1) Filter: (y.b = a) Rows Removed by Filter: 99 -> Seq Scan on t1p2 x_1 (cost=0.00..2.25 rows=1 width=8) (never executed) Filter: (y.b = a) Planning Time: 0.644 ms Execution Time: 0.115 ms (13 rows) t1p2 x_1 is never scanned indicating that run time partition pruning happened. But then see the following query postgres:17889=#explain analyze select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5; QUERY PLAN -- Nested Loop (cost=0.00..7.28 rows=1 width=16) (actual time=0.055..0.093 rows=1 loops=1) -> Append (cost=0.00..2.25 rows=1 width=8) (actual time=0.017..0.034 rows=1 loops=1) -> Seq Scan on t1p1 y (cost=0.00..2.25 rows=1 width=8) (actual time=0.016..0.033 rows=1 loops=1) Filter: (a = 5) Rows Removed by Filter: 99
install doesn't work on HEAD
Hi, On my win-7, I am facing $SUBJECT. I am consistently getting the below error: Copying build output files...Could not copy postgres.exe at install.pl line 26. On digging, I found that it started failing with commit 3a7cc727 (Don't fall off the end of perl functions). It seems that we have added an empty 'return' in all of the functions, which breaks one of the cases: lcopy("$conf\\$pf\\$pf.$ext", "$target\\$dir\\$pf.$ext") || croak "Could not copy $pf.$ext\n"; I think the return from lcopy implies 0 (or undef), which causes it to fail. From reading a couple of articles on the net [1][2] related to this, it seems we can't blindly return in all cases. On changing lcopy as below, install appears to be working. @@ -40,7 +40,7 @@ sub lcopy copy($src, $target) || confess "Could not copy $src to $target\n"; - return; + return 1; } I have randomly checked some of the other functions where this patch added 'return', and those seem to be fine. Is it by any chance related to the Perl version or some other setting? If not, then we should do something about it. [1] - https://perlmaven.com/how-to-return-undef-from-a-function [2] - http://search.cpan.org/dist/Perl-Critic/lib/Perl/Critic/Policy/Subroutines/RequireFinalReturn.pm -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 03:17 PM, David Rowley wrote: On 6 June 2018 at 01:09, Andres Freund wrote: On 2018-06-06 01:06:39 +1200, David Rowley wrote: My concern is that only accounting memory for the group and not the state is only solving half the problem. It might be fine for aggregates that don't stray far from their aggtransspace, but for the other ones, we could still see OOM. If solving the problem completely is too hard, then a half fix (maybe 3/4) is better than nothing, but if we can get a design for a full fix before too much work is done, then isn't that better? I don't think we actually disagree. I was really primarily talking about the case where we can't really do better because we don't have serialization support. I mean we could just rescan from scratch, using a groupagg, but that obviously sucks. I don't think we do. To take yours to the 100% solution might just take adding the memory accounting to palloc that Jeff proposed a few years ago and use that accounting to decide when we should switch method. However, I don't quite fully recall how the patch accounted for memory consumed by sub-contexts and if getting the entire consumption required recursively looking at subcontexts. If that's the case then checking the consumption would likely cost too much if it was done after each tuple was aggregated. Yeah, a simple wrapper would not really fix the issue, because allocating memory in the subcontext would not update the accounting information. So we'd be oblivious of possibly large amounts of memory. I don't quite see how is this related to (not) having serialization support, as it's more about not knowing we already hit the memory limit. That being said, I don't think array_agg actually has the issue, at least since b419865a814abbca12bdd6eef6a3d5ed67f432e1 (it does behave differently in aggregate and non-aggregate contexts, IIRC). I don't know how common this issue is, though. I don't think any built-in aggregates create additional contexts, but I may be wrong. But we have this damn extensibility thing, where users can write their own aggregates ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 01:09, Andres Freund wrote: > On 2018-06-06 01:06:39 +1200, David Rowley wrote: >> My concern is that only accounting memory for the group and not the >> state is only solving half the problem. It might be fine for >> aggregates that don't stray far from their aggtransspace, but for the >> other ones, we could still see OOM. > >> If solving the problem completely is too hard, then a half fix (maybe >> 3/4) is better than nothing, but if we can get a design for a full fix >> before too much work is done, then isn't that better? > > I don't think we actually disagree. I was really primarily talking > about the case where we can't really do better because we don't have > serialization support. I mean we could just rescan from scratch, using > a groupagg, but that obviously sucks. I don't think we do. To take yours to the 100% solution might just take adding the memory accounting to palloc that Jeff proposed a few years ago and use that accounting to decide when we should switch method. However, I don't quite fully recall how the patch accounted for memory consumed by sub-contexts and if getting the entire consumption required recursively looking at subcontexts. If that's the case then checking the consumption would likely cost too much if it was done after each tuple was aggregated. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 02:49 PM, Andres Freund wrote: Hi, On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote: My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. Yea, that's the case I was thinking of where going to sorting will very likely have better performance. I think it'd even be sensible to have a "skew tuple" like optimization. When we detect that we're getting close to memory exhaustion, move new groups to the tuplesort, but already hashed tuples stay in the hashtable. That'd still need tuples being moved to the sort in the cases where the transition values get too big (say array_agg), but that should be comparatively rare. I'm sure we could do better in selecting the hash-tabled values than just taking the first incoming ones, but that shouldn't be too bad. Not sure. I'd imagine the "compression" due to aggregating many tuples into a much smaller amount of memory would be a clear win, so I don't find the "let's just dump all input tuples into a file" approach very attractive. I think evicting a fraction of the aggregate state (say, ~25%) would work better - choosing the "oldest" items seems OK, as those likely represent many input tuples (thus having a high "compaction" ratio). Or we could evict items representing rare groups, to make space for the common ones. Perhaps a combined eviction strategy (10% of each type) would work well. I'm conveniently ignoring the fact that we don't have the information to determine this at the moment, of course. I'm sure it's possible to make better decisions based on some cost estimates, both at plan time and then during execution. That being said, I don't want to over-think / over-engineer this. Getting something that reduces the risk of OOM in the first step is a good enough improvement. If we can do something better next, great. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
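To make the eviction idea concrete, here is a deliberately simplified sketch of evicting the oldest quarter of the groups; the structures are toys invented for illustration, not anything from the executor.

#include <stdlib.h>

typedef struct Group
{
    unsigned long last_touched; /* advance-time stamp; higher means more recent */
    int         evicted;        /* set once the state has been spilled */
    /* ... the aggregate transition state would live here ... */
} Group;

static int
cmp_by_age(const void *a, const void *b)
{
    const Group *ga = *(const Group *const *) a;
    const Group *gb = *(const Group *const *) b;

    if (ga->last_touched < gb->last_touched)
        return -1;
    if (ga->last_touched > gb->last_touched)
        return 1;
    return 0;
}

/*
 * Evict the oldest quarter of the groups.  A real implementation would
 * serialize each evicted group's transition state to a spill file and
 * remove it from the hash table; here we only mark it.
 */
static void
evict_oldest_quarter(Group **groups, size_t ngroups)
{
    size_t      to_evict = ngroups / 4;

    qsort(groups, ngroups, sizeof(Group *), cmp_by_age);
    for (size_t i = 0; i < to_evict; i++)
        groups[i]->evicted = 1;
}

The "oldest first" heuristic rests on the assumption spelled out above: groups that have been around the longest have typically absorbed the most input rows, so spilling them wastes the least of the compaction already achieved.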
Re: commitfest 2018-07
On 2018-06-04 23:32:18 -0400, Tom Lane wrote: > Michael Paquier writes: > > On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: > >> There were some discussions about renaming the existing 2018-09 entry > >> versus inserting a new one at -07 and requiring patches to be moved back > >> explicitly. > > > I would do that to reduce unnecessary log noise, but I was unsure of the > > actual status we are at. I am pretty sure that nobody is going to > > complain if what they submitted gets looked up two months earlier than > > what was previously planned, so I would vote to rename the existing > > 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to > > create three new CF entries. > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > up postponing some entries back to -09, but that seems like less churn > than the other way. I'd rather create a new 2018-07, and just manually move old patches to it. Otherwise we'll not really focus on the glut of old things, but everyone just restarts working on their own new thing. Greetings, Andres Freund
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-06 01:06:39 +1200, David Rowley wrote: > On 6 June 2018 at 00:57, Andres Freund wrote: > > I think it's ok to only handle this gracefully if serialization is > > supported. > > > > But I think my proposal to continue use a hashtable for the already > > known groups, and sorting for additional groups would largely address > > that largely, right? We couldn't deal with groups becoming too large, > > but easily with the number of groups becoming too large. > > My concern is that only accounting memory for the group and not the > state is only solving half the problem. It might be fine for > aggregates that don't stray far from their aggtransspace, but for the > other ones, we could still see OOM. > If solving the problem completely is too hard, then a half fix (maybe > 3/4) is better than nothing, but if we can get a design for a full fix > before too much work is done, then isn't that better? I don't think we actually disagree. I was really primarily talking about the case where we can't really do better because we don't have serialization support. I mean we could just rescan from scratch, using a groupagg, but that obviously sucks. Greetings, Andres Freund
Re: commitfest 2018-07
Greetings, * Ashutosh Bapat (ashutosh.ba...@enterprisedb.com) wrote: > On Tue, Jun 5, 2018 at 9:02 AM, Tom Lane wrote: > > Michael Paquier writes: > >> On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: > >>> There were some discussions about renaming the existing 2018-09 entry > >>> versus inserting a new one at -07 and requiring patches to be moved back > >>> explicitly. > > > >> I would do that to reduce unnecessary log noise, but I was unsure of the > >> actual status we are at. I am pretty sure that nobody is going to > >> complain if what they submitted gets looked up two months earlier than > >> what was previously planned, so I would vote to rename the existing > >> 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to > >> create three new CF entries. > > > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > > up postponing some entries back to -09, but that seems like less churn > > than the other way. > > Notes at [1] about keeping this commitfest for small patches. Just > renaming the commitfest would mean all the patches, big and small, can > be reviewed and committed. "Yes and no." While there were concerns raised that larger patches committed earlier might cause back-patching issues, the general consensus from my read of it was that it wasn't likely to be that big of an issue. While I suspect we'll still generally focus on getting smaller changes in during the 2018-07 cycle, to try and "clear the way" for the larger patches, I'm no longer concerned about larger patches which are ready being committed during that cycle. As such, I'm also +1 on the proposal to rename 2018-09 to 2018-07, and make the other renames and add a new one to the end. > [1] https://wiki.postgresql.org/wiki/PgCon_2018_Developer_Meeting Thanks! Stephen signature.asc Description: PGP signature
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 00:57, Andres Freund wrote: > On 2018-06-06 00:53:42 +1200, David Rowley wrote: >> On 6 June 2018 at 00:45, Andres Freund wrote: >> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: >> >> I wonder if an aggregate might use a custom context >> >> internally (I don't recall anything like that). The accounting capability >> >> seems potentially useful for other places, and those might not use >> >> AllocSet >> >> (or at least not directly). >> > >> > Yea, that seems like a big issue. >> >> Unfortunately, at least one of the built-in ones do. See initArrayResultArr. > > I think it's ok to only handle this gracefully if serialization is > supported. > > But I think my proposal to continue use a hashtable for the already > known groups, and sorting for additional groups would largely address > that largely, right? We couldn't deal with groups becoming too large, > but easily with the number of groups becoming too large. My concern is that only accounting memory for the group and not the state is only solving half the problem. It might be fine for aggregates that don't stray far from their aggtransspace, but for the other ones, we could still see OOM. If solving the problem completely is too hard, then a half fix (maybe 3/4) is better than nothing, but if we can get a design for a full fix before too much work is done, then isn't that better? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote: > On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik > wrote: > > Workload is combination of inserts and selects. > > Looks like shared locks obtained by select cause starvation of inserts, > > trying to get exclusive relation extension lock. > > The problem is fixed by fair lwlock patch, implemented by Alexander > > Korotkov. This patch prevents granting of shared lock if wait queue is not > > empty. > > May be we should use this patch or find some other way to prevent > > starvation of writers on relation extension locks for such workloads. > > Fair lwlock patch really fixed starvation of exclusive lwlock waiters. > But that starvation happens not on relation extension lock – selects > don't get shared relation extension lock. The real issue there was > not relation extension lock itself, but the time spent inside this > lock. Yea, that makes a lot more sense to me. > It appears that buffer replacement happening inside relation > extension lock is affected by starvation on exclusive buffer mapping > lwlocks and buffer content lwlocks, caused by many concurrent shared > lockers. So, fair lwlock patch have no direct influence to relation > extension lock, which is naturally not even lwlock... Yea, that makes sense. I wonder how much the fix here is to "pre-clear" a victim buffer, and how much is a saner buffer replacement implementation (either by going away from O(NBuffers), or by having a queue of clean victim buffers like my bgwriter replacement). > I'll post fair lwlock path in a separate thread. It requires detailed > consideration and benchmarking, because there is a risk of regression > on specific workloads. I bet that doing it naively will regress massively in a number of cases. Greetings, Andres Freund
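To make the fairness policy described above concrete, here is a self-contained pthread sketch of a shared/exclusive lock that refuses new shared acquisitions while an exclusive waiter is queued. It only illustrates the policy; it is not the proposed LWLock change.

#include <pthread.h>

typedef struct FairRWLock
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int         shared_holders;
    int         exclusive_waiters;
    int         exclusive_held;
} FairRWLock;

#define FAIR_RWLOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

void
fair_lock_shared(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    /* Do not admit new shared holders while an exclusive waiter is queued,
     * so a continuous stream of readers cannot starve a writer. */
    while (l->exclusive_held || l->exclusive_waiters > 0)
        pthread_cond_wait(&l->cond, &l->mutex);
    l->shared_holders++;
    pthread_mutex_unlock(&l->mutex);
}

void
fair_unlock_shared(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    if (--l->shared_holders == 0)
        pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

void
fair_lock_exclusive(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->exclusive_waiters++;
    while (l->exclusive_held || l->shared_holders > 0)
        pthread_cond_wait(&l->cond, &l->mutex);
    l->exclusive_waiters--;
    l->exclusive_held = 1;
    pthread_mutex_unlock(&l->mutex);
}

void
fair_unlock_exclusive(FairRWLock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->exclusive_held = 0;
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->mutex);
}

The obvious trade-off is that read-mostly workloads lose some shared-lock throughput, which is exactly the regression risk on specific workloads mentioned above.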
Re: plans for PostgreSQL 12
Hi, On 2018-06-05 06:32:31 +0200, Pavel Stehule wrote: > ./configure --with-libxml --enable-tap-tests --enable-debug --with-perl > CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" > > [pavel@nemesis postgresql]$ gcc --version > gcc (GCC) 8.1.1 20180502 (Red Hat 8.1.1-1) > > I executed a simple script > > do $$ declare i bigint = 1; s bigint = 0; begin while i <= 1 loop > s := s + i; i := i + 1; end loop; raise notice '%', s; end $$; > >7,68% postmaster postgres [.] > GetSnapshotData >7,53% postmaster plpgsql.so [.] > exec_eval_simple_expr >6,49% postmaster postgres [.] It seems to me the right fix here isn't a new class of functions, but rather support for delaying the computation of the snapshot to the point it's needed. That'll be far more generically applicable and doesn't require user interaction. > ExecInterpExpr >4,13% postmaster postgres [.] So we're going to need to optimize this further as well, I've a pending patch for that, luckily ;) > LWLockRelease >4,12% postmaster postgres [.] That's also GetSnapshotData()... Greetings, Andres Freund
Re: Spilling hashed SetOps and aggregates to disk
On 2018-06-06 00:53:42 +1200, David Rowley wrote: > On 6 June 2018 at 00:45, Andres Freund wrote: > > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: > >> I wonder if an aggregate might use a custom context > >> internally (I don't recall anything like that). The accounting capability > >> seems potentially useful for other places, and those might not use AllocSet > >> (or at least not directly). > > > > Yea, that seems like a big issue. > > Unfortunately, at least one of the built-in ones does. See initArrayResultArr. I think it's ok to only handle this gracefully if serialization is supported. But I think my proposal to continue using a hashtable for the already known groups, and sorting for additional groups, would largely address that, right? We couldn't deal with groups becoming too large, but easily with the number of groups becoming too large. Greetings, Andres Freund
Re: why partition pruning doesn't work?
> On 5 June 2018 at 12:31, Amit Langote wrote: > > doesn't look quite right. What says expr is really a Param? The patch > appears to work because, by setting pinfo->execparams to *something*, it > triggers execution-time pruning to run; its contents aren't necessarily > used during execution pruning. In fact, it would've crashed if the > execution-time pruning code had required execparams to contain *valid* > param id, but currently it doesn't. > > What I think we'd need to do to make this work is to make execution-time > pruning be invoked even if there aren't any Params involved. IOW, let's > try to teach make_partition_pruneinfo that it can go ahead also in the > cases where there are expressions being compared with the partition key > that contain (only) stable functions. Then, go and fix the > execution-pruning code to not *always* expect there to be Params to prune > with. Yeah, I agree - I copied this approach mindlessly from the original hacky patch. So, it looks like it's necessary to have something like got_stable_expr together with gotparam. And after that, the only place where I see Params in use is partkey_datum_from_expr, where all the stuff is actually evaluated. So apparently this part about "fix the execution-pruning code to not *always* expect there to be Params to prune with" will only be about this function - am I correct, or is there something else that I missed?
Re: Spilling hashed SetOps and aggregates to disk
On 6 June 2018 at 00:45, Andres Freund wrote: > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: >> I wonder if an aggregate might use a custom context >> internally (I don't recall anything like that). The accounting capability >> seems potentially useful for other places, and those might not use AllocSet >> (or at least not directly). > > Yea, that seems like a big issue. Unfortunately, at least one of the built-in ones does. See initArrayResultArr. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote: > My concern is more about what happens when the input tuple ordering is > inherently incompatible with the eviction strategy, greatly increasing the > amount of data written to disk during evictions. > > Say for example that we can fit 1000 groups into work_mem, and that there > are 2000 groups in the input data set. If the input is correlated with the > groups, everything is peachy because we'll evict the first batch, and then > group the remaining 1000 groups (or something like that). But if the input > data is random (which can easily happen, e.g. for IP addresses, UUIDs and > such) we'll hit the limit repeatedly and will evict much sooner. > I know you suggested simply dumping the whole hash table and starting from > scratch while we talked about this at pgcon, but ISTM it has exactly this > issue. Yea, that's the case I was thinking of where going to sorting will very likely have better performance. I think it'd even be sensible to have a "skew tuple"-like optimization. When we detect that we're getting close to memory exhaustion, move new groups to the tuplesort, but already hashed tuples stay in the hashtable. That'd still need tuples being moved to the sort in the cases where the transition values get too big (say array_agg), but that should be comparatively rare. I'm sure we could do better in selecting the hash-tabled values than just taking the first incoming ones, but that shouldn't be too bad. Greetings, Andres Freund
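As a standalone illustration of the strategy sketched above: already-seen groups stay in a bounded in-memory table, while tuples for groups that no longer fit are spilled, sorted, and aggregated as runs afterwards. Integer keys and count(*) stand in for real grouping columns and transition values; none of this is nodeAgg.c, and all names are invented.

#include <stdio.h>
#include <stdlib.h>

#define MAX_HASH_GROUPS 4        /* stand-in for the work_mem limit */

typedef struct Group { int key; long count; } Group;

/* trivial fixed-capacity table standing in for the aggregate hash table */
static Group hash_groups[MAX_HASH_GROUPS];
static int  nhash_groups = 0;

static int *spill = NULL;
static int  nspill = 0, spill_cap = 0;

static Group *
lookup_group(int key)
{
    for (int i = 0; i < nhash_groups; i++)
        if (hash_groups[i].key == key)
            return &hash_groups[i];
    if (nhash_groups < MAX_HASH_GROUPS)
    {
        Group *g = &hash_groups[nhash_groups++];

        g->key = key;
        g->count = 0;
        return g;
    }
    return NULL;                 /* no room for a new group: caller must spill */
}

static void
spill_tuple(int key)
{
    if (nspill == spill_cap)
    {
        spill_cap = spill_cap ? spill_cap * 2 : 64;
        spill = realloc(spill, spill_cap * sizeof(int));
    }
    spill[nspill++] = key;
}

static int
cmp_int(const void *a, const void *b)
{
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

int
main(void)
{
    int input[] = {7, 3, 7, 9, 1, 5, 3, 8, 8, 5, 9, 2, 7, 2};
    int ninput = sizeof(input) / sizeof(input[0]);

    for (int i = 0; i < ninput; i++)
    {
        Group *g = lookup_group(input[i]);

        if (g)
            g->count++;             /* known (or still-fitting) group: stay hashed */
        else
            spill_tuple(input[i]);  /* new group past the limit: spill it */
    }

    for (int i = 0; i < nhash_groups; i++)
        printf("group %d: %ld (hashed)\n", hash_groups[i].key, hash_groups[i].count);

    /* sort the spilled tuples and aggregate equal keys as runs */
    qsort(spill, nspill, sizeof(int), cmp_int);
    for (int i = 0; i < nspill;)
    {
        int  key = spill[i];
        long count = 0;

        while (i < nspill && spill[i] == key)
        {
            count++;
            i++;
        }
        printf("group %d: %ld (spilled)\n", key, count);
    }
    free(spill);
    return 0;
}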
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote: > But I don't think we've considered copying the whole AllocSet. Maybe that > would work, not sure. Don't think you'd need to fully copy it - you can just have a "wrapper" memory context implementation that does the accounting and then hands the actual work to AllocSet. That limits some things it can account for, but I don't see those being really relevant. > I wonder if an aggregate might use a custom context > internally (I don't recall anything like that). The accounting capability > seems potentially useful for other places, and those might not use AllocSet > (or at least not directly). Yea, that seems like a big issue. Greetings, Andres Freund
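The wrapper idea reduces to "account, then delegate". A minimal standalone sketch, with malloc standing in for AllocSet and invented names throughout:

#include <stdio.h>
#include <stdlib.h>

typedef struct AccountingContext
{
    size_t  total_allocated;        /* bytes currently accounted for */
} AccountingContext;

/* Each chunk remembers its own size so the free path can subtract it again. */
static void *
acct_alloc(AccountingContext *ctx, size_t size)
{
    size_t *chunk = malloc(sizeof(size_t) + size);

    if (chunk == NULL)
        return NULL;
    *chunk = size;
    ctx->total_allocated += size;   /* the accounting part */
    return chunk + 1;               /* the delegation part: malloc did the real work */
}

static void
acct_free(AccountingContext *ctx, void *ptr)
{
    size_t *chunk = (size_t *) ptr - 1;

    ctx->total_allocated -= *chunk;
    free(chunk);
}

int
main(void)
{
    AccountingContext ctx = {0};
    void   *a = acct_alloc(&ctx, 128);
    void   *b = acct_alloc(&ctx, 1024);

    printf("accounted: %zu bytes\n", ctx.total_allocated);  /* 1152 */
    acct_free(&ctx, a);
    printf("accounted: %zu bytes\n", ctx.total_allocated);  /* 1024 */
    acct_free(&ctx, b);
    return 0;
}

A real implementation would presumably keep the counter in the memory context itself and forward the remaining callbacks to AllocSet, which is also the limitation mentioned above: the wrapper can only account for what passes through it.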
Re: Spilling hashed SetOps and aggregates to disk
Hi, On 2018-06-04 22:18:56 -0700, Jeff Davis wrote: > On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote: > > I wonder whether, at least for aggregates, the better fix wouldn't be > > to > > switch to feeding the tuples into tuplesort upon memory exhaustion > > and > > doing a sort based aggregate. We have most of the infrastructure to > > do > > That's an interesting idea, but it seems simpler to stick to hashing > rather than using a combination strategy. It also seems like it would > take less CPU effort. Isn't the locality of access going to considerably better with the sort based approach? > What advantages do you have in mind? My patch partitions the spilled > data, so it should have similar disk costs as a sort approach. I think one part of it is that I think the amount of code is going to be lower - we essentially have already all the code to handle sort based aggs, and to have both sort and hash based aggs in the same query. We'd mostly need a way to scan the hashtable and stuff it into a tuplesort, that's not hard. nodeAgg.c is already more than complex enough, I'm not sure that full blown partitioning is worth the cost. Greetings, Andres Freund
Re: commitfest 2018-07
On Tue, Jun 5, 2018 at 9:02 AM, Tom Lane wrote: > Michael Paquier writes: >> On Mon, Jun 04, 2018 at 07:16:33PM -0400, Peter Eisentraut wrote: >>> There were some discussions about renaming the existing 2018-09 entry >>> versus inserting a new one at -07 and requiring patches to be moved back >>> explicitly. > >> I would do that to reduce unnecessary log noise, but I was unsure of the >> actual status we are at. I am pretty sure that nobody is going to >> complain if what they submitted gets looked up two months earlier than >> what was previously planned, so I would vote to rename the existing >> 2018-09 to 2018-07, to rename the existing 2018-11 to 2018-09, and to >> create three new CF entries. > > +1 for just renaming 2018-09 to 2018-07, if we can do that. We'll end > up postponing some entries back to -09, but that seems like less churn > than the other way. > Notes at [1] about keeping this commitfest for small patches. Just renaming the commitfest would mean all the patches, big and small, can be reviewed and committed. [1] https://wiki.postgresql.org/wiki/PgCon_2018_Developer_Meeting -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: \d t: ERROR: XX000: cache lookup failed for relation
Ah, I think this is the missing, essential component: CREATE INDEX ON t(right(i::text,1)) WHERE i::text LIKE '%1'; Finally, I can reproduce it with the attached script. INSERT 0 99 <- first insertion ERROR: cache lookup failed for relation 1032219 ALTER TABLE ERROR: cache lookup failed for relation 1033478 ALTER TABLE ERROR: cache lookup failed for relation 1034073 ALTER TABLE ERROR: cache lookup failed for relation 1034650 ALTER TABLE ERROR: cache lookup failed for relation 1035238 ALTER TABLE ERROR: cache lookup failed for relation 1035837 I will investigate. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 05.06.2018 13:29, Masahiko Sawada wrote: On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik wrote: On 05.06.2018 07:22, Masahiko Sawada wrote: On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik wrote: On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada wrote: Never mind. There was a lot of items especially at the last CommitFest. In terms of moving forward, I'd still like to hear what Andres has to say about the comments I made on March 1st. Yeah, agreed. $ ping -n andres.freund Request timeout for icmp_seq 0 Request timeout for icmp_seq 1 Request timeout for icmp_seq 2 Request timeout for icmp_seq 3 Request timeout for icmp_seq 4 ^C --- andres.freund ping statistics --- 6 packets transmitted, 0 packets received, 100.0% packet loss Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru shows that this patch has some benefits for other cases, which is a point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () #4 0x004ba99d in RelationGetBufferForTuple () #5 0x004b3f18 in heap_insert () #6 0x006109c8 in ExecInsert () #7 0x00611a49 in ExecModifyTable () #8 0x005ef97a in standard_ExecutorRun () #9 0x0072440a in ProcessQuery () #10 0x00724631 in PortalRunMulti () #11 0x007250ec in PortalRun () #12 0x00721287 in exec_simple_query () #13 0x00722532 in PostgresMain () #14 0x0047a9eb in ServerLoop () #15 0x006b9fe9 in PostmasterMain () #16 0x0047b431 in main () Obviously there is nothing surprising here: if a lot of processes try to acquire the same exclusive lock, then high contention is expected. I just want to notice that this patch is not able to completely eliminate the problem with large number of concurrent inserts to the same table. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. Thank you for reporting. 
Regarding the second problem, I tried to reproduce that bug with latest version patch (v13) but could not. When transaction aborts, we call ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() and clear either relext lock bits we are holding or waiting. If we raise an error after we added a relext lock bit but before we increment its holding count then the relext lock is remained, but I couldn't see the code raises an error between them. Could you please share the concrete reproduction Sorry, my original guess that LW-locks are not released in case of transaction abort is not correct. There was really situation when all backends were blocked in relation extension lock and looks like it happens after disk write error, You're saying that it is not correct that LWlock are not released but it's correct that all backends were blocked in relext lock, but in your other mail you're saying the opposite. Which is correct? I am sorry for the confusion. I have not investigated the core files myself and am just sharing information received from our engineer. It looks like this problem may not be related to relation extension locks at all. Sorry for the false alarm.
AtEOXact_ApplyLauncher() and subtransactions
Hi, When a SUBSCRIPTION is altered, then the currently running table-synchronization workers that are no longer needed for the altered subscription, are terminated. This is done by the function AtEOXact_ApplyLauncher() inside CommitTransaction(). So during each ALTER-SUBSCRIPTION command, the on_commit_stop_workers list is appended with new set of workers to be stopped. And then at commit, AtEOXact_ApplyLauncher() stops the workers in the list. But there is no handling for sub-transaction abort. Consider this : -- On secondary, create a subscription that will initiate the table sync CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres host=localhost user=rep password=Password port=5432' PUBLICATION mypub; -- While the sync is going on, change the publication of the -- subscription in a subtransaction, so that it adds up the synchronization -- workers for the earlier publication into the 'on_commit_stop_workers' list select pg_sleep(1); begin; savepoint a; ALTER SUBSCRIPTION mysub SET PUBLICATION mypub_2; rollback to a; -- Commit will stop the above sync. commit; So the ALTER-SUBSCRIPTION has been rolled back. But the on_commit_stop_workers is not reverted back to its earlier state. And then at commit, the workers will be killed unnecessarily. I have observed that when the workers are killed, they are restarted. But consider the case where the original subscription was synchronizing hundreds of tables. And just when it is about to finish, someone changes the subscription inside a subtransaction and rolls it back, and then commits the transaction. The whole synchronization gets re-started from scratch. Below log snippets show this behaviour : < CREATE-SUBSCRIPTION command spawns workers > 2018-06-05 15:04:07.841 IST [39951] LOG: logical replication apply worker for subscription "mysub" has started 2018-06-05 15:04:07.848 IST [39953] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < After some time, ALTER-SUBSCRIPTION rollbacks inside subtransaction, and commits transaction. Workers are killed > 2018-06-05 15:04:32.903 IST [39953] FATAL: terminating logical replication worker due to administrator command 2018-06-05 15:04:32.903 IST [39953] CONTEXT: COPY article, line 37116293 2018-06-05 15:04:32.904 IST [39666] LOG: background worker "logical replication worker" (PID 39953) exited with exit code 1 < Workers are restarted > 2018-06-05 15:04:32.909 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has started < Synchronization done after some time > 2018-06-05 15:05:10.042 IST [40003] LOG: logical replication table synchronization worker for subscription "mysub", table "article" has finished --- To reproduce the issue : 1. On master server, create the tables and publications using attached setup_master.sql. 2. On secondary server, run the attached run_on_secondary.sql. This will reproduce the issue as can be seen from the log output. --- Fix : I haven't written a patch for it, but I think we should have a separate on_commit_stop_workers for each subtransaction. At subtransaction commit, we replace the on_commit_stop_workers list of the parent subtransaction with the one from the committed subtransaction; and on abort, discard the list of the current subtransaction. So have a stack of the lists. 
--- Furthermore, before fixing this, we may have to fix another issue which, I suspect, would surface even without subtransactions, as explained below : Suppose publication pubx is set for tables tx1 and and tx2. And publication puby is for tables ty1 and ty2. Subscription mysub is set to synchronise tables tx1 and tx2 : CREATE SUBSCRIPTION mysub ... PUBLICATION pubx; Now suppose the subscription is altered to synchronise ty1 and ty2, and then again altered back to synchronise tx1 and tx2 in the same transaction. begin; ALTER SUBSCRIPTION mysub set publication puby; ALTER SUBSCRIPTION mysub set publication pubx; commit; Here, workers for tx1 and tx2 are added to on_commit_stop_workers after the publication is set to puby. And then workers for ty1 and ty2 are further added to that list after the 2nd ALTER command where publication is set to pubx, because the earlier ALTER command has already changed the catalogs to denote that ty1 and ty2 are being synchronised. Effectively, at commit, all the workers are targetted to be stopped, when actually at commit time there is no final change in the tables to be synchronised. What actually we should do is : for each ALTER command we should compare the very first set of tables being synchronised since the last committed ALTER command, rather than checking the catalogs. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company run_on_secondary.sql Description: Binary data setup_master.sql Description: Binary data
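The proposed fix (one pending-stop list per subtransaction level, merged into the parent on subcommit and discarded on abort) can be modeled with a toy stack. Workers are plain ints here and every name is invented; this is not the launcher code.

#include <stdio.h>

#define MAX_NEST    16
#define MAX_WORKERS 32

typedef struct WorkerList
{
    int workers[MAX_WORKERS];
    int n;
} WorkerList;

static WorkerList stop_workers[MAX_NEST];   /* one list per nesting level */
static int  nest_level = 0;                 /* 0 = top-level transaction */

static void
remember_worker_to_stop(int worker)
{
    WorkerList *l = &stop_workers[nest_level];

    l->workers[l->n++] = worker;
}

static void
at_subxact_start(void)
{
    stop_workers[++nest_level].n = 0;
}

static void
at_subxact_commit(void)
{
    WorkerList *child = &stop_workers[nest_level];
    WorkerList *parent = &stop_workers[nest_level - 1];

    /* the child's pending stops survive: push them up to the parent */
    for (int i = 0; i < child->n; i++)
        parent->workers[parent->n++] = child->workers[i];
    nest_level--;
}

static void
at_subxact_abort(void)
{
    /* a rolled-back ALTER SUBSCRIPTION: forget its pending stops */
    stop_workers[nest_level--].n = 0;
}

static void
at_xact_commit(void)
{
    WorkerList *l = &stop_workers[0];

    for (int i = 0; i < l->n; i++)
        printf("stopping worker %d\n", l->workers[i]);
    l->n = 0;
}

int
main(void)
{
    at_subxact_start();             /* savepoint a */
    remember_worker_to_stop(101);   /* ALTER SUBSCRIPTION ... SET PUBLICATION */
    at_subxact_abort();             /* rollback to a */
    at_xact_commit();               /* commit: nothing gets stopped */
    return 0;
}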
Re: pg_replication_slot_advance to return NULL instead of 0/0 if slot not advanced
On 05/06/18 06:28, Michael Paquier wrote: > On Mon, Jun 04, 2018 at 11:51:35AM +0200, Petr Jelinek wrote: >> On 01/06/18 21:13, Michael Paquier wrote: >>> -startlsn = MyReplicationSlot->data.confirmed_flush; >>> +if (OidIsValid(MyReplicationSlot->data.database)) >>> +startlsn = MyReplicationSlot->data.confirmed_flush; >>> +else >>> +startlsn = MyReplicationSlot->data.restart_lsn; >>> + >>> if (moveto < startlsn) >>> { >>> ReplicationSlotRelease(); >> >> This part looks correct for checking that we are not moving >> backwards. However, there is another existing issue with this code >> which >> is that we are later using the confirmed_flush (via startlsn) as start >> point of logical decoding (XLogReadRecord parameter in >> pg_logical_replication_slot_advance) which is not correct. The >> restart_lsn should be used for that. I think it would make sense to >> fix >> that as part of this patch as well. > > I am not sure I understand what you are coming at here. Could you > explain your point a bit more please as 9c7d06d is yours? As I said, it's an existing issue. I just had a chance to reread the code when looking at your patch. > When creating > the decoding context, all other code paths use the confirmed LSN as a > start point if not explicitly defined by the caller of > CreateDecodingContext, as it points to the last LSN where a transaction > has been committed and decoded. I didn't say anything about CreateDecodingContext though. I am talking about the fact that we then use the same variable as input to XLogReadRecord later in the logical slot code path. The whole point of having restart_lsn for a logical slot is to have a correct start point for XLogReadRecord, so using the confirmed_flush there is wrong (and has been wrong since the original commit). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: why partition pruning doesn't work?
Hi Dmitry, Thanks for creating the patch. I looked at it and have some comments. On 2018/06/04 22:30, Dmitry Dolgov wrote: >> On 3 June 2018 at 19:11, Tom Lane wrote: >> Dmitry Dolgov <9erthali...@gmail.com> writes: >>> Just to clarify for myself, for evaluating any stable function here would >>> it be >>> enough to handle all function-like expressions (FuncExpr / OpExpr / >>> DistinctExpr / NullIfExpr) and check a corresponding function for >>> provolatile, >>> like in the attached patch? >> >> I think the entire approach is wrong here. Rather than concerning >> yourself with Params, or any other specific expression type, you >> should be using !contain_volatile_functions() to decide whether >> an expression is run-time-constant. If it is, use the regular >> expression evaluation machinery to extract the value. > > Yes, it makes sense. Then, to my understanding, the attached code is close to > what was described above (although I'm not sure about the Const part). This: @@ -2727,6 +2730,13 @@ pull_partkey_params(PartitionPruneInfo *pinfo, List *steps) } gotone = true; } +else if (!IsA(expr, Const)) +{ +Param *param = (Param *) expr; +pinfo->execparams = bms_add_member(pinfo->execparams, + param->paramid); +gotone = true; +} doesn't look quite right. What says expr is really a Param? The patch appears to work because, by setting pinfo->execparams to *something*, it triggers execution-time pruning to run; its contents aren't necessarily used during execution pruning. In fact, it would've crashed if the execution-time pruning code had required execparams to contain *valid* param id, but currently it doesn't. What I think we'd need to do to make this work is to make execution-time pruning be invoked even if there aren't any Params involved. IOW, let's try to teach make_partition_pruneinfo that it can go ahead also in the cases where there are expressions being compared with the partition key that contain (only) stable functions. Then, go and fix the execution-pruning code to not *always* expect there to be Params to prune with. Maybe, David (added to cc) has something to say about all this, especially whether he considers this a feature and not a bug fix. Thanks, Amit
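The rule Tom suggests, and that the discussion converges on, is essentially "prunable if the comparison value contains no volatile functions, Param or not". A toy standalone version of that check, with a made-up expression tree in place of the real Node structures:

#include <stdbool.h>
#include <stdio.h>

typedef enum NodeKind { T_CONST, T_PARAM, T_FUNCEXPR } NodeKind;

typedef struct Expr
{
    NodeKind    kind;
    bool        funcvolatile;       /* only meaningful for T_FUNCEXPR */
    struct Expr *args[2];           /* up to two children, NULL if unused */
} Expr;

/* toy counterpart of contain_volatile_functions(): recurse over the tree */
static bool
contains_volatile(const Expr *e)
{
    if (e == NULL)
        return false;
    if (e->kind == T_FUNCEXPR && e->funcvolatile)
        return true;
    return contains_volatile(e->args[0]) || contains_volatile(e->args[1]);
}

int
main(void)
{
    Expr    c = {T_CONST, false, {NULL, NULL}};
    Expr    stable_fn = {T_FUNCEXPR, false, {&c, NULL}};    /* e.g. now(), stable */
    Expr    volatile_fn = {T_FUNCEXPR, true, {&c, NULL}};   /* e.g. random(), volatile */

    printf("stable expr prunable at execution time: %s\n",
           contains_volatile(&stable_fn) ? "no" : "yes");
    printf("volatile expr prunable at execution time: %s\n",
           contains_volatile(&volatile_fn) ? "no" : "yes");
    return 0;
}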
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik wrote: > > > On 05.06.2018 07:22, Masahiko Sawada wrote: >> >> On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik >> wrote: >>> >>> >>> On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: > > On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada > > wrote: >> >> Never mind. There was a lot of items especially at the last >> CommitFest. >> >>> In terms of moving forward, I'd still like to hear what >>> Andres has to say about the comments I made on March 1st. >> >> Yeah, agreed. > > $ ping -n andres.freund > Request timeout for icmp_seq 0 > Request timeout for icmp_seq 1 > Request timeout for icmp_seq 2 > Request timeout for icmp_seq 3 > Request timeout for icmp_seq 4 > ^C > --- andres.freund ping statistics --- > 6 packets transmitted, 0 packets received, 100.0% packet loss > > Meanwhile, > > https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru > shows that this patch has some benefits for other cases, which is a > point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center >>> We in PostgresProc were faced with lock extension contention problem at >>> two >>> more customers and tried to use this patch (v13) to address this issue. >>> Unfortunately replacing heavy lock with lwlock couldn't completely >>> eliminate >>> contention, now most of backends are blocked on conditional variable: >>> >>> 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 >>> #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 >>> #1 0x007024ee in WaitEventSetWait () >>> #2 0x00718fa6 in ConditionVariableSleep () >>> #3 0x0071954d in RelExtLockAcquire () >>> #4 0x004ba99d in RelationGetBufferForTuple () >>> #5 0x004b3f18 in heap_insert () >>> #6 0x006109c8 in ExecInsert () >>> #7 0x00611a49 in ExecModifyTable () >>> #8 0x005ef97a in standard_ExecutorRun () >>> #9 0x0072440a in ProcessQuery () >>> #10 0x00724631 in PortalRunMulti () >>> #11 0x007250ec in PortalRun () >>> #12 0x00721287 in exec_simple_query () >>> #13 0x00722532 in PostgresMain () >>> #14 0x0047a9eb in ServerLoop () >>> #15 0x006b9fe9 in PostmasterMain () >>> #16 0x0047b431 in main () >>> >>> Obviously there is nothing surprising here: if a lot of processes try to >>> acquire the same exclusive lock, then high contention is expected. >>> I just want to notice that this patch is not able to completely eliminate >>> the problem with large number of concurrent inserts to the same table. 
>>> >>> Second problem we observed was even more critical: if backed is granted >>> relation extension lock and then got some error before releasing this >>> lock, >>> then abort of the current transaction doesn't release this lock (unlike >>> heavy weight lock) and the relation is kept locked. >>> So database is actually stalled and server has to be restarted. >>> >> Thank you for reporting. >> >> Regarding the second problem, I tried to reproduce that bug with >> latest version patch (v13) but could not. When transaction aborts, we >> call >> ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() >> and clear either relext lock bits we are holding or waiting. If we >> raise an error after we added a relext lock bit but before we >> increment its holding count then the relext lock is remained, but I >> couldn't see the code raises an error between them. Could you please >> share the concrete reproduction > > > Sorry, my original guess that LW-locks are not released in case of > transaction abort is not correct. > There was really situation when all backends were blocked in relation > extension lock and looks like it happens after disk write error, You're saying that it is not correct that LWlock are not released but it's correct
Re: [HACKERS] Transactions involving multiple postgres foreign servers
On Sat, May 26, 2018 at 12:25 AM, Robert Haas wrote: > On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada > wrote: >> Regarding to API design, should we use 2PC for a distributed >> transaction if both two or more 2PC-capable foreign servers and >> 2PC-non-capable foreign server are involved with it? Or should we end >> up with an error? the 2PC-non-capable server might be either that has >> 2PC functionality but just disables it or that doesn't have it. > > It seems to me that this is functionality that many people will not > want to use. First, doing a PREPARE and then a COMMIT for each FDW > write transaction is bound to be more expensive than just doing a > COMMIT. Second, because the default value of > max_prepared_transactions is 0, this can only work at all if special > configuration has been done on the remote side. Because of the second > point in particular, it seems to me that the default for this new > feature must be "off". It would make to ship a default configuration > of PostgreSQL that doesn't work with the default configuration of > postgres_fdw, and I do not think we want to change the default value > of max_prepared_transactions. It was changed from 5 to 0 a number of > years back for good reason. I'm not sure that many people will not want to use this feature because it seems to me that there are many people who don't want to use the database that is missing transaction atomicity. But I agree that this feature should not be enabled by default as we disable 2PC by default. > > So, I think the question could be broadened a bit: how you enable this > feature if you want it, and what happens if you want it but it's not > available for your choice of FDW? One possible enabling method is a > GUC (e.g. foreign_twophase_commit). It could be true/false, with true > meaning use PREPARE for all FDW writes and fail if that's not > supported, or it could be three-valued, like require/prefer/disable, > with require throwing an error if PREPARE support is not available and > prefer using PREPARE where available but without failing when it isn't > available. Another possibility could be to make it an FDW option, > possibly capable of being set at multiple levels (e.g. server or > foreign table). If any FDW involved in the transaction demands > distributed 2PC semantics then the whole transaction must have those > semantics or it fails. I was previous leaning toward the latter > approach, but I guess now the former approach is sounding better. I'm > not totally certain I know what's best here. > I agree that the former is better. That way, we also can control that parameter at transaction level. If we allow the 'prefer' behavior we need to manage not only 2PC-capable foreign server but also 2PC-non-capable foreign server. It requires all FDW to call the registration function. So I think two-values parameter would be better. BTW, sorry for late submitting the updated patch. I'll post the updated patch in this week but I'd like to share the new APIs design beforehand. APIs that I'd like to add are 4 functions and 1 registration function: PrepareForeignTransaction, CommitForeignTransaction, RollbackForeignTransaction, IsTwoPhaseCommitEnabled and FdwXactRegisterForeignServer. All FDWs that want to support atomic commit have to support all APIs and to call the registration function when foreign transaction opens. Transaction processing sequence with atomic commit will be like follows. 1. FDW begins a transaction on a 2PC-capable foreign server. 2. 
FDW registers the foreign server with/without a foreign transaction identifier by calling FdwXactRegisterForeignServer(). * The foreign transaction identifier passed in can be NULL. If it's NULL, the core code constructs it like 'fx_<4 random chars>__'. * Providing a foreign transaction identifier at the beginning of the transaction is useful because some DBMSs, such as Oracle Database or MySQL, require a transaction identifier at the beginning of an XA transaction. * Registering the foreign transaction guarantees that it is controlled by the core and that the APIs are called at an appropriate time. 3. Perform 1 and 2 whenever the distributed transaction opens a transaction on 2PC-capable foreign servers. * When the distributed transaction modifies a foreign server, we mark it as 'modified'. * This mark is used at commit to check if it's necessary to use 2PC. * At the same time, we also check if the foreign server enables 2PC by calling IsTwoPhaseCommitEnabled(). * If an FDW disables or doesn't provide that function, we set XACT_FLAGS_FDWNONPREPARE. This is necessary because we need to remember that we wrote to a 2PC-non-capable foreign server. * When the distributed transaction modifies a temp table locally, we set XACT_FLAGS_WROTENONTEMREL. * This is used at commit to check if it's necessary to use 2PC as well. 4. During pre-commit, we prepare all foreign transactions, if 2PC is required, by calling PrepareForeignTransaction(
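The commit sequence described above, reduced to a runnable toy: every modified 2PC-capable server is PREPAREd during pre-commit, and only afterwards are the prepared transactions committed. The callback names follow the mail; the signatures and data structures are invented for illustration.

#include <stdbool.h>
#include <stdio.h>

typedef struct ForeignServer
{
    const char *name;
    bool        two_phase;      /* what IsTwoPhaseCommitEnabled() would report */
    bool        modified;       /* did this transaction write to it? */
} ForeignServer;

static void
PrepareForeignTransaction(ForeignServer *s)
{
    printf("PREPARE TRANSACTION on %s\n", s->name);
}

static void
CommitForeignTransaction(ForeignServer *s, bool prepared)
{
    printf("%s on %s\n", prepared ? "COMMIT PREPARED" : "COMMIT", s->name);
}

int
main(void)
{
    /* servers registered via FdwXactRegisterForeignServer() during the xact */
    ForeignServer servers[] = {
        {"shard1", true, true},
        {"shard2", true, true},
        {"readonly", true, false},      /* not modified: a plain commit is enough */
    };
    int nservers = 3;

    /* pre-commit phase: prepare every modified 2PC-capable server */
    for (int i = 0; i < nservers; i++)
        if (servers[i].two_phase && servers[i].modified)
            PrepareForeignTransaction(&servers[i]);

    /* after the local commit: finish the foreign transactions */
    for (int i = 0; i < nservers; i++)
        CommitForeignTransaction(&servers[i],
                                 servers[i].two_phase && servers[i].modified);
    return 0;
}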
Re: Problem while updating a foreign table pointing to a partitioned table on foreign server
Hello. At Mon, 04 Jun 2018 20:58:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20180604.205828.208262556.horiguchi.kyot...@lab.ntt.co.jp> > It fails on some join-pushdown cases since it doesn't add tid > columns to join tlist. I suppose that build_tlist_to_deparse > needs something but I'll consider further tomorrow. I made it work with a few exceptions and bumped. PARAM_EXEC doesn't work at all in a case where Sort exists between ForeignUpdate and ForeignScan. = explain (verbose, costs off) update bar set f2 = f2 + 100 from ( select f1 from foo union all select f1+3 from foo ) ss where bar.f1 = ss.f1; QUERY PLAN - Update on public.bar Update on public.bar Foreign Update on public.bar2 Remote SQL: UPDATE public.loct2 SET f2 = $3 WHERE tableoid = $1 AND ctid = $2 ... -> Merge Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, (ROW(foo.f1)) Merge Cond: (bar2.f1 = foo.f1) -> Sort Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid Sort Key: bar2.f1 -> Foreign Scan on public.bar2 Output: bar2.f1, bar2.f2, bar2.f3, bar2.tableoid, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid, tableoid FROM public.loct2 FOR UPDATE = Even if this worked fine, it cannot be back-patched. We need an extra storage moves together with tuples or prevent sorts or something like from being inserted there. At Fri, 1 Jun 2018 10:21:39 -0400, Ashutosh Bapat wrote in > I am not suggesting to commit 0003 in my patch set, but just 0001 and > 0002 which just raise an error when multiple rows get updated when > only one row is expected to be updated. So I agree to commit the two at least in order to prevent doing wrong silently. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index d272719ff4..bff216f29d 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -1049,9 +1049,16 @@ deparseSelectSql(List *tlist, bool is_subquery, List **retrieved_attrs, * can use NoLock here. */ Relation rel = heap_open(rte->relid, NoLock); + Bitmapset *attrs = fpinfo->attrs_used; + + if (root->parse->commandType != CMD_UPDATE && + root->parse->commandType != CMD_DELETE) + attrs = bms_del_member(bms_copy(attrs), + TableOidAttributeNumber - + FirstLowInvalidHeapAttributeNumber); deparseTargetList(buf, rte, foreignrel->relid, rel, false, - fpinfo->attrs_used, false, retrieved_attrs); + attrs, false, retrieved_attrs); heap_close(rel, NoLock); } } @@ -1107,11 +1114,17 @@ deparseTargetList(StringInfo buf, bool qualify_col, List **retrieved_attrs) { + static int check_attrs[4]; + static char *check_attr_names[] = {"ctid", "oid", "tableoid"}; TupleDesc tupdesc = RelationGetDescr(rel); bool have_wholerow; bool first; int i; + check_attrs[0] = SelfItemPointerAttributeNumber; + check_attrs[1] = ObjectIdAttributeNumber; + check_attrs[2] = TableOidAttributeNumber; + check_attrs[3] = FirstLowInvalidHeapAttributeNumber; *retrieved_attrs = NIL; /* If there's a whole-row reference, we'll need all the columns. */ @@ -1143,13 +1156,16 @@ deparseTargetList(StringInfo buf, } } - /* - * Add ctid and oid if needed. We currently don't support retrieving any - * other system columns. - */ - if (bms_is_member(SelfItemPointerAttributeNumber - FirstLowInvalidHeapAttributeNumber, - attrs_used)) + for (i = 0 ; check_attrs[i] != FirstLowInvalidHeapAttributeNumber ; i++) { + int attr = check_attrs[i]; + char *attr_name = check_attr_names[i]; + + /* Add system columns if needed. 
*/ + if (!bms_is_member(attr - FirstLowInvalidHeapAttributeNumber, + attrs_used)) + continue; + if (!first) appendStringInfoString(buf, ", "); else if (is_returning) @@ -1158,26 +1174,9 @@ deparseTargetList(StringInfo buf, if (qualify_col) ADD_REL_QUALIFIER(buf, rtindex); - appendStringInfoString(buf, "ctid"); + appendStringInfoString(buf, attr_name); - *retrieved_attrs = lappend_int(*retrieved_attrs, - SelfItemPointerAttributeNumber); - } - if (bms_is_member(ObjectIdAttributeNumber - FirstLowInvalidHeapAttributeNumber, - attrs_used)) - { - if (!first) - appendStringInfoString(buf, ", "); - else if (is_returning) - appendStringInfoString(buf, " RETURNING "); - first = false; - - if (qualify_col) - ADD_REL_QUALIFIER(buf, rtindex); - appendStringInfoString(buf, "oid"); - - *retrieved_attrs = lappend_int(*retrieved_attrs, - ObjectIdAttributeNumber); + *retrieved_attrs = lappend_int(*retrieved_attrs, attr); } /* Don't generate bad syntax if no undropped columns */ @@ -1725,7 +1724,7 @@ deparseUpdateSql(StringInfo buf, Range
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik wrote: > Workload is combination of inserts and selects. > Looks like shared locks obtained by select cause starvation of inserts, > trying to get exclusive relation extension lock. > The problem is fixed by fair lwlock patch, implemented by Alexander Korotkov. > This patch prevents granting of shared lock if wait queue is not empty. > May be we should use this patch or find some other way to prevent starvation > of writers on relation extension locks for such workloads. The fair lwlock patch really did fix starvation of exclusive lwlock waiters. But that starvation doesn't happen on the relation extension lock – selects don't take the relation extension lock in shared mode. The real issue there was not the relation extension lock itself, but the time spent inside this lock. It appears that buffer replacement happening inside the relation extension lock is affected by starvation on exclusive buffer mapping lwlocks and buffer content lwlocks, caused by many concurrent shared lockers. So the fair lwlock patch has no direct influence on the relation extension lock, which naturally isn't even an lwlock... I'll post the fair lwlock patch in a separate thread. It requires detailed consideration and benchmarking, because there is a risk of regression on specific workloads. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 04.06.2018 21:42, Andres Freund wrote: Hi, On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote: We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () That doesn't necessarily mean that the postgres code is to fault here. It's entirely possible that the filesystem or storage is the bottleneck. Could you briefly describe workload & hardware? Workload is combination of inserts and selects. Looks like shared locks obtained by select cause starvation of inserts, trying to get exclusive relation extension lock. The problem is fixed by fair lwlock patch, implemented by Alexander Korotkov. This patch prevents granting of shared lock if wait queue is not empty. May be we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. That obvioulsy needs to be fixed... Sorry, looks like the problem is more obscure than I expected. What we have observed is that all backends are blocked in lwlock (sorry stack trace is not complete): #0 0x7ff5a9c566d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff3c57b9b38) at ../sysdeps/unix/sysv/lin ux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:111 #2 0x7ff5a9c567c8 in __new_sem_wait_slow (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:181 #3 0x7ff5a9c56839 in __new_sem_wait (sem=sem@entry=0x7ff3c57b9b38) at sem_wait.c:42 #4 0x56290c901582 in PGSemaphoreLock (sema=0x7ff3c57b9b38) at pg_sema.c:310 #5 0x56290c97923c in LWLockAcquire (lock=0x7ff3c7038c64, mode=LW_SHARED) at ./build/../src/backend/storage/lmgr/lwlock.c:1233 I happen after error in disk write operation. Unfortunately we do not have core files and not able to reproduce the problem. All LW locks should be cleared by LWLockReleaseAll but ... for some reasons it doesn't happen. We will continue investigation and try to reproduce the problem. I will let you know if we find the reason of the problem. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On 05.06.2018 07:22, Masahiko Sawada wrote: On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik wrote: On 26.04.2018 09:10, Masahiko Sawada wrote: On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas wrote: On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada wrote: Never mind. There was a lot of items especially at the last CommitFest. In terms of moving forward, I'd still like to hear what Andres has to say about the comments I made on March 1st. Yeah, agreed. $ ping -n andres.freund Request timeout for icmp_seq 0 Request timeout for icmp_seq 1 Request timeout for icmp_seq 2 Request timeout for icmp_seq 3 Request timeout for icmp_seq 4 ^C --- andres.freund ping statistics --- 6 packets transmitted, 0 packets received, 100.0% packet loss Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5...@postgrespro.ru shows that this patch has some benefits for other cases, which is a point in favor IMHO. Thank you for sharing. That's good to know. Andres pointed out the performance degradation due to hash collision when multiple loading. I think the point is that it happens at where users don't know. Therefore even if we make N_RELEXTLOCK_ENTS configurable parameter, since users don't know the hash collision they don't know when they should tune it. So it's just an idea but how about adding an SQL-callable function that returns the estimated number of lock waiters of the given relation? Since user knows how many processes are loading to the relation, if a returned value by the function is greater than the expected value user can know hash collision and will be able to start to consider to increase N_RELEXTLOCK_ENTS. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center We in PostgresProc were faced with lock extension contention problem at two more customers and tried to use this patch (v13) to address this issue. Unfortunately replacing heavy lock with lwlock couldn't completely eliminate contention, now most of backends are blocked on conditional variable: 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #0 0x7fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6 #1 0x007024ee in WaitEventSetWait () #2 0x00718fa6 in ConditionVariableSleep () #3 0x0071954d in RelExtLockAcquire () #4 0x004ba99d in RelationGetBufferForTuple () #5 0x004b3f18 in heap_insert () #6 0x006109c8 in ExecInsert () #7 0x00611a49 in ExecModifyTable () #8 0x005ef97a in standard_ExecutorRun () #9 0x0072440a in ProcessQuery () #10 0x00724631 in PortalRunMulti () #11 0x007250ec in PortalRun () #12 0x00721287 in exec_simple_query () #13 0x00722532 in PostgresMain () #14 0x0047a9eb in ServerLoop () #15 0x006b9fe9 in PostmasterMain () #16 0x0047b431 in main () Obviously there is nothing surprising here: if a lot of processes try to acquire the same exclusive lock, then high contention is expected. I just want to notice that this patch is not able to completely eliminate the problem with large number of concurrent inserts to the same table. Second problem we observed was even more critical: if backed is granted relation extension lock and then got some error before releasing this lock, then abort of the current transaction doesn't release this lock (unlike heavy weight lock) and the relation is kept locked. So database is actually stalled and server has to be restarted. Thank you for reporting. Regarding the second problem, I tried to reproduce that bug with latest version patch (v13) but could not. 
When transaction aborts, we call ReousrceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup() and clear either relext lock bits we are holding or waiting. If we raise an error after we added a relext lock bit but before we increment its holding count then the relext lock is remained, but I couldn't see the code raises an error between them. Could you please share the concrete reproduction Sorry, my original guess that LW-locks are not released in case of transaction abort is not correct. There was really situation when all backends were blocked in relation extension lock and looks like it happens after disk write error, but as far as it happens at customer's site, we have no time for investigation and not able to reproduce this problem locally. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On 2018/06/05 16:41, Ashutosh Bapat wrote: > On Tue, Jun 5, 2018 at 1:07 PM, Amit Langote > wrote: >> On 2018/06/05 1:25, Alvaro Herrera wrote: >>> In the meantime, my inclination is to fix the documentation to point out >>> that AFTER triggers are allowed but BEFORE triggers are not. >> >> Wasn't that already fixed by bcded2609ade6? >> >> We don't say anything about AFTER triggers per se, but the following under >> the limitations of partitioned tables: >> >> BEFORE ROW triggers, if necessary, must be defined on individual >> partitions, not the partitioned table. > > Thought correct that suggestion is problematic for upgrades. Probably > users will create triggers using same function on all the partitions. > After the upgrade when we start supporting BEFORE TRIGGER on > partitioned table, the user will have to drop these triggers and > create one trigger on the partitioned table to have the trigger to be > applicable to the new partitions created. Hmm yes, maybe there is scope for rewording then. Thanks, Amit
Re: Performance regression with PostgreSQL 11 and partitioning
On 2018/06/05 13:44, Ashutosh Bapat wrote: > On Tue, Jun 5, 2018 at 5:50 AM, David Rowley wrote: >> On 5 June 2018 at 01:35, Ashutosh Bapat >> wrote: >>> I was wondering if we can get rid of append_rel_list altogether. In >>> your patch, you have saved AppendRelInfos by child_relid. So all the >>> slots indexed by parent relid are empty. We could place AppendRelInfos >>> by parent relid. Thus a given AppendRelInfo would be places at two >>> places, one at the index of child relid and second at the index >>> pointed by parent relid. That's fine even in case of multilevel >>> inheritance since an intermediate parent has two relids one as a >>> parent and other as a child. >>> >>> One problem with that we do not know how long the array would be to >>> start with. But we could keep on repallocing the array to increase its >>> size. >> >> To be honest, the patch was meant as a discussion topic for ways to >> improve the reported regression in PG11. The patch was put together >> fairly quickly. > > I think the idea is brilliant. I do not have objections for trying > something in that direction. I am suggesting that we take this a bit > forward and try to eliminate append_rel_list altogether. Imho, we should try to refrain from implementing something that needs repalloc'ing the array, as long as we're doing it as a fix for PG 11. Also, because getting rid of append_rel_list altogether seems a bit involved. Let the AppendRelInfo's be added to append_rel_list when created, as is done now (expand_inherited_tables), and transfer them to the array when setting up various arrays (setup_simple_rel_arrays) as the David's patch does. >> I've not really considered what happens in the case where an inherited >> table has multiple parents. The patch happens to pass regression >> tests, but I've not checked to see if the existing tests create any >> tables this way. > > AFAIK, that case doesn't arise while processing a query. Examining the > way we create AppendRelInfos during expand_inherited_tables(), it's > clear that we create only one AppendRelInfo for a given child and also > for a given child and parent pair. If there are two tables from which > a child inherits, the child's RTE/RelOptInfo is associated only with > the top-most parent that appears in the query. See following comment > from find_all_inheritors() > /* > * Add to the queue only those children not already seen. This avoids > * making duplicate entries in case of multiple inheritance paths from > * the same parent. (It'll also keep us from getting into an infinite > * loop, though theoretically there can't be any cycles in the > * inheritance graph anyway.) > */ That's right. create table p1 (a int); create table p2 (b int); create table c () inherits (p1, p2); Child table 'c' has two parents but if a query specifies only p1, then just one AppendRelInfo corresponding to p1-c pair will be created. If p2 was also specified in the query, there will be an AppendRelInfo for p2-c pair too, but its child_relid will refer to another RT entry that's created for 'c', that is, the one created when expanding p2. >> Perhaps it's okay to change things this way for just partitioned >> tables and leave inheritance hierarchies alone. > > There's no point having two separate code paths when internally we are > treating them as same. +1 > The only difference between partitioning and > inheritance is that for multi-level partitioned table, we preserve the > inheritance hierarchy whereas for multi-level inheritance, we flatten > it. 
Yeah, the proposed patch doesn't change anything about how many AppendRelInfo's are created, nor about what's contained in them, so it seems safe to say that it would work unchanged for both regular inheritance and partitioning. Thanks, Amit
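The gist of the patch under discussion, in standalone form: instead of scanning append_rel_list to find the AppendRelInfo for a given child relid, index the entries once in an array keyed by child relid so each lookup is O(1). Toy types only; this is not the planner code, and the names are invented.

#include <stdio.h>
#include <stdlib.h>

typedef struct AppendRelInfoStub
{
    int parent_relid;
    int child_relid;
} AppendRelInfoStub;

int
main(void)
{
    /* what expand_inherited_tables() would have put in append_rel_list */
    AppendRelInfoStub list[] = {{1, 2}, {1, 3}, {1, 4}};
    int nrels = 5;              /* stand-in for simple_rel_array_size */

    /* build the by-child-relid index once, as setup-time work */
    AppendRelInfoStub **byrelid = calloc(nrels, sizeof(AppendRelInfoStub *));

    for (int i = 0; i < 3; i++)
        byrelid[list[i].child_relid] = &list[i];

    /* lookups are now a single array fetch instead of a list scan */
    int child = 3;

    printf("parent of rel %d is rel %d\n", child, byrelid[child]->parent_relid);

    free(byrelid);
    return 0;
}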
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 07:46 AM, Jeff Davis wrote: On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote: I expect the eviction strategy to be the primary design challenge of this patch. The other bits will be mostly determined by this one piece. Not sure I agree that this is the primary challenge. The cases that benefit from eviction are probably a minority. I see two categories that would benefit: * Natural clustering in the heap. This sounds fairly common, but a lot of the cases that come to mind are too low-cardinality to be compelling; e.g. timestamps grouped by hour/day/month. If someone has run into a high-cardinality natural grouping case, let me know. * ARRAY_AGG (or similar): individual state values can be large enough that we need to evict to avoid exceeding work_mem even if not adding any new groups. In either case, it seems like a fairly simple eviction strategy would work. For instance, we could just evict the entire hash table if work_mem is exceeded or if the hit rate on the hash table falls below a certain threshold. If there was really something important that should have stayed in the hash table, it will go back in soon anyway. So why should eviction be a major driver for the entire design? I agree it should be an area of improvement for the future, so let me know if you see a major problem, but I haven't been as focused on eviction. My concern is more about what happens when the input tuple ordering is inherently incompatible with the eviction strategy, greatly increasing the amount of data written to disk during evictions. Say for example that we can fit 1000 groups into work_mem, and that there are 2000 groups in the input data set. If the input is correlated with the groups, everything is peachy because we'll evict the first batch, and then group the remaining 1000 groups (or something like that). But if the input data is random (which can easily happen, e.g. for IP addresses, UUIDs and such) we'll hit the limit repeatedly and will evict much sooner. I know you suggested simply dumping the whole hash table and starting from scratch while we talked about this at pgcon, but ISTM it has exactly this issue. But I don't know if there actually is a better option - maybe we simply have to accept this problem. After all, running slowly is still better than OOM (which may or may not happen now). I wonder if we can somehow detect this at run-time and maybe fall-back to groupagg. E.g. we could compare number of groups vs. number of input tuples when we first hit the limit. It's a rough heuristics, but maybe sufficient. While the primary goal of the patch is addressing the OOM risks in hash aggregate, I think it'd be a mistake to see it just that way. I see it could allow us to perform hash aggregate more often, even if we know the groups won't fit into work_mem. If we could estimate how much of the aggregate state we'll have to spill to disk, we could still prefer hashagg over groupagg. We would pay the price for eviction, but on large data sets that can be massively cheaper than having to do sort. Agreed. I ran some tests of my patch in the last round, and they strongly supported choosing HashAgg a lot more often. A lot of sort improvements have been made though, so I really need to run some new numbers. Yeah, Peter made the sort stuff a lot faster. But I think there still is a lot of cases where the hashagg would greatly reduce the amount of data that needs to be written to disk, and that's not something the sort improvements could improve. 
If you need to aggregate a 1TB table, it's still going to be ~1TB of data you need to write to disk during on-disk sort. But if there is only a reasonable number of groups, that greatly simplifies things. far away), but it would be unfortunate to make this improvement impossible/more difficult in the future. If you see anything that would make this difficult in the future, let me know. Sure. Sorry for being too hand-wavy in this thread, and possibly moving the goal posts further away :-/ I got excited by you planning to work on this again and perhaps a bit carried away ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On 2018/06/05 1:25, Alvaro Herrera wrote: > In the meantime, my inclination is to fix the documentation to point out > that AFTER triggers are allowed but BEFORE triggers are not. Wasn't that already fixed by bcded2609ade6? We don't say anything about AFTER triggers per se, but the following under the limitations of partitioned tables: BEFORE ROW triggers, if necessary, must be defined on individual partitions, not the partitioned table. Thanks, Amit
Re: Remove mention in docs that foreign keys on partitioned tables are not supported
On Tue, Jun 5, 2018 at 1:07 PM, Amit Langote wrote: > On 2018/06/05 1:25, Alvaro Herrera wrote: >> In the meantime, my inclination is to fix the documentation to point out >> that AFTER triggers are allowed but BEFORE triggers are not. > > Wasn't that already fixed by bcded2609ade6? > > We don't say anything about AFTER triggers per se, but the following under > the limitations of partitioned tables: > > BEFORE ROW triggers, if necessary, must be defined on individual > partitions, not the partitioned table. Though correct, that suggestion is problematic for upgrades. Users will probably create triggers using the same function on all the partitions. After an upgrade, once we start supporting BEFORE triggers on partitioned tables, the user will have to drop these triggers and create one trigger on the partitioned table to have it apply to newly created partitions. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Spilling hashed SetOps and aggregates to disk
On 5 June 2018 at 17:04, Tomas Vondra wrote: > On 06/05/2018 04:56 AM, David Rowley wrote: >> Isn't there still a problem determining when the memory exhaustion >> actually happens though? As far as I know, we've still little >> knowledge how much memory each aggregate state occupies. >> >> Jeff tried to solve this in [1], but from what I remember, there was >> too much concern about the overhead of the additional accounting code. >> >> [1] >> https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com >> > > I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested > a couple of people who were originally worried about the overhead seem > to be accepting it now. Is there any great need to make everything pay the small price for this? Couldn't we just create a new MemoryContextMethod that duplicates aset.c, but has the require additional accounting built in at the implementation level, then make execGrouping.c use that allocator for its hash table? The code would not really need to be duplicated, we could just do things the same way as like.c does with like_match.c and include a .c file. We'd need another interface function in MemoryContextMethods to support getting the total memory allocated in the context, but that does not seem like a problem. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Spilling hashed SetOps and aggregates to disk
On 06/05/2018 09:22 AM, David Rowley wrote: On 5 June 2018 at 17:04, Tomas Vondra wrote: On 06/05/2018 04:56 AM, David Rowley wrote: Isn't there still a problem determining when the memory exhaustion actually happens though? As far as I know, we've still little knowledge how much memory each aggregate state occupies. Jeff tried to solve this in [1], but from what I remember, there was too much concern about the overhead of the additional accounting code. [1] https://www.postgresql.org/message-id/flat/CAKJS1f8yvvvj-sVDv_bcxkzcZKq0ZOTVhX0dHfnYDct2Mycq5Q%40mail.gmail.com#cakjs1f8yvvvj-svdv_bcxkzczkq0zotvhx0dhfnydct2myc...@mail.gmail.com I had a chat with Jeff Davis at pgcon about this, and IIRC he suggested a couple of people who were originally worried about the overhead seem to be accepting it now. Is there any great need to make everything pay the small price for this? Couldn't we just create a new MemoryContextMethod that duplicates aset.c, but has the require additional accounting built in at the implementation level, then make execGrouping.c use that allocator for its hash table? The code would not really need to be duplicated, we could just do things the same way as like.c does with like_match.c and include a .c file. We'd need another interface function in MemoryContextMethods to support getting the total memory allocated in the context, but that does not seem like a problem. There probably is not a need, but there was not a great way to enable it explicitly only for some contexts. IIRC what was originally considered back in 2015 was some sort of a flag in the memory context, and the overhead was about the same as with the int64 arithmetic alone. But I don't think we've considered copying the whole AllocSet. Maybe that would work, not sure. I wonder if an aggregate might use a custom context internally (I don't recall anything like that). The accounting capability seems potentially useful for other places, and those might not use AllocSet (or at least not directly). All of this of course assumes the overhead is still there. I sure will do some new measurements. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: libpq compression
On Tue, Jun 05, 2018 at 06:04:21PM +1200, Thomas Munro wrote: > On Thu, May 17, 2018 at 3:54 AM, Konstantin Knizhnik > Speaking of configuration, are you planning to support multiple > compression libraries at the same time? It looks like the current > patch implicitly requires client and server to use the same configure > option, without any attempt to detect or negotiate. Do I guess > correctly that a library mismatch would produce an incomprehensible > corrupt stream message? I just had a quick look at this patch, lured by the smell of your latest messages... And it seems to me that this patch needs a large amount of work as presented. There are a couple of things which are not really nice, like forcing the presence of the compression option in the startup packet to begin with. The hijacking around secure_read() is not nice either, as it is aimed at being a rather high-level API on top of the method used with the backend. On top of adding some documentation, I think that you could get some inspiration from the recent GSSAPI encryption patch, which has been submitted again for the v12 cycle and which has spent a large amount of time designing its set of options. -- Michael