Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread David Rowley
Thank you both for looking at this.

On 21 April 2018 at 06:28, Alvaro Herrera  wrote:
> +   {"enable_partition_pruning", PGC_USERSET, QUERY_TUNING_METHOD,
> +   gettext_noop("Enables the planner's ability to remove 
> non-required partitions from the query plan."),
> +   NULL
> +   },
> +   &enable_partition_pruning,
> +   true,
> +   NULL, NULL, NULL
> +   },
>
> I would make the short description shorter, and use the long description
> to elaborate.  So gettext_noop("Enable plan-time and run-time partition
> pruning.")
> followed by something like
>
> gettext_noop("Allows the query planner and executor to compare partition
> bounds to conditions in the query, and determine which partitions {can be
> skipped | must be scanned} ...")

I've taken a slight variation of this, but instead of ", and" I used
"to" and went with the "must be scanned" option.

select * from pg_settings where name like 'enable%'; does show that
this is the only enable_* GUC to have a long description, but perhaps
that does not matter.
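
For reference, the entry now looks roughly like this (see the attached patch
for the exact text; wording is approximate here):

   {
       {"enable_partition_pruning", PGC_USERSET, QUERY_TUNING_METHOD,
           gettext_noop("Enables plan-time and run-time partition pruning."),
           gettext_noop("Allows the query planner and executor to compare partition "
                        "bounds to conditions in the query to determine which "
                        "partitions must be scanned.")
       },
       &enable_partition_pruning,
       true,
       NULL, NULL, NULL
   },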

On 20 April 2018 at 20:51, Amit Langote  wrote:
> set constraint_exclusion to off;
>
> -- not ok!

It needed a bit more effort than I put in the first time around to
make this work properly. constraint_exclusion = 'off' becomes a bit of
a special case for partitioned tables now.  To make this work I had to
get rid of hasInheritedTarget and add a new enum that tracks whether
we're doing inheritance planning for an inheritance parent or for a
partitioned table.  We can't simply set hasInheritedTarget to true only
when planning for inheritance parents, as constraint_exclusion =
'partition' must still know that we're planning using the inheritance
planner.
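
The new enum looks something like this (a sketch; see the patch for the exact
definition):

typedef enum InheritanceKind
{
	INHKIND_NONE,			/* not planning for an inherited target */
	INHKIND_INHERITED,		/* planning for a traditional inheritance parent */
	INHKIND_PARTITIONED		/* planning for a partitioned table */
} InheritanceKind;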

v2 patch attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v2-0001-Add-GUC-to-allow-partition-pruning-to-be-disabled.patch
Description: Binary data


proposal: force slow part of plpgsql compilation

2018-04-20 Thread Pavel Stehule
Hi

I am the plpgsql_check author, and I am currently fixing a bug related to
cached plpgsql functions. I am not able to force plpgsql compilation, so the
compiled version can sometimes be obsolete. To protect against false alarms,
I need a fresh compilation immediately before the plpgsql_check call.

I need a new option "NoCache" for the plpgsql_compile function. If this
option is true, then function_valid will be treated as false.
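
A rough sketch of what I have in mind, applied to the cache-validity test in
pl_comp.c (the extra "nocache" parameter is only a proposal, its name is mine):

PLpgSQL_function *
plpgsql_compile(FunctionCallInfo fcinfo, bool forValidator, bool nocache)
{
	...
	if (function)
	{
		/* We have a compiled function, but is it still valid? */
		if (!nocache &&
			function->fn_xmin == HeapTupleHeaderGetRawXmin(procTup->t_data) &&
			ItemPointerEquals(&function->fn_tid, &procTup->t_self))
			function_valid = true;
	}
	...
}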

Comments, notes?

Regards

Pavel


Re: Corrupted btree index on HEAD because of covering indexes

2018-04-20 Thread Peter Geoghegan
On Fri, Apr 20, 2018 at 7:18 AM, Teodor Sigaev  wrote:
> After a closer look I changed my opinion. To have clean code it's much better
> to have a new pair of get/set macros specially to manage the link to the top
> parent during page deletion. This removes the last naked usage of
> ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses
> self-describing macros. Patch is attached.

I see your point. Maybe don't have the newline between the get and
set, though, to match the existing style. And, the note about the
assertion seems unnecessary.

I suggest putting something about what general area this deals with.
Perhaps something like the following:

"Get/set leaf page highkey's link. During the second phase of
deletion, the target leaf page's high key may point to an ancestor
page (at all other times, the leaf level high key's link is not used).
See the nbtree README for full details."

-- 
Peter Geoghegan



Re: Excessive PostmasterIsAlive calls slow down WAL redo

2018-04-20 Thread Thomas Munro
Here's a new version, because FreeBSD's new interface changed slightly.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-signals-for-postmaster-death-on-Linux-v3.patch
Description: Binary data


0002-Use-signals-for-postmaster-death-on-FreeBSD-v3.patch
Description: Binary data


Re: Searching for: Fast windows buildfarm animal

2018-04-20 Thread Tom Lane
Stephen Frost  writes:
> * Andres Freund (and...@anarazel.de) wrote:
>> It'd be awesome if somebody could set up a windows animal that runs
>> frequently (i.e. checks for build needed every minute or five) and is
>> fast enough to not take ages to finish a build.

> Done.

Thanks!

BTW, dory just fell over in HEAD:

*** c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/test/regress/expected/partition_prune.out	Fri Apr 20 16:03:09 2018
--- c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/test/regress/results/partition_prune.out	Fri Apr 20 20:12:43 2018
***
*** 1778,1788 
  -- Execute query 5 times to allow choose_custom_plan
  -- to start considering a generic plan.
  execute ab_q4 (1, 8);
!  avg 
! -
! 
! (1 row)
! 
  execute ab_q4 (1, 8);
   avg 
  -
--- 1778,1784 
  -- Execute query 5 times to allow choose_custom_plan
  -- to start considering a generic plan.
  execute ab_q4 (1, 8);
! ERROR:  lost connection to parallel worker
  execute ab_q4 (1, 8);
   avg 
  -

Wonder how reproducible that will be.

regards, tom lane



Re: Searching for: Fast windows buildfarm animal

2018-04-20 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> It's common that half the buildfarm has reported back before a single
> windows buildfarm animal reports. And if they report a failure one often
> has to wait for hours for the next run.

Yes, that's been rather annoying.

> It'd be awesome if somebody could set up a windows animal that runs
> frequently (i.e. checks for build needed every minute or five) and is
> fast enough to not take ages to finish a build.

Done.

'dory' was approved today which checks for new commits every 5 minutes
and, as an added bonus, builds with a somewhat more recent version of
Visual Studio (2015) and a recent Windows Server version (2016) as
compared to the other Windows systems.  While it's still not quite as
fast as our faster buildfarm animals, it's at least able to report in
within 20-30 minutes or so, as compared to hours.

We'll probably continue to tweak the system a bit, but it's at least up
and running now for HEAD, REL_10 and REL9_6 (older branches don't work
easily with VS2015, as discussed elsewhere).

If anyone has any specific questions about it or suggestions, please
feel free to reach out.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-04-20 Thread legrand legrand
[...]
> I've taken a look at your patch. I agree that having a plan identifier
> would be great, but I'm not a big fan of the jumbling. That's a lot of
> hashing that needs to be done to decide whether two plans are
> essentially equivalent or not.

As there is no planid available yet in core, I reused the jumbling from
pg_stat_plans, and I agree: it seems quite complicated (it's not even
certain that it is compatible with all the new partitioning stuff) and
I won't be able to maintain it.

> Maybe, in the future, in a different patch, we could add functionality
> that would allow us to compare two plan trees. There is even a remark
> in the header of src/backend/nodes/equalfuncs.c regarding this:
>  * Currently, in fact, equal() doesn't know how to compare Plan trees
>  * either.  This might need to be fixed someday. 

It would be so helpful if a planid were included in the core Query / plan
tree; it could be reused here in equal(), in EXPLAIN commands,
in pg_stat_activity, in pg_stat_statements ...


>> May be your view pg_stat_statements_plans could be transformed :
>> same key as pgss: dbid,userid,queryid,planid 
>> and THE plan column to store explain result (no more need for
>> plan_time nor
>> tz that 
>> are already available in modified pgss).

> The documentation [of pg_stat_statements] gives no clear indication
> which fields qualify as primary keys, mainly because the hashing that
> generates the queryid isn't guaranteed to produce unique results...
> So I'm not sure which columns should be used to create associations
> between pg_stat_statements and pg_stat_statements_plans.


Yes, I understand; the choice is between
  dbid, userid, queryid, planid
and
  dbid, userid, planid

(planid alone does not seem acceptable with respect to privacy policies).

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-20 Thread Jonathan Rudenberg


On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
>  wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >>  wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a 
> >> > lightly redacted plan. All of the hung queries were running some variant 
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung 
> > queries, and there are a few variants. Here's one that's parallel.
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...

Thanks for investigating! We have monitoring in place and a plan to grab stack 
traces if this happens again. Is there anything else that would be useful for 
us to try to get in addition to a stack trace from the process processing the 
stuck query?



Re: Event trigger bugs (was Re: Repeated crashes in GENERATED ... AS IDENTITY tests)

2018-04-20 Thread Tom Lane
I wrote:
> Hence, two questions:
> * Should EventTriggerTableRewrite do
> if (!currentEventTriggerState ||
> currentEventTriggerState->commandCollectionInhibited)
> return;
> like most of the other functions, or should it just check for null
> currentEventTriggerState?

After closer inspection I've concluded that it should not look
at commandCollectionInhibited; that disables collection of some
data, but it's not related to whether we ought to fire triggers,
AFAICS.

> * Of the three other callers of EventTriggerCommonSetup, only one
> has such a guard now.  But aren't EventTriggerDDLCommandStart and
> EventTriggerDDLCommandEnd equally vulnerable to this type of race
> condition?  What should they check exactly?  Perhaps
> EventTriggerCommonSetup itself should check this?

And the answer here is that DDLCommandStart must fire regardless of the
existence of currentEventTriggerState, because we don't set that up when
there are only ddl_command_start triggers.  But it seems like
EventTriggerDDLCommandEnd should quit if it's not set up.  That function
itself wouldn't crash, but presumably the called triggers would call
pg_event_trigger_ddl_commands which would fail.  Better to pretend the
triggers aren't active yet.
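
Concretely, I'm thinking of something along these lines (a sketch, not the
final patch):

void
EventTriggerDDLCommandEnd(Node *parsetree)
{
	/*
	 * See EventTriggerDDLCommandStart for a discussion about why event
	 * triggers are disabled in single user mode.
	 */
	if (!IsUnderPostmaster)
		return;

	/*
	 * Also do nothing if our state isn't set up, which it won't be if there
	 * weren't any relevant event triggers at the start of the current DDL
	 * command.  Triggers fired now would only see
	 * pg_event_trigger_ddl_commands() fail anyway.
	 */
	if (!currentEventTriggerState)
		return;

	... /* existing body: build run list and fire ddl_command_end triggers */
}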

(I see Andrew Gierth came to the same conclusion.)

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-20 Thread Bruce Momjian
On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:
> On 18 April 2018 at 19:46, Bruce Momjian  wrote:
> 
> > So, if sync mode passes the write to NFS, and NFS pre-reserves write
> > space, and throws an error on reservation failure, that means that NFS
> > will not corrupt a cluster on out-of-space errors.
> 
> Yeah. I need to verify in a concrete test case.

Thanks.

> The thing is that write() is allowed to be asynchronous anyway. Most
> file systems choose to implement eager reservation of space, but it's
> not mandated. AFAICS that's largely a historical accident to keep
> applications happy, because FSes used to *allocate* the space at
> write() time too, and when they moved to delayed allocations, apps
> tended to break too easily unless they at least reserved space. NFS
> would have to do a round-trip on write() to reserve space.
> 
> The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:
> 
> "
>A successful return from write() does not make any guarantee that
>data has been committed to disk.  On some filesystems, including NFS,
>it does not even guarantee that space has successfully been reserved
>for the data.  In this case, some errors might be delayed until a
>future write(2), fsync(2), or even close(2).  The only way to be sure
>is to call fsync(2) after you are done writing all your data.
> "
> 
> ... and I'm inclined to believe it when it refuses to make guarantees.
> Especially lately.

Uh, even calling fsync after write isn't 100% safe since the kernel
could have flushed the dirty pages to storage, and failed, and the fsync
would later succeed.  I realize newer kernels have that fixed for files
open during that operation, but that is the minority of installs.

> The idea is that when the SAN's actual physically allocated storage
> gets to 40TB it starts telling you to go buy another rack of storage
> so you don't run out. You don't have to resize volumes, resize file
> systems, etc. All the storage space admin is centralized on the SAN
> and storage team, and your sysadmins, DBAs and app devs are none the
> wiser. You buy storage when you need it, not when the DBA demands they
> need a 200% free space margin just in case. Whether or not you agree
> with this philosophy or think it's sensible is kind of moot, because
> it's an extremely widespread model, and servers you work on may well
> be backed by thin provisioned storage _even if you don't know it_.


> Most FSes only touch the blocks on dirty writeback, or sometimes
> lazily as part of delayed allocation. So if your SAN is running out of
> space and there's 100MB free, each of your 100 FSes may have
> decremented its freelist by 2MB and be happily promising more space to
> apps on write() because, well, as far as they know they're only 50%
> full. When they all do dirty writeback and flush to storage, kaboom,
> there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write
space from its free space, but the free space doesn't all exist.  I am
not sure how we can tell people to make sure the file system free space
is real.

> You'd have to actually force writes to each page through to the
> backing storage to know for sure the space existed. Yes, the docs say
> 
> "
>After a
>successful call to posix_fallocate(), subsequent writes to bytes in
>the specified range are guaranteed not to fail because of lack of
>disk space.
> "
> 
> ... but they're speaking from the filesystem's perspective. If the FS
> doesn't dirty and flush the actual blocks, a thin provisioned storage
> system won't know.

Frankly, in what cases will a write fail _for_ lack of free space?  It
could be a new WAL file (not recycled), or pages added to the end of
a table.

Is that it?  It doesn't sound too terrible.  If we can eliminate the
corruption due to free space exhaustion, it would be a big step
forward.

The next most common failure would be temporary storage failure or
storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about
that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Add read-only param to set_config(...) / SET that effects (at least) customized runtime options

2018-04-20 Thread Chapman Flack
On 04/20/2018 04:03 PM, Andrew Gierth wrote:
>> "Chapman" == Chapman Flack  writes:
>  Chapman> It seemed to me at the time to be not hard to implement, even
>  Chapman> just as an extension for proof of concept. One small obstacle
>  Chapman> is that the check hook on a variable doesn't get called during
>  Chapman> attempted RESET, so the only way to block a prohibited RESET
>  Chapman> would have to be from an assign hook, which isn't expected to
>  Chapman> fail. It could surely throw an error, which would mess up the
>  Chapman> whole RESET, but that might be tolerable.
> 
> Assign hooks aren't allowed to throw errors under any circumstance
> because they're called during transaction abort, and throwing an error
> during abort is obviously not tolerable.

Ah yes, now you remind me, that was why I had put my extension idea
on the table at the time.

It seems as if, for any of these ideas (cookie-protection of GUCs
or just SET READONLY for GUCs), at least some small patch to core
will be a prerequisite, to allow either a flag or a reset check hook
that can make RESET skip protected GUCs.

-Chap




Re: Corrupted btree index on HEAD because of covering indexes

2018-04-20 Thread Peter Geoghegan
On Thu, Apr 19, 2018 at 11:41 PM, Teodor Sigaev  wrote:
>> heard of people using bt_index_parent_check() in production, but only
>> when they already knew that their database was corrupt, and wanted to
>> isolate the problem. I imagine that people that use
>> bt_index_parent_check() in production do so because they want as much
>> information as possible, and are willing to do whatever it takes to
>> get more information.
>
> That's why I think we need to improve amcheck - ideally, it should not miss
> any mistakes in the tree structure.

I have finished writing a prototype that works, and passes all tests.
Without the fix, this is what the complaint from VACUUM looks like on
my machine:

ERROR:  right sibling 265 of block 134 is not next child 395 of block
135 in index "delete_test_table_pkey"

After the final VACUUM in your test case runs (the one that throws
this error), my working copy of amcheck can detect the same issue with
low overhead:

pg@~[20995]=# select bt_index_parent_check('delete_test_table_pkey', true);
ERROR:  index block lacks downlink in index "delete_test_table_pkey"
DETAIL:  Block=265 level=1 page lsn=0/909F4B18.

There is no complaint from amcheck if called before the final VACUUM
in your test case, since that's when the trouble starts (and ends).
It's easy to account for concurrent page splits within amcheck by
checking for BTP_SPLIT_END on the child page that could lack a
downlink. Don't confuse this with checking the BTP_INCOMPLETE_SPLIT
flag, which is what we set on the left half/sibling of a split -- not
the "new" right half/sibling. Of course, the right sibling is the type
of block that can lack a downlink until after the second phase of a
pagesplit (i.e. until it's fully finished), or after the first phase
of page deletion. (As you know, page deletion is like a page split "in
reverse".)

I said concurrent page split, but I really meant incomplete page
split, since a concurrent insert/split is impossible given that
amcheck holds a ShareLock here. (See commit 40dae7ec5.)

Of course, we don't see details about the next level up in the amcheck
error, unlike the VACUUM error. That shouldn't matter, because we're
already verifying a lot about the relationship between blocks at the
next level up, and even between the two levels. This enhancement adds
the one piece we were missing. It could in theory detect other
problems that VACUUM cannot detect, too.

Expect me to post a patch during the work day on Monday or Tuesday
(Pacific time). I need to polish this patch some more. I particularly
need to think about the use of memory for the Bloom filter (right now
I just use work_mem).

> Could you write it? I'm afraid I can't give a good explanation of why we
> believe that nobody updates this page and its high key while the page is
> unlocked but pinned.

Okay. I'll come up with something soon.

> Hm, it seems to me that 350ms is short enough to place it in both the core
> and amcheck tests. I think basic functionality should be covered by core
> tests, just as we test insert/create.

I think you're right. There is really no excuse for not having
thorough code coverage for a module like nbtree.

-- 
Peter Geoghegan



Re: Toast issues with OldestXmin going backwards

2018-04-20 Thread Andrew Gierth
> "Amit" == Amit Kapila  writes:

 Amit> I haven't tried to reproduce it, but I can see the possibility of
 Amit> the problem described by you. What should we do next? I could see
 Amit> few possibilities: (a) Vacuum for main and toast table should
 Amit> always use same OldestXmin,

I don't see how this could be arranged without either disabling the
ability to vacuum the toast table independently, or storing the xmin
somewhere.

 Amit> (b) Before removing the row from toast table, we should ensure
 Amit> that row in the main table is removed,

We can't find the main table row version(s) without scanning the whole
main table, so this amounts (again) to disabling vacuum on the toast
table only.

 Amit> (c) Change the logic during rewrite such that it can detect this
 Amit> situation and skip copying the tuple in the main table,

This seems more promising, but the problem is how to detect the error
safely (since currently, it's just ereport()ed from deep inside the
toast code). How about:

1) add a toast_datum_exists() call or similar that returns false if the
(first block of) the specified external toast item is missing

2) when copying a RECENTLY_DEAD tuple, check all its external column
values using this function beforehand, and if any are missing, treat the
tuple as DEAD instead.
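
As a rough sketch of where (2) could plug in, in the CLUSTER/VACUUM FULL copy
loop (the helper name is hypothetical, standing in for the proposed
toast_datum_exists()-style check):

		case HEAPTUPLE_RECENTLY_DEAD:
			/*
			 * If any externally toasted value of this tuple has already been
			 * vacuumed away from the toast table, treat the tuple as DEAD
			 * instead of copying it and failing later with "missing chunk
			 * number 0".
			 */
			if (HeapTupleHasExternal(tuple) &&
				!heap_tuple_all_toast_values_exist(OldHeap, tuple))
				isdead = true;
			else
				tups_recently_dead += 1;
			break;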

 Amit> on a quick look, this seems tricky, because the toast table TID
 Amit> might have been reused by that time,

Toast pointers don't point to TIDs fortunately, they point to OIDs. The
OID could have been reused (if there's been an OID wraparound since the
toast item was created), but that should be harmless: it means that we'd
incorrectly copy the new value with the old tuple, but the old tuple is
never going to be visible to anybody ever again so nothing will see that.

 Amit> or (d) Ensure that GetOldestXmin doesn't move backwards or write
 Amit> a new API similar to it which doesn't allow OldestXmin to move
 Amit> backwards and use that for the purpose of vacuum.

This would presumably require storing a reference OldestXmin somewhere
(in shared memory? but there'd have to be a global one and one per db,
and I don't think we have any shared memory structures at present that
are per-db). We'd even have to preserve an oldestXmin for databases with
no currently active connections, though we wouldn't have to preserve it
across restarts.

 Amit> Any better ideas?

It turns out this issue has come up before (5 years ago!):

https://www.postgresql.org/message-id/20362.1359747327%40sss.pgh.pa.us

-- 
Andrew (irc:RhodiumToad)



Re: Add read-only param to set_config(...) / SET that effects (at least) customized runtime options

2018-04-20 Thread Andrew Gierth
> "Chapman" == Chapman Flack  writes:

 Chapman> It seemed to me at the time to be not hard to implement, even
 Chapman> just as an extension for proof of concept. One small obstacle
 Chapman> is that the check hook on a variable doesn't get called during
 Chapman> attempted RESET, so the only way to block a prohibited RESET
 Chapman> would have to be from an assign hook, which isn't expected to
 Chapman> fail. It could surely throw an error, which would mess up the
 Chapman> whole RESET, but that might be tolerable.

Assign hooks aren't allowed to throw errors under any circumstance
because they're called during transaction abort, and throwing an error
during abort is obviously not tolerable.

-- 
Andrew (irc:RhodiumToad)



Re: Reopen logfile on SIGHUP

2018-04-20 Thread Alexander Kuzmenkov

On 04/16/2018 05:54 AM, Kyotaro HORIGUCHI wrote:

We can provide a new command "pg_ctl logrotate" to hide the
details. (It cannot be executed by root, though.)


I like this approach.

I looked at the patch and changed some things:
- cleaned up the error messages
- moved checkLogrotateSignal to postmaster.c, since it has no reason to
  be in xlog.c
- added some documentation I had from my older patch

Other than that, it looks good to me. The updated patch is attached.
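
For illustration, the kind of logrotate(8) setup that the new documentation
paragraph has in mind might look like this (paths, schedule and user are just
examples, not taken from the patch):

/var/lib/pgsql/data/log/postgresql.log {
    daily
    rotate 14
    compress
    missingok
    postrotate
        su - postgres -c '/usr/pgsql/bin/pg_ctl logrotate -D /var/lib/pgsql/data'
    endscript
}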

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4a68ec3..c0bf632 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -960,6 +960,22 @@ pg_ctl start | rotatelogs /var/log/pgsql_log 86400
   
 
   
+   An external log rotation tool, such as logrotate, can also 
+   be used to collect log files produced by the built-in logging collector. In this 
+   configuration, the location and names of the log files are determined by logging 
+   collector, and logrotate periodically archives these files.
+   There must be some way for logrotate to inform the application
+   that the log file it is using is going to be archived, and that it must direct further 
+   output to a new file. This is commonly done with a postrotate script
+   that sends a SIGHUP signal to the application, which then reopens the 
+   log file. In PostgreSQL, this is achieved using a 
+   logrotate command of pg_ctl. When the
+   server receives this command, it switches to the new log files according to its 
+   configuration (see ). If it is configured
+   to use a static file name, it will just reopen the file.
+  
+
+  
On many systems, however, syslog is not very reliable,
particularly with large log messages; it might truncate or drop messages
just when you need them the most.  Also, on Linux,
diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index 0816bc1..acd1d91 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -102,6 +102,12 @@ PostgreSQL documentation
kill
signal_name
process_id
+
+
+  
+   pg_ctl
+   logrotate
+   -D datadir
   
 
   On Microsoft Windows, also:
@@ -234,6 +240,12 @@ PostgreSQL documentation
   
 
   
+   logrotate mode commands the server to rotate its log file.
+   For the discussion of how to use this with external log rotation tools, see
+   .
+  
+
+  
register mode registers the PostgreSQL
server as a system service on Microsoft Windows.
The -S option allows selection of service start type,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d948369..4549f2e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -385,6 +385,9 @@ static bool LoadedSSL = false;
 static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
+/* Log rotation signal file path, relative to $PGDATA */
+#define LOGROTATE_SIGNAL_FILE	"logrotate"
+
 /*
  * postmaster.c - function prototypes
  */
@@ -398,6 +401,7 @@ static void reset_shared(int port);
 static void SIGHUP_handler(SIGNAL_ARGS);
 static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
+static bool checkLogrotateSignal(void);
 static void sigusr1_handler(SIGNAL_ARGS);
 static void startup_die(SIGNAL_ARGS);
 static void dummy_handler(SIGNAL_ARGS);
@@ -4995,6 +4999,24 @@ ExitPostmaster(int status)
 }
 
 /*
+ * Check to see if a log rotation request has arrived. Should be
+ * called by postmaster after receiving SIGUSR1.
+ */
+static bool
+checkLogrotateSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(LOGROTATE_SIGNAL_FILE, &stat_buf) == 0)
+	{
+		unlink(LOGROTATE_SIGNAL_FILE);
+		return true;
+	}
+
+	return false;
+}
+
+/*
  * sigusr1_handler - handle signal conditions from child processes
  */
 static void
@@ -5092,7 +5114,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		signal_child(PgArchPID, SIGUSR1);
 	}
 
-	if (CheckPostmasterSignal(PMSIGNAL_ROTATE_LOGFILE) &&
+	if ((checkLogrotateSignal() ||
+		 CheckPostmasterSignal(PMSIGNAL_ROTATE_LOGFILE)) &&
 		SysLoggerPID != 0)
 	{
 		/* Tell syslogger to rotate logfile */
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 143021d..5abafb2 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -59,6 +59,7 @@ typedef enum
 	STOP_COMMAND,
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
+	LOGROTATE_COMMAND,
 	STATUS_COMMAND,
 	PROMOTE_COMMAND,
 	KILL_COMMAND,
@@ -100,6 +101,7 @@ static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char backup_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char logrotate_file[MAXPGPATH];
 
 #ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
@@ -125,6 +127,7 @@ static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
 static void do_promote(void);
+static void do_logrotate(void);
 static 

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Alvaro Herrera
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa92ce2e68..c51a9270e4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -951,6 +951,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+   {
+   {"enable_partition_pruning", PGC_USERSET, QUERY_TUNING_METHOD,
+   gettext_noop("Enables the planner's ability to remove 
non-required partitions from the query plan."),
+   NULL
+   },
+   &enable_partition_pruning,
+   true,
+   NULL, NULL, NULL
+   },

I would make the short description shorter, and use the long description
to elaborate.  So gettext_noop("Enable plan-time and run-time partition
pruning.")
followed by something like

gettext_noop("Allows the query planner and executor to compare partition
bounds to conditions in the query, and determine which partitions {can be
skipped | must be scanned} ...")

(Not wedded to those particular phrasings.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Alvaro Herrera
Amit Langote wrote:

> Sorry, I should have said what I said after quoting only the last sentence
> of what you had said.  That is, I want to the new GUC to be the only
> determiner of whether the pruning occurs or not for partitioned tables.
> To implement that behavior, it will have to override the setting of
> constraint_exclusion (the parameter) in *some* cases, because some
> commands still rely on constraint exclusion (the algorithm) as the
> underlying pruning mechanism.

I agree -- it will make more sense now, and will continue to make sense
later when we remove usage of constraint exclusion for upd/del, to make
it work as you suggest:

* if the table is partitioned, do constraint exclusion based on
  enable_partition_prune=on rather than constraint_exclusion=partition.
  This will only affect upd/del, because the select queries would be
  affected by the enable_partition_prune anyway since
  constraint_exclusion does not apply.

* If the table is using regular inheritance, continue to use the
  original behavior.
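
Something like the following illustrates the intended split (table names are
just an example):

CREATE TABLE measurement (logdate date) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_2018 PARTITION OF measurement
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

SET constraint_exclusion = off;      -- should no longer matter for this table
SET enable_partition_pruning = on;   -- this is what decides pruning
EXPLAIN UPDATE measurement SET logdate = logdate WHERE logdate < '2018-06-01';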

> Now, the "override the setting of constraint_exclusion" implementation
> may not be the most popular choice in the end.

I guess there are different ways to implement it.  Supposedly this is
going to disappear in pg12, so I don't think it's a big deal.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Postgresql9.6 type cache invalidation issue - different behave of psql and pg regress

2018-04-20 Thread Pavel Stehule
2018-04-20 15:44 GMT+02:00 Tom Lane :

> Pavel Stehule  writes:
> > In interactive mode, build_row_from_class has unrefreshed metadata. But
> > why do I see this behavior only in psql and not in my regress tests?
>
> The short answer is that no plpgsql version before commit 4b93f5799
> will have nice behavior for cases where you change a referenced composite
> type between calls.  Why that's translating to the particular behavior
> you're seeing isn't clear, considering you showed only one case in
> detail; but I imagine it's because a parse of the plpgsql function
> happens before the ALTER TABLE in one case and not the other.
> Perhaps you have different settings of check_function_bodies,
> for instance.
>

good catch - I had check_function_bodies disabled.

Thank you for the reply. Now I can trust my regress tests again :)

Regards

Nice weekend

Pavel


> regards, tom lane
>


Re: [HACKERS] proposal: schema variables

2018-04-20 Thread Pavel Stehule
2018-04-20 17:32 GMT+02:00 Robert Haas :

> On Tue, Apr 17, 2018 at 12:28 PM, Pavel Stehule 
> wrote:
> > It's true that there are a lot of "unused" attributes for this purpose,
> > but there are a lot of shared attributes, and a lot of shared code.
> > Semantically, I see variables in the family of sequences, tables, indexes,
> > and views. Now it shares code, and I hope in the next steps more code can
> > be shared - constraints, triggers.
>
> I dunno, it seems awfully different to me.  There's only one "column",
> right?  What code is really shared here?  Are constraints and triggers
> even desirable features for variables?  What would be the use case?
>

A schema variable can hold a composite value. The patch allows using any
composite type or ad hoc composite values:

DECLARE x AS compositetype;
DECLARE x AS (a int, b int, c int);

Constraints are clear, no?

Triggers are maybe strange, but why not - they can be used like enhanced
constraints, or for some value calculations, ..


> I think stuffing this into pg_class is pretty strange.
>

It would be, if a variable were just a scalar value without any other
capabilities - but then there would be only a low benefit.

The access rights implementation is shared with the other pg_class objects too.

Regards

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Foreign keys and partitioned tables

2018-04-20 Thread Alvaro Herrera
Alvaro Herrera wrote:
> After wasting some time trying to resolve
> "minor last minute issues", I decided to reduce the scope for now: in
> the current patch, it's allowed to have foreign keys in partitioned
> tables, but it is not possible to have foreign keys that point to
> partitioned tables.  I have split out some preliminary changes that
> intended to support FKs referencing partitioned tables; I intend to
> propose that for early v12, to avoid spending any more time this
> commitfest on that.

Hello,

I won't be able to work on foreign keys pointing to partitioned tables
for the next commitfest, so if somebody is interested in seeing it
supported, I applaud they working on it and I offer a bit of time to
discuss it, if they're so inclined:

CREATE TABLE pktab (a int PRIMARY KEY) PARTITION BY RANGE (a);
... create some partitions ...

CREATE TABLE fktab (a int REFERENCES pktab);


Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Postgres stucks in deadlock detection

2018-04-20 Thread Alvaro Herrera
Konstantin Knizhnik wrote:

> As I have mentioned at the beginning of this thread, we have the same problem
> with deadlock detection timeout expiration with the YCSB benchmark with a zipf
> distribution.
> Here the source of contention is tuple locks. And as far as I understand
> from the discussion in the mentioned thread, it is not possible to eliminate
> heavyweight tuple locks.

Well, if the source of tuple locks are foreign keys, maybe we can fix
that problem by removing the need for tuple locks in the first place --
for example, mark a table as prohibiting DELETEs, TRUNCATE, and any
UPDATE that modifies the columns of the primary key.  With that, you
don't need to lock rows in the FK triggers.  (This is like 'ALTER TABLE
tb SET READ ONLY', except the restrictions are less severe).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2018-04-20 Thread Konstantin Knizhnik



On 30.03.2018 19:50, Ildus Kurbangaliev wrote:

On Mon, 26 Mar 2018 20:38:25 +0300
Ildus Kurbangaliev  wrote:


Attached rebased version of the patch. Fixed conflicts in pg_class.h.


New rebased version due to conflicts in master. Also fixed a few errors
and removed the cmdrop method since it couldn't be tested.

It seems useful (and not so difficult) to use custom compression
methods for WAL compression as well: replace the direct calls of
pglz_compress in xloginsert.c.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Add read-only param to set_config(...) / SET that effects (at least) customized runtime options

2018-04-20 Thread Chapman Flack
On 04/20/2018 09:55 AM, Matthias Kurz wrote:

> SET READONLY my_app.some_var = 'foo';
> SET READONLY SESSION my_app.some_var = 'foo';
> SET READONLY LOCAL my_app.some_var = 'foo';
> 
> Of course read-only would default to false for backwards compatibility.
> When setting READONLY on a SESSION config, that config is not allowed
> to be changed anymore until the connection gets closed - even when running
> RESET ALL or DISCARD ALL.

There may be considerable overlap here with the idea of
cookie-protected settings discussed back in

https://www.postgresql.org/message-id/59127E4E.8090705%40anastigmatix.net

At one point in that thread was a suggestion to look at proposals
for "secure variables" or "package variables" that were pending,
but I didn't see anything then to convince me those proposals
addressed the same use case.

There's also a thing called "schema variables" currently registered
in the 2018-09 commitfest, but I haven't really seen how it would
serve the same purpose either.

https://commitfest.postgresql.org/18/1608/

But the cookie-protected idea in the first link seems easy to
implement and was aimed at the same purpose you seem to have here:
nail down some setting in advance of running some code you don't
want changing it.

It seemed to me at the time to be not hard to implement, even just
as an extension for proof of concept. One small obstacle is that the
check hook on a variable doesn't get called during attempted RESET,
so the only way to block a prohibited RESET would have to be from
an assign hook, which isn't expected to fail. It could surely throw
an error, which would mess up the whole RESET, but that might be
tolerable.

I had been thinking to implement such an extension. I haven't worked
on it, but might yet. It would be simplest for it to exercise such
control over variables the extension itself defines. Not so easy
to protect other variables (poking a new check hook or assign hook
into an existing variable definition isn't practical, as those are
on a list that's static within guc.c).

But if you are only concerned about custom options, the needed
pieces might already exist.
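
As a minimal proof-of-concept sketch of that last point - an extension
protecting a custom option it defines itself (all names here are mine, and
note that RESET would still bypass the check hook, which is the obstacle
mentioned above):

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static char *protected_val;
static bool  locked = false;

static bool
protected_check_hook(char **newval, void **extra, GucSource source)
{
	/* once locked, refuse any further SET (but not RESET, see above) */
	if (locked)
	{
		GUC_check_errmsg("\"myext.protected\" has been locked for this session");
		return false;
	}
	return true;
}

void
_PG_init(void)
{
	DefineCustomStringVariable("myext.protected",
							   "A custom option that can be locked against SET.",
							   NULL,
							   &protected_val,
							   "",
							   PGC_USERSET,
							   0,
							   protected_check_hook,
							   NULL,
							   NULL);
}

(A SQL-callable function to flip "locked" is omitted for brevity.)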

-Chap



Re: Postgres stucks in deadlock detection

2018-04-20 Thread Konstantin Knizhnik



On 20.04.2018 18:36, Robert Haas wrote:

On Wed, Apr 18, 2018 at 10:08 AM, Konstantin Knizhnik
 wrote:

And it is very hard not to notice a 17-times difference.
Certainly that is true under the assumption that most deadlock timeout
expirations are caused by high workload and contention, and not by real
deadlocks. But that seems to be quite a common case.

If I understand this workload correctly, the contention is for the
relation extension lock.  But I think we're likely to move that out of
the heavyweight lock manager altogether in the not-too-distant future,
as proposed in https://commitfest.postgresql.org/17/1133/ ?  I'd be
interested in hearing what happens to performance with that patch
applied.



With the extension lock patch, performance is increased to 1146 TPS.
So it is much better than with vanilla Postgres and about 40% better
than with the deadlock patch (1146 vs. 719 TPS).

Profile is the following:

 33.51%  postgres    [.] s_lock
  4.59%  postgres    [.] LWLockWaitListLock
  3.67%  postgres    [.] perform_spin_delay
  3.04%  [kernel]    [k] gup_pgd_range
  2.43%  [kernel]    [k] get_futex_key
  2.00%  [kernel]    [k] __basepage_index
  1.20%  postgres    [.] calculateDigestFromBuffer
  0.97%  [kernel]    [k] update_load_avg
  0.97%  postgres    [.] XLogInsertRecord
  0.93%  [kernel]    [k] switch_mm_irqs_off
  0.90%  postgres    [.] LWLockAttemptLock
  0.88%  [kernel]    [k] _atomic_dec_and_lock
  0.84%  [kernel]    [k] __schedule
  0.82%  postgres    [.] ConditionVariableBroadcast
  0.75%  postgres    [.] LWLockRelease
  0.74%  [kernel]    [k] syscall_return_via_sysret
  0.65%  postgres    [.] SetLatch
  0.64%  [kernel]    [k] _raw_spin_lock_irqsave
  0.62%  [kernel]    [k] copy_user_enhanced_fast_string
  0.59%  postgres    [.] RelationPutHeapTuple
  0.55%  [kernel]    [k] select_task_rq_fair
  0.54%  [kernel]    [k] try_to_wake_up
  0.52%  [kernel]    [k] menu_select

So definitely, eliminating the heavyweight relation extension lock is a
good idea that removes the need for my deadlock patch ... but only in
this insert test.
As I have mentioned at the beginning of this thread, we have the same
problem with deadlock detection timeout expiration with the YCSB
benchmark with a zipf distribution.
Here the source of contention is tuple locks. And as far as I
understand from the discussion in the mentioned thread, it is not
possible to eliminate heavyweight tuple locks.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] path toward faster partition pruning

2018-04-20 Thread Alvaro Herrera
Amit Langote wrote:

> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
> that there are hash partitioning related tests in create_table,
> foreign_key, and partition_join files as well.  Do we want to use the
> custom opclass in those files as well?

By the way, let me suggest 'git grep -i' instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Postgres stucks in deadlock detection

2018-04-20 Thread Robert Haas
On Wed, Apr 18, 2018 at 10:08 AM, Konstantin Knizhnik
 wrote:
> And it is very hard not to notice a 17-times difference.
> Certainly that is true under the assumption that most deadlock timeout
> expirations are caused by high workload and contention, and not by real
> deadlocks. But that seems to be quite a common case.

If I understand this workload correctly, the contention is for the
relation extension lock.  But I think we're likely to move that out of
the heavyweight lock manager altogether in the not-too-distant future,
as proposed in https://commitfest.postgresql.org/17/1133/ ?  I'd be
interested in hearing what happens to performance with that patch
applied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] proposal: schema variables

2018-04-20 Thread Robert Haas
On Tue, Apr 17, 2018 at 12:28 PM, Pavel Stehule  wrote:
> It's true that there are a lot of "unused" attributes for this purpose, but
> there are a lot of shared attributes, and a lot of shared code. Semantically,
> I see variables in the family of sequences, tables, indexes, and views. Now
> it shares code, and I hope in the next steps more code can be shared -
> constraints, triggers.

I dunno, it seems awfully different to me.  There's only one "column",
right?  What code is really shared here?  Are constraints and triggers
even desirable features for variables?  What would be the use case?

I think stuffing this into pg_class is pretty strange.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Add read-only param to set_config(...) / SET that effects (at least) customized runtime options

2018-04-20 Thread Matthias Kurz
Hi,

in MS SQL Server you are able to make an entry of SESSION_CONTEXT read-only
by passing the @read_only param to the sp_set_session_context function:
"[ @read_only= ] { 0 | 1 } A flag of type bit. If 1, then the value for the
specified key cannot be changed again on this logical connection. If 0
(default), then the value can be changed."
See
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-set-session-context-transact-sql

I propose the same functionality for PostgreSQL - at least when setting
"customized options" (the ones with a dot in the name:
https://www.postgresql.org/docs/10/static/runtime-config-custom.html)

The set_config(...) function would therefore end up like:
set_config(setting_name, new_value, is_local, read_only)

The equivalent SET command could look something like:
SET READONLY my_app.some_var = 'foo';
SET READONLY SESSION my_app.some_var = 'foo';
SET READONLY LOCAL my_app.some_var = 'foo';

Of course read-only would default to false for backwards compatibility.
When setting READONLY on a SESSION config, that config is not allowed
to be changed anymore until the connection gets closed - even when running
RESET ALL or DISCARD ALL.
When setting read-only for a transaction-local config, "the effects of the SET
LOCAL command disappear at function exit" will still hold, as written
in https://www.postgresql.org/docs/10/static/sql-set.html.
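
To make the intended semantics concrete (none of this exists today; the fourth
argument is the proposal):

SELECT set_config('my_app.some_var', 'foo', false, true);  -- read_only = true
SELECT set_config('my_app.some_var', 'bar', false);        -- would now fail
RESET ALL;                                                  -- keeps 'foo'
SHOW my_app.some_var;                                       -- still 'foo'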

What do you think?

Regards,
Matthias


Re: Corrupted btree index on HEAD because of covering indexes

2018-04-20 Thread Teodor Sigaev

Should we use BTreeInnerTupleGetDownLink() as soon as we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?
I am not sure. Here we actually store an UP link - to the top parent to remove.
I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could
cause confusion; on the other hand, introducing
BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering




After a closer look I changed my opinion. To have clean code it's much better to
have a new pair of get/set macros specially to manage the link to the top parent
during page deletion. This removes the last naked usage of
ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses self-describing
macros. Patch is attached.
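
In short, the new pair boils down to something like this (see the attached
patch for the exact definitions):

/* Get/set the link to the top parent page to delete, stored in the leaf
 * high key during the second phase of page deletion. */
#define BTreeTupleGetTopParent(itup) \
	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeTupleSetTopParent(itup, blkno) \
	do { \
		ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno)); \
		BTreeTupleSetNAtts((itup), 0); \
	} while (0)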



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index beef089ba8..3be229db1f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1602,10 +1602,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	MemSet(&trunctuple, 0, sizeof(IndexTupleData));
 	trunctuple.t_info = sizeof(IndexTupleData);
 	if (target != leafblkno)
-		ItemPointerSetBlockNumber(&trunctuple.t_tid, target);
+		BTreeTupleSetTopParent(&trunctuple, target);
 	else
-		ItemPointerSetInvalid(&trunctuple.t_tid);
-	BTreeTupleSetNAtts(&trunctuple, 0);
+		BTreeTupleSetTopParent(&trunctuple, InvalidBlockNumber);
 
 	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
 	false, false) == InvalidOffsetNumber)
@@ -1690,7 +1689,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	BTPageOpaque opaque;
 	bool		rightsib_is_rightmost;
 	int			targetlevel;
-	ItemPointer leafhikey;
+	IndexTuple	leafhikey;
 	BlockNumber nextchild;
 
 	page = BufferGetPage(leafbuf);
@@ -1702,7 +1701,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * Remember some information about the leaf page.
 	 */
 	itemid = PageGetItemId(page, P_HIKEY);
-	leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
+	leafhikey = (IndexTuple) PageGetItem(page, itemid);
 	leafleftsib = opaque->btpo_prev;
 	leafrightsib = opaque->btpo_next;
 
@@ -1714,9 +1713,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * parent in the branch.  Set 'target' and 'buf' to reference the page
 	 * actually being unlinked.
 	 */
-	if (ItemPointerIsValid(leafhikey))
+	target = BTreeTupleGetTopParent(leafhikey);
+
+	if (target != InvalidBlockNumber)
 	{
-		target = ItemPointerGetBlockNumberNoCheck(leafhikey);
 		Assert(target != leafblkno);
 
 		/* fetch the block number of the topmost parent's left sibling */
@@ -1919,12 +1919,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * branch.
 	 */
 	if (target != leafblkno)
-	{
-		if (nextchild == InvalidBlockNumber)
-			ItemPointerSetInvalid(leafhikey);
-		else
-			ItemPointerSetBlockNumber(leafhikey, nextchild);
-	}
+		BTreeTupleSetTopParent(leafhikey, nextchild);
 
 	/*
 	 * Mark the page itself deleted.  It can be recycled when all current
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index fb8c769df9..67a94cb80a 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -800,11 +800,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
 	 */
 	MemSet(&trunctuple, 0, sizeof(IndexTupleData));
 	trunctuple.t_info = sizeof(IndexTupleData);
-	if (xlrec->topparent != InvalidBlockNumber)
-		ItemPointerSetBlockNumber(&trunctuple.t_tid, xlrec->topparent);
-	else
-		ItemPointerSetInvalid(&trunctuple.t_tid);
-	BTreeTupleSetNAtts(&trunctuple, 0);
+	BTreeTupleSetTopParent(&trunctuple, xlrec->topparent);
 
 	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
 	false, false) == InvalidOffsetNumber)
@@ -912,11 +908,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		/* Add a dummy hikey item */
 		MemSet(&trunctuple, 0, sizeof(IndexTupleData));
 		trunctuple.t_info = sizeof(IndexTupleData);
-		if (xlrec->topparent != InvalidBlockNumber)
-			ItemPointerSetBlockNumber(&trunctuple.t_tid, xlrec->topparent);
-		else
-			ItemPointerSetInvalid(&trunctuple.t_tid);
-		BTreeTupleSetNAtts(&trunctuple, 0);
+		BTreeTupleSetTopParent(&trunctuple, xlrec->topparent);
 
 		if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
 		false, false) == InvalidOffsetNumber)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1194be9281..34c71978d5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -226,6 +226,20 @@ typedef struct BTMetaPageData
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
 	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))
 
+/*
+ * Get/set link to top parent to remove. It will be better to have assertion
+ * BTreeTupleGetNAtts() == 0 but pre-v11 version hasn't set INDEX_ALT_TID_MASK
+ * bit and has P_HIKEY as value of t_tid.
+ 

Re: SHOW ALL does not honor pg_read_all_settings membership

2018-04-20 Thread Michael Paquier
On Fri, Apr 20, 2018 at 01:21:46PM +0200, Laurenz Albe wrote:
> I agree; here is a patch for that.

Thanks for taking care of that as well; this looks fine to me.  Note to
committer: this needs to be merged with the first patch actually fixing
the SHOW ALL issue.

None of the four core callers of GetConfigOption() actually use
restrict_superuser set to true...  Modules may use this option, though, so
of course let's not remove it.
--
Michael


signature.asc
Description: PGP signature


Re: Postgresql9.6 type cache invalidation issue - different behave of psql and pg regress

2018-04-20 Thread Tom Lane
Pavel Stehule  writes:
> In interactive mode, build_row_from_class has unrefreshed metadata. But
> why do I see this behavior only in psql and not in my regress tests?

The short answer is that no plpgsql version before commit 4b93f5799
will have nice behavior for cases where you change a referenced composite
type between calls.  Why that's translating to the particular behavior
you're seeing isn't clear, considering you showed only one case in
detail; but I imagine it's because a parse of the plpgsql function
happens before the ALTER TABLE in one case and not the other.
Perhaps you have different settings of check_function_bodies,
for instance.

regards, tom lane



Re: Toast issues with OldestXmin going backwards

2018-04-20 Thread Amit Kapila
On Thu, Apr 19, 2018 at 4:07 PM, Andrew Gierth
 wrote:
> Various comments in GetOldestXmin mention the possibility of the oldest
> xmin going backward, and assert that this is actually safe. It's not.
>
> Consider:
>
> A table has a toastable column. A row is updated in a way that changes
> the toasted value. There are now two row versions pointing to different
> toast values, one live and one dead.
>
> Now suppose the toast table - but not the main table - is vacuumed; the
> dead toast entries are removed, even though they are still referenced by
> the dead main-table row. Autovacuum treats the main table and toast
> table separately, so this can happen.
>
> Now suppose that OldestXmin goes backwards so that the older main table
> row version is no longer dead, but merely recently-dead.
>
> At this point, VACUUM FULL (or similar rewrites) on the table will fail
> with "missing chunk number 0 for ..." toast errors, because it tries to
> copy the recently-dead row, but that row's toasted values have been
> vacuumed away already.
>

I haven't tried to reproduce it, but I can see the possibility of the
problem described by you.  What should we do next?  I can see a few
possibilities: (a) Vacuum for the main and toast table should always use
the same OldestXmin, (b) Before removing a row from the toast table, we
should ensure that the row in the main table is removed, (c) Change the
logic during rewrite such that it can detect this situation and skip
copying the tuple in the main table; on a quick look, this seems
tricky, because the toast table TID might have been reused by that
time, so basically I am not sure if this can be a viable solution, or (d)
Ensure that GetOldestXmin doesn't move backwards, or write a new API
similar to it which doesn't allow OldestXmin to move backwards and use
that for the purpose of vacuum.

Any better ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Explain buffers wrong counter with parallel plans

2018-04-20 Thread Adrien Nayrat
Hello,

I tried to understand this issue and it seems the Gather node only takes
into account its own buffer usage:


create unlogged table t1 (c1 int);
insert into t1 select generate_series(1,100);
vacuum t1;

explain (analyze,buffers,timing off,costs off) select count(*) from t1;
  QUERY PLAN

 Finalize Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=1531
   ->  Gather (actual rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 Buffers: shared hit=1531
 ->  Partial Aggregate (actual rows=1 loops=3)
   Buffers: shared hit=4425
   ->  Parallel Seq Scan on t1 (actual rows=33 loops=3)
 Buffers: shared hit=4425


Same query without parallelism

 QUERY PLAN

 Aggregate (actual rows=1 loops=1)
   Buffers: shared hit=4425
   ->  Seq Scan on t1 (actual rows=100 loops=1)
 Buffers: shared hit=4425


We can notice that Parallel Seq Scan and Partial Aggregate report 4425 buffers,
the same as for the plan without parallelism.


I put elog debug around theses lines in execParallel.c :

/* Accumulate the statistics from all workers. */
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
  {
elog(LOG, "worker %d - shared_blks_read: %ld - shared_blks_hit: %ld", n,
instrument[n].bufusage.shared_blks_read,instrument[n].bufusage.shared_blks_hit);
InstrAggNode(planstate->instrument, &instrument[n]);
  }


And I get theses messages :

LOG:  worker 0 - shared_blks_read: 0 - shared_blks_hit: 1443
LOG:  worker 1 - shared_blks_read: 0 - shared_blks_hit: 1451


If you sum each worker's shared_blks_hit and Gather's shared hit you get:
1443 + 1451 + 1531 = 4425 (size of t1)

It seems the Gather node and Finalize Aggregate only report buffers touched by
the main process.  The workers seem to be omitted.

Same issue occurs for Gather merge :

explain (analyze,buffers,timing off,costs off)
  select * from t1 where c1%10=4 order by 1;

   QUERY PLAN
-
 Gather Merge (actual rows=10 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=1549
   ->  Sort (actual rows=3 loops=3)
 Sort Key: c1
 Sort Method: quicksort  Memory: 3143kB
 Buffers: shared hit=4521
 ->  Parallel Seq Scan on t1 (actual rows=3 loops=3)
   Filter: ((c1 % 10) = 4)
   Rows Removed by Filter: 30
   Buffers: shared hit=4425


In my log :
worker 0 - shared_blks_read: 0 - shared_blks_hit: 1487
worker 1 - shared_blks_read: 0 - shared_blks_hit: 1485

1487 + 1485 + 1549 = 4521

I tried to understand how instrumentation works, but it is above my knowledge :(



Regards,

-- 
Adrien




signature.asc
Description: OpenPGP digital signature


Re: Built-in connection pooling

2018-04-20 Thread Tatsuo Ishii
>> I understand your customers like to have unlimited number of
>> connections.  But my customers do not. (btw, even with normal
>> PostgreSQL, some of my customers are happily using over 1k, even 5k
>> max_connections).
> 
> If you have limited number of client, then you do not need pooling at
> all.

Still, a pooler is needed even if the number of connections is low,
because connecting to PostgreSQL is a very expensive operation, as
everybody knows.
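
As a rough psql illustration of that cost (sketch only; the timings in the
comments are just typical ballpark figures, not measurements from this
thread):

\timing on
SELECT 1;
-- usually well under a millisecond on an established connection
\connect postgres
-- a brand-new connection typically costs tens of milliseconds:
-- fork, authentication, catalog cache warm-up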

BTW, the main reason why Pgpool-II is used is not because it is a pooler,
but query routing: write queries go to the primary server and read queries
to standbys. This is not possible with a built-in pooler.

>> I am confused.  If so, why do you want to push statement based or
>> transaction based built-in connection pooler?
> 
> I want to provide session semantics but not start a dedicated backend
> for each session.
> Transaction-level rescheduling (rather than statement-level
> rescheduling) is used to avoid the complexity of storing/restoring
> transaction context and maintaining locks.

Not sure if it's acceptable for the community. Probably many developers
want a built-in pooler to keep exactly the same semantics as normal
connections.

Tom Lane wrote:
> FWIW, I concur with Heikki's position that we're going to have very high
> standards for the transparency of any in-core pooler.  Before trying to
> propose a patch, it'd be a good idea to try to fix the perceived
> shortcomings of some existing external pooler.  Only after you can say
> "there's nothing wrong with this that isn't directly connected to its
> not being in-core" does it make sense to try to push the logic into core.

So I would suggest you start with a session-level in-core pooler, which
would be much easier to make transparent than a transaction-level pooler.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Problem while setting the fpw with SIGHUP

2018-04-20 Thread Amit Kapila
On Fri, Apr 20, 2018 at 11:40 AM, Kyotaro HORIGUCHI
 wrote:
> By the way, I think I found a bug of FPW.
>
> The following steps yield an INSERT record that doesn't have an FPI
> after a checkpoint.
>
> (Start server with full_page_writes = off)
> CREATE TABLE t (a int);
> CHECKPOINT;
> INSERT INTO t VALUES (1);
> ALTER SYSTEM SET full_page_writes TO on;
> SELECT pg_reload_conf();
> CHECKPOINT;
> INSERT INTO t VALUES (1);
>
> The last insert is expected to write a record with FPI but it
> doesn't actually. No FPI will be written for the page after that.
>
> It seems that the reason is that XLogInsertRecord forgets
> to check for an update of doPageWrites.
>
> In the failure case, fpw_lsn has been set by XLogRecordAssemble,
> but doPageWrites is false at the time, so it considers that no FPI
> is required and sets fpw_lsn to InvalidXLogRecPtr.
>
> After that, XLogInsertRecord receives the record but it thinks
> that doPageWrites is true since it looks at the shared value.
>
>> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
>
> So this line thinks that "no FPI is omitted in this record", but
> actually the FPIs were simply never attached to the record.
>
> The attached patch lets XLogInsertRecord check whether doPageWrites
> has been turned on after the last call, and causes a recomputation in
> that case.
>

I think you have correctly spotted the problem and your fix looks good
to me.  As this is a separate problem and the fix is different from what
we are discussing here, I think it can be committed separately.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Built-in connection pooling

2018-04-20 Thread Craig Ringer
On Fri., 20 Apr. 2018, 06:59 Andres Freund,  wrote:

> On 2018-04-19 15:01:24 -0400, Tom Lane wrote:
> > Only after you can say "there's nothing wrong with this that isn't
> > directly connected to its not being in-core" does it make sense to try
> > to push the logic into core.
>
> I think there's plenty things that don't really make sense solving
> outside of postgres:
> - additional added hop / context switches due to external pooler
> - temporary tables
> - prepared statements
> - GUCs and other session state
>

Totally agreed. Poolers can make some limited efforts there, but that's all.

Poolers also have a hard time determining if a query is read-only or
read/write, whereas Pg itself has a better chance, and we could help it
along with function READONLY attributes if we wanted. This matters for
master/standby query routing. Standbys being able to proxy for the master
would be fantastic, but isn't practical without some kind of pooler.
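
For instance (an illustrative sketch only; the table and function here are
made up for the example):

CREATE TABLE counters (n int);
CREATE FUNCTION bump_counter() RETURNS int LANGUAGE sql
  AS $$ UPDATE counters SET n = n + 1 RETURNING n $$;

-- A pooler that routes on the first keyword would happily send this to a
-- standby, where it fails with "cannot execute UPDATE in a read-only
-- transaction":
SELECT bump_counter();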


> I think there's at least one thing that we should attempt to make
> easier for external pooler:
> - proxy authorization
>

Yes, very yes. I've raised this before in a limited form - SET SESSION
AUTHORIZATION that cannot be reset without a cookie value. But true proxy
auth would be better.
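
Roughly the idea (the COOKIE clause below is invented syntax, purely to
sketch the proposal; nothing like it exists today):

-- pooler authenticates as a trusted role, then impersonates the real user:
SET SESSION AUTHORIZATION app_user COOKIE 'k3y-r4nd0m-s3cr3t';  -- hypothetical
-- ... run the client's queries ...
-- without the secret, the client cannot escape the impersonated role:
RESET SESSION AUTHORIZATION;                             -- would be rejected
RESET SESSION AUTHORIZATION COOKIE 'k3y-r4nd0m-s3cr3t';  -- hypothetical, allowed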


Re: SHOW ALL does not honor pg_read_all_settings membership

2018-04-20 Thread Laurenz Albe
Michael Paquier wrote:
> On Mon, Apr 16, 2018 at 02:32:10PM +0200, Laurenz Albe wrote:
> > Now that the dust from the last commitfest is settling, I'll make a second
> > attempt to attract attention for this small bug fix.
> > 
> > The original commit was Simon's.
> 
> Thanks for the ping.
> 
> This was new as of v10, so this cannot be listed as an open item still I
> have added that under the section for older bugs, because you are right
> as far as I can see.
> 
> GetConfigOption is wrong by the way, as restrict_superuser means that
> all members of the group pg_read_all_settings can read
> GUC_SUPERUSER_ONLY params, and not only superusers, so the comment at
> least needs a fix, the variable ought to be renamed as well.

Thanks for the review!

I agree; here is a patch for that.

Yours,
Laurenz Albe

From cfe04ef974dba324c47510b854587de6c33e39a1 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Fri, 20 Apr 2018 13:13:20 +0200
Subject: [PATCH] Fix comments and a parameter name

Commit 25fff40798fc4ac11a241bfd9ab0c45c085e2212 changed the semantics
of GetConfigOption, but did not update the comment.

Also, the parameter name had better be changed.

Noted by Michael Paquier.
---
 src/backend/utils/misc/guc.c | 10 +-
 src/include/utils/guc.h  |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7acec5ea21..b5501fd949 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -6869,15 +6869,15 @@ SetConfigOption(const char *name, const char *value,
  * this cannot be distinguished from a string variable with a NULL value!),
  * otherwise throw an ereport and don't return.
  *
- * If restrict_superuser is true, we also enforce that only superusers can
- * see GUC_SUPERUSER_ONLY variables.  This should only be passed as true
- * in user-driven calls.
+ * If restrict_privileged is true, we also enforce that only superusers and
+ * members of the pg_read_all_settings role can see GUC_SUPERUSER_ONLY
+ * variables.  This should only be passed as true in user-driven calls.
  *
  * The string is *not* allocated for modification and is really only
  * valid until the next call to configuration related functions.
  */
 const char *
-GetConfigOption(const char *name, bool missing_ok, bool restrict_superuser)
+GetConfigOption(const char *name, bool missing_ok, bool restrict_privileged)
 {
 	struct config_generic *record;
 	static char buffer[256];
@@ -6892,7 +6892,7 @@ GetConfigOption(const char *name, bool missing_ok, bool restrict_superuser)
  errmsg("unrecognized configuration parameter \"%s\"",
 		name)));
 	}
-	if (restrict_superuser &&
+	if (restrict_privileged &&
 		(record->flags & GUC_SUPERUSER_ONLY) &&
 		!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_SETTINGS))
 		ereport(ERROR,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 3d13a33b94..f462eabe59 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -347,7 +347,7 @@ extern void DefineCustomEnumVariable(
 extern void EmitWarningsOnPlaceholders(const char *className);
 
 extern const char *GetConfigOption(const char *name, bool missing_ok,
-bool restrict_superuser);
+bool restrict_privileged);
 extern const char *GetConfigOptionResetString(const char *name);
 extern int	GetConfigOptionFlags(const char *name, bool missing_ok);
 extern void ProcessConfigFile(GucContext context);
-- 
2.14.3



Re: Boolean partitions syntax

2018-04-20 Thread Amit Langote
On 2018/04/19 20:35, Kyotaro HORIGUCHI wrote:
> At Wed, 18 Apr 2018 19:27:16 +0900, Amit Langote wrote:
>> On 2018/04/16 16:17, Kyotaro HORIGUCHI wrote:
>>> It was a bother that some rules used c_expr directly but I
>>> managed to replace all of them with a_expr by lowering precedence
>>> of some ordinary keywords (PASSING, BY, COLUMNS and ROW). c_expr
>>> is no longer used elsewhere so we can just remove columnref from
>>> c_expr. Finally [abc]0_expr was eliminated and we have only
>>> a_expr, b_expr, u_expr and c_expr. This seems simple enough.
>>>
>>> The relationship among the rules after this change is as follows.
>>>
>>> a_expr --+-- columnref
>>>  +-- u_expr ---+-- c_expr -- (all old c_expr stuff except columnref)
>>>+-- (all old a_expr stuff)
>>>
>>> b_expr --+-- columnref
>>>  +-- c_expr -- (ditto)
>>>  +-- (all old b_expr stuff)
>>>
>>> On the way to fixing this, I noticed that the precedence of some
>>> keywords (PRESERVE, STRIP_P) was higher than necessary and
>>> corrected it.
>>
>> Thank you for simplifying gram.y changes.  I think I was able to
>> understand them.  Having said that, I think your proposed patch to gram.y
>> could be rewritten such that they do not *appear* to impact the syntax for
>> other features like XML, etc.  For example, allowing a_expr in places
>> where previously only c_expr was allowed seems dangerous to me, although
>> I don't have any examples to show for it.
> 
> It cannot be dangerous, since "(a_expr)" is a c_expr. The
> difference is just the way conflicts are resolved. The words whose
> precedence I changed are used only in the problematic
> contexts.

Oh, okay.  But the point I was trying to make is that if a change that the
patch makes is not necessary to solve the problem at hand, we should try
to pursue it separately.  At first, I thought *all* the changes to gram.y
in your proposed patch were necessary, but after your clarifications, it
became clear to me that the only change needed was to refactor a_expr,
c_expr such that we can use a_expr for partition bound expressions
without shift/reduce conflicts.

>> It seems that we would like to use a_expr syntax but without columnref for
>> partition_bound expressions.  No columnrefs because they cause conflicts
>> with unreserved keywords MINVALUE and MAXVALUE that are used in range
> 
> Right.
> 
>> bound productions.  I think that can be implemented with minimal changes
>> to expression syntax productions, as shown in the attached patch.
> 
> Agreed that the refactor stuff can be separated, but I'm a bit
> uneasy with the increased number of new syntax. The purpose of
> the change of c_expr in v6 was to minimize the *structure* change
> of *_expr syntax. I agree that v8 style is preferable to
> backpatch to PG10, but I'd still like to use v6 style for PG11.

I think if this bug/open item can be resolved by adopting the minimal
patch, then we should use it for that.  Maybe, we can discuss the rest of
the changes independently.  If they make things better overall, we should
definitely try to adopt them.

>> About the changes in transformPartitionBoundValue() to check for collation
>> conflict, I think that seems unnecessary.  We're not going to use the
>> partition bound expression's collation anywhere, so trying to validate it
>> seems unnecessary.  I wondered if we should issue a WARNING to warn the
>> user that COLLATE specified in a partition bound expression is ignored,
> 
> That's definitely wrong.

Sorry, there may have been a misunderstanding.

> Partition key expression is inheriting the collation of the base
> column and any collation can be specifiable on the expression.
> Similarly we can specify collation on partition bound values with
> this patch. If a key expression and a bound value have
> contradicting collations, we cannot determine the ordering of the
> values and should error out.

There is infrastructure in place to store the collation of the partition
key (partcollation column of pg_partitioned_table catalog), which is used
when routing tuples, generating partition constraint expression, etc.  So,
we do remember the collation, either the default one derived from the
underlying column's definition or a user-specified one for partitioning.
And it *is* used when routing data (a partitioning action) and when
applying partition constraints.  I will borrow an example from an email I
had sent on the pruning thread:

create table p (a text) partition by range (a collate "en_US");
create table p1 partition of p for values from ('a') to ('m');
create table p2 partition of p for values from ('m') to ('z ');

create table q (a text) partition by range (a collate "C");
create table q1 partition of q for values from ('a') to ('m');
create table q2 partition of q for values from ('m') to ('z ');

-- per "en_US", 'a' <= 'A' < 'm'
insert into p values ('A');
INSERT 0 1

-- per "C", that's not true
insert into q values 

Re: Built-in connection pooling

2018-04-20 Thread Konstantin Knizhnik



On 20.04.2018 12:02, Tatsuo Ishii wrote:


I understand your customers like to have unlimited number of
connections.  But my customers do not. (btw, even with normal
PostgreSQL, some of my customers are happily using over 1k, even 5k
max_connections).


If you have a limited number of clients, then you do not need pooling at all.
With only one exception: if clients for some reason do not want to
keep connections to the database server and prefer to establish a
connection on demand and disconnect as soon as possible.
But IMHO in most cases it means bad design of the client application,
because establishing a connection (even with a connection pooler) is
quite an expensive operation.

The primary idea and main benefit of a built-in connection pooler is to
support session semantics with a limited number of backends.

I am confused.  If so, why do you want to push statement based or
transaction based built-in connection pooler?


I want to provide session semantics but not start a dedicated backend
for each session.
Transaction-level rescheduling (rather than statement-level rescheduling)
is used to avoid the complexity of storing/restoring transaction context
and maintaining locks.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Amit Langote
On 2018/04/20 17:51, Amit Langote wrote:
> On 2018/04/20 14:47, David Rowley wrote:
>> On 20 April 2018 at 14:07, Amit Langote  
>> wrote:
>>> To clarify: if we're going to add a new parameter *for partitioned tables*
>>> to configure whether or not pruning occurs, even if UPDATE and DELETE now
>>> rely on constraint exclusion for pruning, we should ignore the setting of
>>> constraint_exclusion the configuration parameter.  For UPDATE and DELETE,
>>> if enable_partition_pruning is on, we proceed to prune using constraint
>>> exclusion (because that's the only method available now), irrespective of
>>> the setting of constraint_exclusion.
>>>
>>> So to users, enable_partition_pruning should be the only way to configure
>>> whether or not pruning occurs.
>>
>> I hope the attached implements what is being discussed here.
>>
>> Please test it to ensure it behaves as you'd expect.
>>
>> I was a little unsure if the new GUCs declaration should live in
>> costsize.c or not since it really has no effect on plan costs, but in
>> the end, I stuck it there anyway so that it can be with its friends.
> 
> The patch looks good except one thing,
OK, I forgot to comment on a couple of minor issues.

+ 
+  enable_partition_pruning
(boolean)
+   
+enable_partition_pruning
configuration parameter
+   
+  
+  
+   
+Enables or disables the query planner's ability to eliminate a
+partitioned table's subpartitions from query plans.

Why subpartitions?  Maybe, just "partitions" is fine.

+  This also
+controls the planner's ability to generate query plans which
allow the
+query executor to remove or ignoring partitions during query

Here: ignoring -> ignore

Also, maybe add the GUC to postgresql.conf.sample.

Thanks,
Amit




Re: Built-in connection pooling

2018-04-20 Thread Tatsuo Ishii
 This is only applied to external process type pooler (like Pgpool-II).

> - temporary tables
> - prepared statements
> - GUCs and other session state
 These are only applied to "non session based" pooler; sharing a
 database connection with multiple client connections. "Session based"
 connection pooler like Pgpool-II does not have the shortcomings.
>>> But they are not solving the main problem: restricting the number of
>>> launched backends.
>> Pgpool-II already does this. If the number of concurrent clients exceeds
>> max_connections, the extra clients have to wait until another client
>> disconnects the session. So "restricting the number of launched
>> backends" is an independent function from whether a "session based"
>> connection pooler is used or not.
> Sorry, but delaying a new client connection until some other client
> disconnects is not an acceptable solution in most cases.

I just wanted to point out the counter-fact against this.

>>> But they are not solving the main problem: restricting the number of
>>> launched backends.

> Most customers want to provide connections to the database server
> for an unlimited (or at least > 100) number of clients.
> And these clients tend to keep connections alive and do not disconnect
> after execution of each statement/transaction.
> In this case the approach with session pooling doesn't work.

I understand your customers like to have unlimited number of
connections.  But my customers do not. (btw, even with normal
PostgreSQL, some of my customers are happily using over 1k, even 5k
max_connections).

>>> Pgbouncer  also can be used in session pooling mode. But  it makes
>>> sense only if there is limited number of clients which permanently
>>> connect/disconnect to the database.
>>> But I do not think that it is so popular use case. Usually there is
>>> very large number of connected clients which rarely drop connection
>>> but only few of them are active at each moment of time.
>> Not necessarily, i.e. session based poolers allow the use of temporary
>> tables and prepared statements and keep GUCs and other session state,
>> while non session based poolers do not.
>>
>> So choosing "session based poolers" or "non session based poolers" is
>> a trade off. i.e. let user choose one of them.
>>
>> If you are willing to merge your connection pooler into core, I would
>> suggest that you implement those pool modes.
> 
> 
> Sorry, maybe we do not understand each other.
> There are the following facts:
> 1. There are some entities in Postgres which are local to a backend:
> temporary tables, GUCs, prepared statements, relation and catalog
> caches,...
> 2. Postgres doesn't "like" a large number of backends, even if only a few
> of them are actually active. A large number of backends means a large
> procarray, large snapshots,...
> Please refer to my measurements at the beginning of this thread, which
> illustrate how the performance of Postgres degrades with an increasing
> number of backends.
> 3. Session semantics (prepared statements, GUCs, temporary tables) can
> be supported only in session-level pooling mode.

I agree with 1 -3.

> 4. This mode is not acceptable in most cases because it is not
> possible to limit the number of clients which want to establish a
> connection with the database server, or to keep that number small.
> This is why most pgbouncer users are using statement pooling mode.

Not sure about 4. I rarely see such users around me.

> 5. It doesn't matter how you managed to implement pooling outside
> Postgres: if you want to preserve session semantics, then you need to
> spawn as many backends as sessions. And the number of clients is limited
> by the number of backends/sessions.

Right. I am happy with the limitation for now.

> The primary idea and main benefit of a built-in connection pooler is to
> support session semantics with a limited number of backends.

I am confused.  If so, why do you want to push statement based or
transaction based built-in connection pooler?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Postgresql9.6 type cache invalidation issue - different behave of psql and pg regress

2018-04-20 Thread Pavel Stehule
Hi

I am searching for the reason why the results of plpgsql_check's regression
tests are different when they are executed from the regression tests versus
interactively.

This is simple test

drop table testtable;

create table testtable(a int, b int);

create or replace function test()
returns int as $$
declare r testtable;
begin
  select * into r from testtable;
  return r.a;
end;
$$ language plpgsql;

alter table testtable drop column b;
select * from plpgsql_check_function('test()');

This test should return 0 rows, and it works when I run it as a regression
test. But when I execute it in psql I get:

┌─────────────────────────────────────────────────────────────────────────┐
│                         plpgsql_check_function                          │
╞═════════════════════════════════════════════════════════════════════════╡
│ warning:0:4:SQL statement:too few attributes for target variables       │
│ Detail: There are more target variables than output columns in query.   │
│ Hint: Check target variables in SELECT INTO statement.                  │
└─────────────────────────────────────────────────────────────────────────┘
(3 rows)

In interactive mode, build_row_from_class has unrefreshed metadata. But
why do I see this behavior only in psql and not in my regression tests?

Regards

Pavel


Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Amit Langote
Hi David.

Thanks for writing the patch.

On 2018/04/20 14:47, David Rowley wrote:
> On 20 April 2018 at 14:07, Amit Langote  wrote:
>> To clarify: if we're going to add a new parameter *for partitioned tables*
>> to configure whether or not pruning occurs, even if UPDATE and DELETE now
>> rely on constraint exclusion for pruning, we should ignore the setting of
>> constraint_exclusion the configuration parameter.  For UPDATE and DELETE,
>> if enable_partition_pruning is on, we proceed to prune using constraint
>> exclusion (because that's the only method available now), irrespective of
>> the setting of constraint_exclusion.
>>
>> So to users, enable_partition_pruning should be the only way to configure
>> whether or not pruning occurs.
> 
> I hope the attached implements what is being discussed here.
> 
> Please test it to ensure it behaves as you'd expect.
> 
> I was a little unsure if the new GUCs declaration should live in
> costsize.c or not since it really has no effect on plan costs, but in
> the end, I stuck it there anyway so that it can be with its friends.

The patch looks good except one thing, which I was trying to emphasize
shouldn't be the behavior.

drop table p;
create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);

set enable_partition_pruning to off;

-- ok
explain select * from p where a = 1;
QUERY PLAN
--
 Append  (cost=0.00..83.88 rows=26 width=4)
   ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=4)
 Filter: (a = 1)
   ->  Seq Scan on p2  (cost=0.00..41.88 rows=13 width=4)
 Filter: (a = 1)
(5 rows)

reset enable_partition_pruning;
-- ok
explain select * from p where a = 1;
QUERY PLAN
--
 Append  (cost=0.00..41.94 rows=13 width=4)
   ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=4)
 Filter: (a = 1)
(3 rows)

set enable_partition_pruning to off;

-- ok
explain update p set a = 2 where a = 1;
QUERY PLAN
---
 Update on p  (cost=0.00..83.75 rows=26 width=10)
   Update on p1
   Update on p2
   ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
 Filter: (a = 1)
   ->  Seq Scan on p2  (cost=0.00..41.88 rows=13 width=10)
 Filter: (a = 1)
(7 rows)

reset enable_partition_pruning;

-- ok
explain update p set a = 2 where a = 1;
QUERY PLAN
---
 Update on p  (cost=0.00..41.88 rows=13 width=10)
   Update on p1
   ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
 Filter: (a = 1)
(4 rows)


set constraint_exclusion to off;

-- not ok!
explain update p set a = 2 where a = 1;
QUERY PLAN
---
 Update on p  (cost=0.00..83.75 rows=26 width=10)
   Update on p1
   Update on p2
   ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
 Filter: (a = 1)
   ->  Seq Scan on p2  (cost=0.00..41.88 rows=13 width=10)
 Filter: (a = 1)
(7 rows)

I think we should teach relation_excluded_by_constraints() to forge ahead
based on the value of enable_partition_pruning, ignoring whatever
constraint_exclusion has been set to.  What do you think of doing that
sort of thing?
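
In other words, the desired end state would look roughly like this (a
hand-written sketch of the expected plan, not actual output from the
current patch):

set constraint_exclusion to off;
set enable_partition_pruning to on;

-- desired: pruning still happens for UPDATE, despite constraint_exclusion
explain update p set a = 2 where a = 1;
--  Update on p
--    Update on p1
--    ->  Seq Scan on p1
--          Filter: (a = 1)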

Thanks,
Amit




Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Amit Langote
On 2018/04/20 15:00, Ashutosh Bapat wrote:
> On Fri, Apr 20, 2018 at 7:37 AM, Amit Langote wrote:
>> On 2018/04/19 21:50, Ashutosh Bapat wrote:
>>> There's no point in confusing users
>>> by adding dependencies between these two GUCs.
>>
>> That's exactly what I'm trying to propose.
> 
> Not really. By pruning based on the partition bounds I didn't mean
> constraint exclusion working on partition bound based constraints.

Sorry, I should have said what I said after quoting only the last sentence
of what you had said.  That is, I want the new GUC to be the only
determiner of whether pruning occurs or not for partitioned tables.
To implement that behavior, it will have to override the setting of
constraint_exclusion (the parameter) in *some* cases, because some
commands still rely on constraint exclusion (the algorithm) as the
underlying pruning mechanism.  Now, the "override the setting of
constraint_exclusion" implementation may not be the most popular choice in
the end.

Thanks,
Amit




Re: Built-in connection pooling

2018-04-20 Thread Konstantin Knizhnik



On 20.04.2018 11:16, Tatsuo Ishii wrote:

On 20.04.2018 01:58, Tatsuo Ishii wrote:

I think there's plenty things that don't really make sense solving
outside of postgres:
- additional added hop / context switches due to external pooler

This is only applied to external process type pooler (like Pgpool-II).


- temporary tables
- prepared statements
- GUCs and other session state

These are only applied to "non session based" pooler; sharing a
database connection with multiple client connections. "Session based"
connection pooler like Pgpool-II does not have the shortcomings.

But they are not solving the main problem: restricting the number of
launched backends.

Pgpool-II already does this. If the number of concurrent clients exceeds
max_connections, the extra clients have to wait until another client
disconnects the session. So "restricting the number of launched
backends" is an independent function from whether a "session based"
connection pooler is used or not.
Sorry, but delaying a new client connection until some other client
disconnects is not an acceptable solution in most cases.
Most customers want to provide connections to the database server for an
unlimited (or at least > 100) number of clients.
And these clients tend to keep connections alive and do not disconnect after
execution of each statement/transaction.

In this case the approach with session pooling doesn't work.







Pgbouncer  also can be used in session pooling mode. But  it makes
sense only if there is limited number of clients which permanently
connect/disconnect to the database.
But I do not think that it is so popular use case. Usually there is
very large number of connected clients which rarely drop connection
but only few of them are active at each moment of time.

Not necessarily, i.e. session based poolers allow the use of temporary
tables and prepared statements and keep GUCs and other session state,
while non session based poolers do not.

So choosing "session based poolers" or "non session based poolers" is
a trade off. i.e. let user choose one of them.

If you are willing to merge your connection pooler into core, I would
suggest that you implement those pool modes.



Sorry, maybe we do not understand each other.
There are the following facts:
1. There are some entities in Postgres which are local to a backend:
temporary tables, GUCs, prepared statements, relation and catalog caches,...
2. Postgres doesn't "like" a large number of backends, even if only a few of
them are actually active. A large number of backends means a large
procarray, large snapshots,...
Please refer to my measurements at the beginning of this thread, which
illustrate how the performance of Postgres degrades with an increasing
number of backends.
3. Session semantics (prepared statements, GUCs, temporary tables) can be
supported only in session-level pooling mode.
4. This mode is not acceptable in most cases because it is not possible
to limit the number of clients which want to establish a connection with
the database server, or to keep that number small.
This is why most pgbouncer users are using statement pooling mode.
5. It doesn't matter how you managed to implement pooling outside
Postgres: if you want to preserve session semantics, then you need to
spawn as many backends as sessions. And the number of clients is limited by
the number of backends/sessions.


The primary idea and main benefit of a built-in connection pooler is to
support session semantics with a limited number of backends.
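
To make the "session semantics" point concrete, this is the kind of
per-backend state at stake (a generic illustration, not taken from any of
the patches discussed here):

-- all of this lives in one backend's memory:
CREATE TEMP TABLE scratch(id int);
PREPARE q(int) AS SELECT * FROM scratch WHERE id = $1;
SET work_mem = '256MB';

-- with statement- or transaction-level pooling, the next statement may run
-- in a different backend, where none of the above exists:
EXECUTE q(1);    -- ERROR:  prepared statement "q" does not exist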



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Problem while setting the fpw with SIGHUP

2018-04-20 Thread Kyotaro HORIGUCHI
I noticed that the previous patch was mixed up with another
patch... sorry.

At Thu, 19 Apr 2018 19:11:43 +0530, Amit Kapila wrote:
> On Thu, Apr 19, 2018 at 7:19 AM, Michael Paquier  wrote:
> > On Wed, Apr 18, 2018 at 10:52:51AM -0400, Robert Haas wrote:
> >> I would just document the risks.  If the documentation says that you
> >> can't rely on the value until after the next checkpoint, or whatever
> >> the rule is, then I think we're fine.  I don't think that we really
> >> have the infrastructure to do any better; if we try, we'll just end up
> >> with odd warts.  Documenting the current set of warts is less churn
> >> and less work.
> >
> > The last version of the patch proposed has eaten this diff which was
> > part of one of the past versions (v2-0001-Change-FPW-handling.patch from
> > https://www.postgresql.org/message-id/20180412.103430.133595350.horiguchi.kyotaro%40lab.ntt.co.jp):
> > +The default is on. The change of the parameter 
> > takes
> > +effect at the next checkpoint time.
> > So there was some documentation about the behavior change, for what it's
> > worth.
> >
> > And, er, actually, I was thinking again about the case where a user
> > wants to disable full_page_writes temporarily to do some bulk load and
> > then re-enable it.  With the patch proposed to actually update the FPW
> > effect at checkpoint time, then a user would need to issue a manual
> > checkpoint after updating the configuration and reloading, which may
> > create more I/O than he'd want to pay for, then a second checkpoint
> > would need to be issued after the configuration comes back again.
> >
> 
> Why a second checkpoint?  One checkpoint either manual or automatic
> should be enough to make the setting effective.

One significant point about my first patch is that the FPW is in any case
useless until the second checkpoint starts. In terms of the timing at which
*useful* FPWs come back, the second checkpoint is required regardless of
the patch. As Amit said, there is no difference whether it is manual or
automatic.

On the other hand, the timing at which FPW is actually turned off is
different. From the user's point of view, one can run a bulk load
immediately under the current behavior, but with the first patch one has
to wait until the next checkpoint starts; that checkpoint can be triggered
manually but may be delayed until after the currently running checkpoint,
if any.

I think that starting the *first* checkpoint before a bulk load is not
such a bother, but on the other hand the behavior can lead to an FPW flood
for those who are accustomed to the current behavior.
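
For reference, the kind of workflow being discussed looks roughly like this
(a sketch; the table and file names are made up):

ALTER SYSTEM SET full_page_writes TO off;
SELECT pg_reload_conf();
COPY big_table FROM '/tmp/bulk.csv';   -- bulk load without FPIs
ALTER SYSTEM SET full_page_writes TO on;
SELECT pg_reload_conf();
CHECKPOINT;   -- a checkpoint is needed before the re-enabled FPWs
              -- actually start protecting pages again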

The attached first patch is the bugfix proposed in this thread.
The second fixes the concurrent update problem only.

The third changes the behavior so that turning on happens only
at checkpoints and turning off happens at any time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9b3496ceb6f39b017698225fef289974bd01fb07 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 20 Apr 2018 15:01:26 +0900
Subject: [PATCH 1/3] Correctly attach FPI to the first record after a
 checkpoint

XLogInsert fails to attach required FPIs to the first record after a
checkpoint if no other record has been written after full_page_writes
was turned on. This creates an inconsistency between the fpw flag of the
checkpoint record and the following record's FPW status. This patch
makes XLogInsertRecord cause a recomputation when the given record
was assembled while FPW was off but the flag is found to have been turned
on.
---
 src/backend/access/transam/xlog.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..27753f7321 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -937,7 +937,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  *
  * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
  * WAL record applies to, that were not included in the record as full page
- * images.  If fpw_lsn >= RedoRecPtr, the function does not perform the
+ * images.  If fpw_lsn <= RedoRecPtr, the function does not perform the
  * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
@@ -970,6 +970,7 @@ XLogInsertRecord(XLogRecData *rdata,
 			   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
@@ -1022,7 +1023,8 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * This can only happen just after a checkpoint, so it's better to be slow
 	 * in this case and fast otherwise.
 	 *
-	

Re: Built-in connection pooling

2018-04-20 Thread Tatsuo Ishii
> On 20.04.2018 01:58, Tatsuo Ishii wrote:
>>> I think there's plenty things that don't really make sense solving
>>> outside of postgres:
>>> - additional added hop / context switches due to external pooler
>> This is only applied to external process type pooler (like Pgpool-II).
>>
>>> - temporary tables
>>> - prepared statements
>>> - GUCs and other session state
>> These are only applied to "non session based" pooler; sharing a
>> database connection with multiple client connections. "Session based"
>> connection pooler like Pgpool-II does not have the shortcomings.
> But they are not solving the main problem: restricting the number of
> launched backends.

Pgpool-II already does this. If the number of concurrent clients exceeds
max_connections, the extra clients have to wait until another client
disconnects the session. So "restricting the number of launched
backends" is an independent function from whether a "session based"
connection pooler is used or not.

> Pgbouncer  also can be used in session pooling mode. But  it makes
> sense only if there is limited number of clients which permanently
> connect/disconnect to the database.
> But I do not think that it is so popular use case. Usually there is
> very large number of connected clients which rarely drop connection
> but only few of them are active at each moment of time.

Not necessarily, i.e. session based poolers allow the use of temporary
tables and prepared statements and keep GUCs and other session state,
while non session based poolers do not.

So choosing "session based poolers" or "non session based poolers" is
a trade off. i.e. let user choose one of them.

If you are willing to merge your connection pooler into core, I would
suggest that you implement those pool modes.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Built-in connection pooling

2018-04-20 Thread Konstantin Knizhnik



On 20.04.2018 03:14, Tatsuo Ishii wrote:

On Fri, Apr 20, 2018 at 07:58:00AM +0900, Tatsuo Ishii wrote:

Yeah. Since SCRAM auth is implemented, some connection poolers
including Pgpool-II are struggling to adopt it.

Er, well.  pgpool is also taking advantage of MD5 weaknesses...  While
SCRAM fixes this class of problems, and channel binding actually makes
this harder for poolers to deal with.

One of Pgpool-II developers Usama are working hard to re-implement
SCRAM auth for upcoming Pgpool-II 4.0: i.e. storing passwords (of
course in some encrypted form) in Pgpool-II.



Just want to note that authentication is an area where I have no
experience at all.
So any suggestions or help in developing the right authentication mechanism
for built-in connection pooling are welcome.


Right now, authentication of a pooled session by a shared backend is
performed in the same way as by a normal (dedicated) Postgres backend.
The postmaster just transfers the accepted socket to one of the workers
(backends), and it performs authentication in the normal way.
It actually means that all sessions scheduled to the same worker should
access the same database under the same user.
Accepting connections to different databases/users is currently supported
by making it possible to create several session pools and binding each
session pool to its own port, at which the postmaster will accept
connections for that session pool.


As an alternative approach I considered spawning a separate "authentication"
process (or doing it in the postmaster), which would process the startup
packet and only then schedule the session to one of the workers. But such a
policy is much more difficult to implement, and it is unclear how to map
database/user pairs to worker backends.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Built-in connection pooling

2018-04-20 Thread Konstantin Knizhnik



On 20.04.2018 01:58, Tatsuo Ishii wrote:

I think there's plenty things that don't really make sense solving
outside of postgres:
- additional added hop / context switches due to external pooler

This is only applied to external process type pooler (like Pgpool-II).


- temporary tables
- prepared statements
- GUCs and other session state

These are only applied to "non session based" pooler; sharing a
database connection with multiple client connections. "Session based"
connection pooler like Pgpool-II does not have the shortcomings.
But they are not solving the main problem: restricting the number of
launched backends.
Pgbouncer  also can be used in session pooling mode. But  it makes sense 
only if there is limited number of clients which permanently 
connect/disconnect to the database.
But I do not think that it is so popular use case. Usually there is very 
large number of connected clients which rarely drop connection but only 
few of them are active at each moment of time.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Built-in connection pooling

2018-04-20 Thread Konstantin Knizhnik



On 19.04.2018 17:27, Dave Cramer wrote:



On Thu, Apr 19, 2018, 9:24 AM Konstantin Knizhnik wrote:




On 19.04.2018 07:46, Tsunakawa, Takayuki wrote:
> From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> Oracle, for example, you can create dedicated and non-dedicated
backends.
>> I wonder why we do not want to have something similar in Postgres.
> Yes, I want it, too.  In addition to dedicated and shared server
processes, Oracle provides Database Resident Connection Pooling
(DRCP).  I guessed you were inspired by this.
>
>

https://docs.oracle.com/cd/B28359_01/server.111/b28310/manproc002.htm#ADMIN12348

It seems that my connection pooling is closer to DRCP than to shared
servers.
It is not clear from this article what these 35KB per client connection
are used for...
It seems to be something similar to the session context used to
suspend/resume a session.
In my prototype I also maintain some per-session context to keep the
values of session-specific GUCs, the temporary namespace, ...
Definitely the pooled session memory footprint depends on the size of the
catalog, prepared statements, updated GUCs, ... but 10-100kB seems to be a
reasonable estimate.


>
> BTW, you are doing various great work -- autoprepare,
multithreaded Postgres, built-in connection pooling, etc. etc.,
aren't you?  Are you doing all of these alone?
Yes, but there is a huge distance from a prototype to a product-ready
solution. And definitely I need some help here. This is why I have to
suspend further development of the multithreaded version of Postgres
(it looks like it is not considered a realistic project by the community).
But with built-in connection pooling the situation is better, and I am
going to test it with some of our clients which are interested in this
feature.


Konstantin



It would be useful to test with the JDBC driver

We run into issues with many pool implementations due to our 
opinionated nature


I have tested the built-in connection pool with the YCSB benchmark, which
is implemented in Java and so works through the JDBC driver.

Results were published in the following mail in this thread:
https://www.postgresql.org/message-id/7359-c582-7a08-5772-cb882988c0ae%40postgrespro.ru


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Built-in connection pooling

2018-04-20 Thread Vladimir Sitnikov
>Development in built-in connection pooling will be continued in
https://github.com/postgrespro/postgresql.builtin_pool.git

The branch (as of 0020c44195992c6dce26baec354a5e54ff30b33f) passes pgjdbc
tests: https://travis-ci.org/vlsi/pgjdbc/builds/368997672

Current tests are mostly single-threaded, so the tests are unlikely to
trigger lots of "concurrent connection" uses.
The next step might be to create multiple schemas, and execute multiple
tests in parallel.

Vladimir


Re: Built-in connection pooling

2018-04-20 Thread Vladimir Sitnikov
Christopher> One of the things that they find likable is that by having
Christopher> the connection pool live in the framework alongside the
Christopher> application is that this makes it easy to attach hooks so
Christopher> that the pool can do intelligent things based on
Christopher> application-aware logic.

I'm afraid I do not follow you. Can you please provide an example?

TL;DR:
1) I think in-application pooling would be required for performance reasons
in any case.
2) Out-of-application pooling (in-backend or in-the-middle) is likely
needed as well


JDBC clients use client-side connection pooling for performance reasons:

1) Connection setup does have overhead:
1.1) TCP connection takes time to init/close
1.2) Startup queries involve a couple of roundtrips: "startup packet", then
"SET extra_float_digits = 3", then "SET application_name = '...' "
2) Binary formats on the wire are tied to oids. Clients have to cache the
oids somehow, and "cache per connection" is the current approach.
3) Application threads tend to augment "application_name", "search_path",
etc. for their own purposes, and it would slow the application down
significantly if the JDBC driver reverted application_name/search_path/etc.
for each and every "connection borrow" (see the sketch after this list).
4) I believe there's non-zero overhead for backend process startup
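
A rough illustration of item 3 (the parameter values are made up):

-- what a pool that cannot trust the previous user of the connection
-- would have to replay on every borrow, and undo on every return:
SET application_name = 'web-worker-12';
SET search_path = tenant_42, public;
-- ... application work ...
RESET application_name;
RESET search_path;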

As Konstantin lists in the initial email, the problem is that the backend
itself does not scale well with lots of backend processes.
In other words: it is fine if PostgreSQL is accessed by a single Java
application, since the number of connections would be reasonable (limited by
the Java connection pool).
That, however, is not the case when the DB is accessed by lots of
applications (== lots of idle connections) and/or in case the application is
using short-lived connections (== the in-app pool is missing, which forces
backend processes to come and go).

Vladimir


Re: Corrupted btree index on HEAD because of covering indexes

2018-04-20 Thread Teodor Sigaev

heard of people using bt_index_parent_check() in production, but only
when they already knew that their database was corrupt, and wanted to
isolate the problem. I imagine that people that use
bt_index_parent_check() in production do so because they want as much
information as possible, and are willing to do whatever it takes to
get more information.
That's why I think we need to improve amcheck - ideally, it should not miss
any mistakes in the tree structure.
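
(For reference, the checks in question are the ones exposed by the amcheck
extension; the index name below is just a placeholder:)

CREATE EXTENSION IF NOT EXISTS amcheck;
-- lighter check, needs only an AccessShareLock on the index:
SELECT bt_index_check('some_index');
-- the stricter parent/child check discussed here, needs a ShareLock:
SELECT bt_index_parent_check('some_index');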



Agree, but at least this place needs a comment - why it's safe.


Good idea.
Could you write it? I'm afraid I can't give a good explanation of why we
believe that nobody updates this page and its high key while the page is
unlocked but pinned.





I also think that we could have better conventional regression test
coverage here.


Will try to invent a not-so-large test.


Your smaller test takes about 350ms to run on a debug build on my
laptop, which seems worth it (honestly, this test should have been
there already). Maybe we can add it to the amcheck regression tests
instead, since those are run less often. This also allows us to add a
test case specifically as part of the amcheck enhancement, which makes
a bit more sense.
Hm, it seems to me that 350ms is short enough to place it in both the core
and amcheck tests. I think basic functionality should be covered by core
tests, as we test insert/create there.



--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: Corrupted btree index on HEAD because of covering indexes

2018-04-20 Thread Teodor Sigaev

I tried to minimize Michael's test case and add it to patch.


-if (ItemPointerIsValid(leafhikey))
+if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)

Should we use BTreeInnerTupleGetDownLink(), since we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?
I am not sure. Here we actually store an UP link - to the top parent to
remove. I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink
could cause confusion; on the other hand, introducing
BTreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems like over-engineering.


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/



Re: Problem while setting the fpw with SIGHUP

2018-04-20 Thread Kyotaro HORIGUCHI
By the way, I think I found a bug of FPW.

The following steps yield an INSERT record that doesn't have an FPI
after a checkpoint.

(Start server with full_page_writes = off)
CREATE TABLE t (a int);
CHECKPOINT;
INSERT INTO t VALUES (1);
ALTER SYSTEM SET full_page_writes TO on;
SELECT pg_reload_conf();
CHECKPOINT;
INSERT INTO t VALUES (1);

The last insert is expected to write a record with FPI but it
doesn't actually. No FPI will be written for the page after that.

It seems that the reason is that XLogInsertRecord forgets
to check for an update of doPageWrites.

In the failure case, fpw_lsn has been set by XLogRecordAssemble,
but doPageWrites is false at the time, so it considers that no FPI
is required and sets fpw_lsn to InvalidXLogRecPtr.

After that, XLogInsertRecord receives the record but it thinks
that doPageWrites is true since it looks at the shared value.

> if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)

So this line thinks that "no FPI is omitted in this record", but
actually the FPIs were simply never attached to the record.

The attached patch lets XLogInsertRecord check whether doPageWrites
has been turned on after the last call, and causes a recomputation in
that case.

> * If there are any registered buffers, and a full-page image was not taken
> * of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
> * signals that the assembled record is only good for insertion on the
> * assumption that the RedoRecPtr and doPageWrites values were up-to-date.

And the patch fixes one comment typo of XLogInsertRecord.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 40ce98bba0496b1eb0a982134eae9cec01d532a8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 20 Apr 2018 15:01:26 +0900
Subject: [PATCH] Correctly attach FPI to the first record after a checkpoint

XLogInsert fails to attach required FPIs to the first record after a
checkpoint if no other record has been written after full_page_writes
was turned on. This creates an inconsistency between the fpw flag of the
checkpoint record and the following record's FPW status. This patch
makes XLogInsertRecord cause a recomputation when the given record
was assembled while FPW was off but the flag is found to have been turned
on.
---
 src/backend/access/transam/xlog.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08dc9ba031..27753f7321 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -937,7 +937,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  *
  * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
  * WAL record applies to, that were not included in the record as full page
- * images.  If fpw_lsn >= RedoRecPtr, the function does not perform the
+ * images.  If fpw_lsn <= RedoRecPtr, the function does not perform the
  * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
  * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
  * record is always inserted.
@@ -970,6 +970,7 @@ XLogInsertRecord(XLogRecData *rdata,
 			   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
@@ -1022,7 +1023,8 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * This can only happen just after a checkpoint, so it's better to be slow
 	 * in this case and fast otherwise.
 	 *
-	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
+	 * If doPageWrites was just turned on, we force a recomputation. Otherwise
+	 * if we aren't doing full-page writes then RedoRecPtr doesn't actually
 	 * affect the contents of the XLOG record, so we'll update our local copy
 	 * but not force a recomputation.  (If doPageWrites was just turned off,
 	 * we could recompute the record without full pages, but we choose not to
@@ -1035,7 +1037,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	}
 	doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
 
-	if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
+	if (doPageWrites &&
+		(!prevDoPageWrites ||
+		 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
 	{
 		/*
 		 * Oops, some buffer now needs to be backed up that the caller didn't
-- 
2.16.3



Re: Should we add GUCs to allow partition pruning to be disabled?

2018-04-20 Thread Ashutosh Bapat
On Fri, Apr 20, 2018 at 7:37 AM, Amit Langote
 wrote:
> On 2018/04/19 21:50, Ashutosh Bapat wrote:
>> On Thu, Apr 19, 2018 at 5:02 PM, Amit Langote
>>> I can imagine having a enable_partition_pruning which defaults to true, if
>>> only to avoid the performance overhead of pruning code when a user knows
>>> for sure that it won't help for some queries.  Although, I'm a bit dubious
>>> why they'd write such queries if they're using partitioning in the first
>>> place.
>>>
>>> Also, I'd think that enable_partition_pruning set to false means pruning
>>> doesn't occur at all, not even using constraint exclusion.  That is,
>>> behavior equivalent of constraint_exclusion < partition (that is, off/on).
>>>
>>> Also, if we do have such a GUC, it should apply to all command types,
>>> including UPDATE and DELETE which don't yet invoke the new pruning code,
>>> from the start.  So, if enable_partition_pruning is false, we won't load
>>> the partition constraints at all, which we currently do for UPDATE and
>>> DELETE so that constraint exclusion can be used for pruning.  OTOH, if
>>> enable_partition_pruning is on, we perform constraint exclusion -based
>>> pruning for UPDATE and DELETE irrespective of the setting of
>>> constraint_exclusion GUC.  In other words, we completely dissociate
>>> partitioned table pruning from the setting of constraint_exclusion.
>>
>> Doesn't the word "dissociate" turn the last sentence into a sentence
>> contradicting everything you wrote prior to it?
>>
>> I think we should keep these two things separate.
>
> Yes, that's what I meant.
>
> To clarify: if we're going to add a new parameter *for partitioned tables*
> to configure whether or not pruning occurs, even if UPDATE and DELETE now
> rely on constraint exclusion for pruning, we should ignore the setting of
> constraint_exclusion the configuration parameter.  For UPDATE and DELETE,
> if enable_partition_pruning is on, we proceed to prune using constraint
> exclusion (because that's the only method available now), irrespective of
> the setting of constraint_exclusion.
>
> So to users, enable_partition_pruning should be the only way to configure
> whether or not pruning occurs.
>
> Does that make sense?
>
> It seems like talking about the finer implementation details is making
> this discussion a bit confusing.
>
>> enable_partition_pruning affects the partition pruning based on the
>> partition bounds and that currently does not work for UPDATE/DELETE.
>> When it does work in those cases, we might think of not loading
>> partition bound based constraints. constraint_exclusion affects
>> whether constraints can be used to exclude a relation (with partition
>> option affecting the child tables). Once we stop loading partition
>> bound based constraints, constraint exclusion would stop pruning
>> partitions based on the bounds. There's no point in confusing users
>> by adding dependencies between these two GUCs.
>
> That's exactly what I'm trying to propose.

Not really. By pruning based on the partition bounds I didn't mean
constraint exclusion working on partition bound based constraints.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company