Re: WIP: Avoid creation of the free space map for small tables

2018-10-21 Thread John Naylor
On 10/16/18, Amit Kapila  wrote:
> I think we can avoid using prevBlockNumber and try_every_block, if we
> maintain a small cache which tells whether the particular block is
> tried or not.  What I am envisioning is that while finding the block
> with free space, if we came to know that the relation in question is
> small enough that it doesn't have FSM, we can perform a local_search.
> In local_search, we can enquire the cache for any block that we can try
> and if we find any block, we can try inserting in that block,
> otherwise, we need to extend the relation.  One simple way to imagine
> such a cache would be an array of structure and structure has blkno
> and status fields.  After we get the usable block, we need to clear
> the cache, if exists.

Here is the design I've implemented in the attached v6. There is more
code than v5, but there's a cleaner separation between freespace.c and
hio.c, as you preferred. I also think it's more robust. I've expended
some effort to avoid doing unnecessary system calls to get the number
of blocks.
--

For the local, in-memory map, maintain a static array of status
markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block
number. This is populated every time we call GetPageWithFreeSpace() on
small tables with no FSM. The statuses are

'zero' (beyond the relation)
'available to try'
'tried already'

Example for a 4-page heap:

01234567
AAAA0000

If we try block 3 and there is no space, we set it to 'tried' and next
time through the loop we'll try 2, etc:

01234567
AAAT0000

If we try all available blocks, we will extend the relation. As in the
master branch, first we call GetPageWithFreeSpace() again to see if
another backend extended the relation to 5 blocks while we were
waiting for the lock. If we find a new block, we will mark the new
block available and leave the rest alone:

01234567
TTTTA000

On the odd chance we still can't insert into the new block, we'll skip
checking any others and we'll redo the logic to extend the relation.

If we're about to successfully return a buffer, whether from an
existing block, or by extension, we clear the local map.
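
To make this concrete, the shape of the map and its search loop is
roughly the following (a sketch only; the names are illustrative, not
the identifiers actually used in v6):

/* Sketch of the local map described above -- names are made up. */
#define HEAP_FSM_CREATION_THRESHOLD 8

typedef enum
{
	FSM_LOCAL_ZERO,		/* beyond the relation */
	FSM_LOCAL_AVAIL,	/* available to try */
	FSM_LOCAL_TRIED		/* tried already */
} FSMLocalStatus;

static FSMLocalStatus fsm_local_map[HEAP_FSM_CREATION_THRESHOLD];

/* Populate the map for a small relation of nblocks pages. */
static void
fsm_local_set(BlockNumber nblocks)
{
	BlockNumber blkno;

	for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
		fsm_local_map[blkno] =
			blkno < nblocks ? FSM_LOCAL_AVAIL : FSM_LOCAL_ZERO;
}

/* Return the highest block still marked available, if any. */
static BlockNumber
fsm_local_search(void)
{
	BlockNumber blkno;

	for (blkno = HEAP_FSM_CREATION_THRESHOLD; blkno > 0; blkno--)
		if (fsm_local_map[blkno - 1] == FSM_LOCAL_AVAIL)
			return blkno - 1;

	return InvalidBlockNumber;
}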

Once this is in shape, I'll do some performance testing.

-John Naylor
From 529fa1f57946d70736b2304c2883213e45f7c077 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 22 Oct 2018 13:39:25 +0700
Subject: [PATCH v6] Avoid creation of the free space map for small tables.

The FSM isn't created if the heap has fewer than 8 blocks. If the last
known good block has insufficient space, try every block before extending
the heap.

If a heap with a FSM is truncated back to below the threshold, the FSM
stays around and can be used as usual.
---
 contrib/pageinspect/expected/page.out |  77 
 contrib/pageinspect/sql/page.sql  |  33 ++--
 src/backend/access/brin/brin.c|   2 +-
 src/backend/access/brin/brin_pageops.c|   8 +-
 src/backend/access/heap/hio.c |  57 --
 src/backend/commands/vacuumlazy.c |  17 +-
 src/backend/storage/freespace/freespace.c | 224 +-
 src/backend/storage/freespace/indexfsm.c  |   4 +-
 src/include/storage/freespace.h   |   7 +-
 9 files changed, 344 insertions(+), 85 deletions(-)

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 3fcd9fbe6d..83e5910453 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -1,48 +1,69 @@
 CREATE EXTENSION pageinspect;
-CREATE TABLE test1 (a int, b int);
-INSERT INTO test1 VALUES (16777217, 131584);
-VACUUM test1;  -- set up FSM
+CREATE TABLE test_rel_forks (a int);
+-- Make sure there are enough blocks in the heap for the FSM to be created.
+INSERT INTO test_rel_forks SELECT g from generate_series(1,10000) g;
+-- set up FSM and VM
+VACUUM test_rel_forks;
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
-SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0;
 main_0 
--------
   8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'main', 1)) AS main_1;
-ERROR:  block number 1 is out of range for relation "test1"
-SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
+ERROR:  block number 100 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
  fsm_0 
-------
   8192
 (1 row)
 
-SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1;
- fsm_1 
--------
-  8192
-(1 row)
-
-SELECT octet_length(get_raw_page('test1', 'vm', 0)) AS vm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
+ERROR:  block number 10 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0;
  vm_0 
------
  8192
 

Re: Side effect of CVE-2017-7484 fix?

2018-10-21 Thread Amit Langote
Hi,

On 2018/10/22 14:41, Stephen Frost wrote:
> Greetings,
> 
> * Dilip Kumar (dilipbal...@gmail.com) wrote:
>> As part of the security fix
>> (e2d4ef8de869c57e3bf270a30c12d48c2ce4e00c), we have restricted users
>> from accessing the statistics of a table if the user doesn't have
>> privileges on the table and the function is not leakproof.  Now, as a
>> side effect of this, if the user has privileges on the root
>> partitioned table but does not have privileges on the child tables,
>> the user will be able to access the data of the child tables but won't
>> be able to access their statistics.  This may result in a bad plan.
>> I am not sure what the fix should be.  Should we allow access to the
>> statistics of a table if the user has privileges on its parent table?
> 
> Yes...  If the user has access to the parent table then they can see the
> child tables, so they should be able to see the statistics on them.

Yeah, but I'd think only if the child tables are being accessed via
the parent table.

So, maybe, statistic_proc_security_check() added by that commit should
return true if IS_OTHER_REL(vardata->rel)?
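
For illustration, the suggestion amounts to a small addition to that
function, along these lines (a sketch of the idea, not a tested patch;
the surrounding function body is paraphrased from selfuncs.c):

static bool
statistic_proc_security_check(VariableStatData *vardata, Oid func_oid)
{
	if (vardata->acl_ok)
		return true;

	/* Proposed: child rels reached through a privileged parent are OK. */
	if (vardata->rel && IS_OTHER_REL(vardata->rel))
		return true;

	if (!OidIsValid(func_oid))
		return false;

	if (get_func_leakproof(func_oid))
		return true;

	ereport(DEBUG2,
			(errmsg_internal("not using statistics because function \"%s\" is not leak-proof",
							 get_func_name(func_oid))));
	return false;
}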

Thanks,
Amit




Re: Function to promote standby servers

2018-10-21 Thread Michael Paquier
On Sat, Oct 20, 2018 at 01:48:56PM +0900, Michael Paquier wrote:
> On Sat, Oct 20, 2018 at 06:24:28AM +0200, Laurenz Albe wrote:
>> Here is another version, with a fix in pg_proc.dat, an improved comment
>> and "wait_seconds" exercised in the regression test.
> 
> Thanks for the new version.  This looks pretty good to me.  I'll see if
> I can review it once and then commit.
> 
>> -WAIT_EVENT_SYNC_REP
>> +WAIT_EVENT_SYNC_REP,
>> +WAIT_EVENT_PROMOTE
>>  } WaitEventIPC;
> 
> Those are kept in alphabetical order.  Individual wait events are also
> documented with a description.

Regarding the documentation, wouldn't it be more adapted to list the new
function under the section "Recovery Control Functions"?  Not only does
the new function signal the postmaster, but it also creates the
promotion trigger file.

Anyway, I have been looking in depth at the patch, and I have finished
with the attached.  Here are some notes:
- pg_ctl returns an error if it cannot create the promote trigger file
or if it cannot close it.  pg_promote should do the same.
- WL_TIMEOUT could have been reached without any further retries being
attempted (remove, for example, the part related to the trigger file
creation and try to trigger the promotion: the wait time is handled
incorrectly).
- Documentation has been reworked.
- The new wait event is documented.
- A WARNING is raised if the signal cannot be sent to the postmaster,
which I think makes the most sense, as we do the same for other
signals.
- The position of the unistd.h include in xlogfuncs.c was not correct.
- Timeouts for the tests are made a tad longer.  Some buildfarm machines
don't like a promotion wait of 10s.
- A catalog version bump is included, just as a personal reminder.
- Indentation has been run.
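
For reference, the overall shape of the function described by these
notes would be roughly the following (a sketch reconstructed for
illustration, not the attached patch itself):

Datum
pg_promote(PG_FUNCTION_ARGS)
{
	bool		wait = PG_GETARG_BOOL(0);
	int			wait_seconds = PG_GETARG_INT32(1);
	FILE	   *promote_file;
	int			i;

	if (!RecoveryInProgress())
		ereport(ERROR,
				(errmsg("recovery is not in progress")));

	/* Error out if the trigger file cannot be created or closed. */
	if ((promote_file = AllocateFile(PROMOTE_SIGNAL_FILE, "w")) == NULL ||
		FreeFile(promote_file))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m",
						PROMOTE_SIGNAL_FILE)));

	/* Only a WARNING if the postmaster cannot be signaled. */
	if (kill(PostmasterPid, SIGUSR1) != 0)
	{
		ereport(WARNING,
				(errmsg("failed to send signal to postmaster: %m")));
		unlink(PROMOTE_SIGNAL_FILE);
		PG_RETURN_BOOL(false);
	}

	if (!wait)
		PG_RETURN_BOOL(true);

	/* Wait in one-second steps so WL_TIMEOUT cannot eat the retries. */
	for (i = 0; i < wait_seconds; i++)
	{
		if (!RecoveryInProgress())
			PG_RETURN_BOOL(true);

		CHECK_FOR_INTERRUPTS();
		WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
				  1000L, WAIT_EVENT_PROMOTE);
		ResetLatch(MyLatch);
	}

	PG_RETURN_BOOL(false);
}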

Laurenz, what do you think?
--
Michael
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5193df3366..96d45419e5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19202,6 +19202,9 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());

 pg_is_wal_replay_paused

+   
+pg_promote
+   

 pg_wal_replay_pause

@@ -19232,6 +19235,22 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
True if recovery is paused.

   
+  
+   
+pg_promote(wait boolean DEFAULT true, wait_seconds integer DEFAULT 60)
+
+   boolean
+   
+Promotes a physical standby server.  Returns true
+if promotion is successful and false otherwise.
+With wait set to true, the
+default, the function waits until promotion is completed or
+wait_seconds seconds have passed, otherwise the
+function returns immediately after sending the promotion signal to the
+postmaster.  This function is restricted to superusers by default, but
+other users can be granted EXECUTE to run the function.
+   
+  
   

 pg_wal_replay_pause()
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index ebcb3daaed..faf8e71854 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1471,14 +1471,17 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

 

-To trigger failover of a log-shipping standby server,
-run pg_ctl promote or create a trigger
-file with the file name and path specified by the trigger_file
-setting in recovery.conf. If you're planning to use
-pg_ctl promote to fail over, trigger_file is
-not required. If you're setting up the reporting servers that are
-only used to offload read-only queries from the primary, not for high
-availability purposes, you don't need to promote it.
+To trigger failover of a log-shipping standby server, run
+pg_ctl promote, call pg_promote,
+or create a trigger file with the file name and path specified by the
+trigger_file setting in
+recovery.conf. If you're planning to use
+pg_ctl promote or to call
+pg_promote to fail over,
+trigger_file is not required. If you're
+setting up the reporting servers that are only used to offload read-only
+queries from the primary, not for high availability purposes, you don't
+need to promote it.

   
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0484cfa77a..073f19542a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1268,7 +1268,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  Waiting in an extension.
 
 
- IPC
+ IPC
  BgWorkerShutdown
  Waiting for background worker to shut down.
 
@@ -1384,6 +1384,10 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  ProcArrayGroupUpdate
  Waiting for group leader to clear transaction id at transaction end.
 
+
+ P

Re: remove deprecated @@@ operator ?

2018-10-21 Thread Oleg Bartunov
On Sun, Oct 21, 2018 at 11:24 PM Tom Lane  wrote:
>
> Oleg Bartunov  writes:
> > The  commit 9b5c8d45f62bd3d243a40cc84deb93893f2f5122 is now 10+ years
> > old, may be we could remove deprecated @@@ operator ?
>
> Is it actually causing any problem?  AFAICS it's just a couple extra
> pg_operator entries, so why not leave it?
>
> I'd be +1 for removing it from the docs, though ...

attached a tiny patch for docs

>
> regards, tom lane



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


func.sgml.patch
Description: Binary data


Re: Side effect of CVE-2017-7484 fix?

2018-10-21 Thread David Fetter
On Mon, Oct 22, 2018 at 11:10:09AM +0530, Dilip Kumar wrote:
> As part of the security fix
> (e2d4ef8de869c57e3bf270a30c12d48c2ce4e00c), we have restricted users
> from accessing the statistics of a table if the user doesn't have
> privileges on the table and the function is not leakproof.
> Now, as a side effect of this, if the user has privileges on the
> root partitioned table but does not have privileges on the child
> tables, the user will be able to access the data of the child tables
> but won't be able to access their statistics.

Do we have any kind of quantitative idea of how much worse query
performance gets with this blind spot?

> This may result in a bad plan.  I am not sure what the fix should
> be.  Should we allow access to the statistics of a table if the
> user has privileges on its parent table?

In threat modeling terms, access to the statistics is an information
leak. If we just say "too bad" to the people who care a lot about
slowing information leaks, I'm guessing that would make them very
unhappy. Since the access controls are built for those people, I'd say
that we should prioritize performance optimizations for cases when
people haven't explicitly decided to trade performance for lower
information leak rates, which is to say for people who haven't put
those access controls on in the first place.

That's just my take, though. Another GUC to control this, perhaps?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Side effect of CVE-2017-7484 fix?

2018-10-21 Thread Stephen Frost
Greetings,

* Dilip Kumar (dilipbal...@gmail.com) wrote:
> As part of the security fix
> (e2d4ef8de869c57e3bf270a30c12d48c2ce4e00c), we have restricted users
> from accessing the statistics of a table if the user doesn't have
> privileges on the table and the function is not leakproof.  Now, as a
> side effect of this, if the user has privileges on the root
> partitioned table but does not have privileges on the child tables,
> the user will be able to access the data of the child tables but won't
> be able to access their statistics.  This may result in a bad plan.
> I am not sure what the fix should be.  Should we allow access to the
> statistics of a table if the user has privileges on its parent table?

Yes...  If the user has access to the parent table then they can see the
child tables, so they should be able to see the statistics on them.

That's my 2c on a quick review, at least.

Thanks!

Stephen


signature.asc
Description: PGP signature


Side effect of CVE-2017-7484 fix?

2018-10-21 Thread Dilip Kumar
As part of the security fix
(e2d4ef8de869c57e3bf270a30c12d48c2ce4e00c), we have restricted users
from accessing the statistics of a table if the user doesn't have
privileges on the table and the function is not leakproof.  Now, as a
side effect of this, if the user has privileges on the root
partitioned table but does not have privileges on the child tables,
the user will be able to access the data of the child tables but won't
be able to access their statistics.  This may result in a bad plan.
I am not sure what the fix should be.  Should we allow access to the
statistics of a table if the user has privileges on its parent table?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-10-21 Thread Pavel Stehule
Hi

On Sun, Oct 21, 2018 at 9:47 PM Paul A Jungwirth <
p...@illuminatedcomputing.com> wrote:

> On Sun, Oct 21, 2018 at 12:11 PM Heikki Linnakangas 
> wrote:
> > On 21/10/2018 21:17, Paul A Jungwirth wrote:
> > > 3. Build our own abstractions on top of ranges, and then use those to
> > > implement PERIOD-based features.
> > +1 on this approach. I think [7] got the model right. If we can
> > implement SQL-standard PERIODs on top of it, then that's a bonus, but
> > having sane, flexible, coherent set of range operators is more important
> > to me.
>
> Okay, I'm surprised to hear from you and Isaac that following the
> standard isn't as important as I thought, but I'm certainly pleased
> not to make it the focus. I just thought that Postgres's reputation
> was to be pretty careful about sticking to it. (I think we could still
> add a standard-compliant layer, but like you I don't feel a duty to
> suffer from it.) It sounds like I should work out some proposed
> function signatures and write up how to use them, and see what people
> think. Is that a useful approach?
>
>
It would be very unfortunate if we could not implement the standard
syntax and behavior.  Whether the implementation behind it is one thing
or another is not too important, but we should not accept any design
that doesn't allow the standard to be implemented.

The world is 10 years behind the standard (maybe more).  Now that this
feature is implemented in MySQL/MariaDB, I expect pressure for the
standardized syntax within 5 years.

Regards

Pavel


> > What are we missing?
>
> Here are a few big ones:
>
> 1. Define temporal primary keys and foreign keys that are known to the
> database catalog and controlled as higher-level objects. For instance
> I wrote an extension at https://github.com/pjungwir/time_for_keys to
> create temporal foreign keys, but the database isn't "aware" of them.
> That means they are more cluttered in `\d foo` than necessary (you see
> the trigger constraints instead of something about a foreign key),
> they don't automatically disappear if you drop the column, it is hard
> to make them "polymorphic" (My extension supports only
> int+tstzrange.), they don't validate that the referenced table has a
> declared temporal PK, they probably have slightly different
> locking/transaction semantics than the real RI code, etc. This is what
> I'd like to implement right now.
>
> 2. System time: automatically track DML changes to the table, and let
> you query "as of" a given time.
>
> 3. Temporal joins. I don't want to tackle this myself, because there
> is already an amazing proposed patch that does everything we could ask
> for at
> https://www.postgresql-archive.org/PROPOSAL-Temporal-query-processing-with-range-types-tt5913058.html
> (recently updated btw, so I hope someone will look at it!).
>
> 4. Temporal UPDATE/DELETE: these should be converted to instead change
> the end time of old rows and insert new rows with the changed
> attributes. I'm interested in implementing this too, but one thing at
> a time. . . .
>
> I really appreciate your sharing your thoughts!
>
> Paul
>
>


Re: Number of buckets/partitions of dshash

2018-10-21 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 12 Oct 2018 06:19:12 +, "Ideriha, Takeshi" 
 wrote in 
<4E72940DA2BF16479384A86D54D0988A6F1C1779@G01JPEXMBKW04>
> Hi, let me clarify my understanding about the $title.
> 
> It seems that the number of hash partitions is fixed at 128 in dshash and
> right now we cannot change it unless dshash.c code itself is changed, right?
> 
> According to the comment of dshash.c, DSHASH_NUM_PARTITIONS could be runtime 
> parameter in future.
> Are there someone working on it?
> (I want to do it, but TBH I cannot promise it because of some other work.)
> 
> In my current development for allocating catalog cache on the shared memory[1]
> I'm trying to put hash table of each CatCache to the shared memory using 
> dshash.
> The number of buckets for CatCache is pre-defined by cacheinfo, and most of
> them are under 128, like 8 or 16.
> This would cause some waste of memory on DSA because some partitions
> (buckets) are allocated but not used.
> 
> So I'm thinking that the current dshash design is still OK, but a flexible
> partition size would be appropriate for some use cases like mine.
> 
> Do you have any thoughts?

We could do that easily, but shouldn't we find a way to reduce or
eliminate the impact of locking first? dshash needs to hold
partition lock while the caller is examining a returned entry.
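
FWIW, the mechanical part of a runtime partition count looks small; a
rough sketch (the real dshash_table_control layout and partition macros
differ, and all names below are made up):

typedef struct dshash_table_control
{
	/* ... existing fields ... */
	int			num_partitions_log2;	/* fixed at dshash_create() time */
} dshash_table_control;

#define NUM_PARTITIONS(ctl) (((size_t) 1) << (ctl)->num_partitions_log2)

/* Map a hash to a partition, replacing the compile-time constant. */
static inline size_t
partition_for_hash(dshash_table_control *ctl, dshash_hash hash)
{
	return hash & (NUM_PARTITIONS(ctl) - 1);
}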


> [1] 
> https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: found xmin x from before relfrozenxid y

2018-10-21 Thread Tom Lane
Andres Freund  writes:
> On 2018-10-21 10:24:16 -0400, Tom Lane wrote:
>> (Well, I see the short answer: the code layer throwing the error
>> doesn't know.  But that could be fixed easily enough.)

> I wonder if the better approach wouldn't be to add an errcontext for
> vacuum, where we continually update the block number etc.  There's plenty
> of different sources of corruption that'd potentially cause debug
> messages or errors, and that should get most of them.

I did not chase this in detail, but it looked to me like there were
two code paths leading to heap_prepare_freeze_tuple, and only one
of them came from VACUUM.  So adding a Relation parameter seemed like
a more promising fix for it.  But possibly there are more error messages
we need to worry about besides this.

regards, tom lane



Re: pgsql: Avoid duplicate XIDs at recovery when building initial snapshot

2018-10-21 Thread Michael Paquier
(moving to -hackers)

On Sun, Oct 14, 2018 at 10:42:40AM -0700, Andres Freund wrote:
> I'm unhappy this approach was taken over objections. Without a real
> warning.

Oops, that was not clear to me.  Sorry about that!  I did not see you
objecting again after the last arguments I raised.  The end of PREPARE
TRANSACTION has been designed like that since it was introduced, so
changing the way the dummy GXACT is inserted before the main one has
its XID cleared does not sound wise to me.  The current design is
also present for a couple of reasons; please see this thread:
https://www.postgresql.org/message-id/13905.1119109...@sss.pgh.pa.us
This has resulted in e26b0abd.

Among the things I considered are:
- Clearing the XID at the same time the dummy entry is added, which
actually means holding ProcArrayLock longer while doing more at the
end of prepare.  I actually don't think you can do that cleanly without
endangering the transaction visibility for other backends, and syncrep
may cause the window to get wider.
- Changing GetRunningTransactionData so that duplicates are removed at
this stage.  However, this also requires holding ProcArrayLock for
longer.  For most deployments, if no dummy entries from 2PC transactions
are present, it would be possible to bypass the checks that remove the
duplicated entries, but if at least one dummy entry is found it would be
necessary to scan the whole ProcArray again, which would most likely be
the case at each checkpoint with workloads like what Konstantin
mentioned in the original report.  And ProcArrayLock is already a point
of contention for many OLTP workloads with small transactions.  So the
performance argument worries me.

Speaking of which, I have looked at the performance of qsort, and for a
couple of thousand entries, we may not see any impact.  But I am not
confident enough to say that it would be OK for all platforms each time
a standby snapshot is taken when sorting works on 4-byte elements, so
the performance argument from Konstantin seems quite sensible to me (see
the quickly-hacked qsort_perf.c attached).

> Even leaving the crummyness aside, did you check other users of
> XLOG_RUNNING_XACTS, e.g. logical decoding? 

I actually spent some time checking that, so it is not innocent.
SnapBuildWaitSnapshot() waits for transactions to commit or abort based
on the XIDs in the record.  And that's the only place where those XIDs
are used so it won't matter to wait twice for the same transaction to
finish.  The error callback would be used only once.
--
Michael
/*
 * Measure performance of qsort:
 *  qsort_perf number_elements
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
int_compare(const void *arg1, const void *arg2)
{
	int int1 = *(const int *) arg1;
	int int2 = *(const int *) arg2;

	if (int1 > int2)
		return 1;
	if (int1 < int2)
		return -1;
	return 0;
}

int
main(int argc, char **argv)
{
	int	num_elts, i;
	int	*elts;

	if (argc != 2)
	{
		fprintf(stderr, "Incorrect number of arguments\n");
		exit(1);
	}

	num_elts = atoi(argv[1]);

	elts = (int *) malloc(num_elts * sizeof(int));

	for (i = 0; i < num_elts; i++)
		elts[i] = random();

	fprintf(stdout, "Beginning %u\n", (unsigned) time(NULL));
	qsort(elts, num_elts, sizeof(int), int_compare);
	fprintf(stdout, "Finishing %u\n", (unsigned) time(NULL));

	exit(0);
}


signature.asc
Description: PGP signature


Re: relhassubclass and partitioned indexes

2018-10-21 Thread Amit Langote
Hi,

On 2018/10/22 11:09, Michael Paquier wrote:
> On Fri, Oct 19, 2018 at 06:46:15PM +0900, Amit Langote wrote:
>> Thanks.  Attached a patch to set relhassubclass when an index partition is
>> added to a partitioned index.
> 
> Thanks, committed after adding a test with ALTER TABLE ONLY, and
> checking upgrades as well as ATTACH partition for ALTER INDEX and ALTER
> TABLE.  In all cases the same code paths are taken.

Thank you for committing with those changes.  I didn't know about create
index on "only".

>>  /* update pg_inherits, if needed */
>>  if (OidIsValid(parentIndexRelid))
>> +{
>>  StoreSingleInheritance(indexRelationId, parentIndexRelid, 1);
>>  
>> +/* Also, set the parent's relhassubclass. */
>> +SetRelationHasSubclass(parentIndexRelid, true);
>> +}
> 
> Calling SetRelationHasSubclass() updates pg_class for this parent
> relation.  We would need CommandCounterIncrement() if the tuple gets
> updated, but that's not the case as far as I checked for all code paths
> where this gets called.  This would be seen immediately, by the way.

I assume you're talking about avoiding getting into a situation that
results in "ERROR: tuple concurrently updated".

AFAICS, acquire_inherited_sample_rows() *does* perform CCI because, as the
comment says, the parent's pg_class row may already have been updated in
that case:

 /* CCI because we already updated the pg_class row in this command */
 CommandCounterIncrement();
 SetRelationHasSubclass(RelationGetRelid(onerel), false);

In all the other cases, SetRelationHasSubclass() seems to be the first time
that the parent's pg_class row is updated.

Thanks,
Amit




Re: relhassubclass and partitioned indexes

2018-10-21 Thread Michael Paquier
On Fri, Oct 19, 2018 at 06:46:15PM +0900, Amit Langote wrote:
> Thanks.  Attached a patch to set relhassubclass when an index partition is
> added to a partitioned index.

Thanks, committed after adding a test with ALTER TABLE ONLY, and
checking upgrades as well as ATTACH partition for ALTER INDEX and ALTER
TABLE.  In all cases the same code paths are taken.

>   /* update pg_inherits, if needed */
>   if (OidIsValid(parentIndexRelid))
> + {
>   StoreSingleInheritance(indexRelationId, parentIndexRelid, 1);
>  
> + /* Also, set the parent's relhassubclass. */
> + SetRelationHasSubclass(parentIndexRelid, true);
> + }

Calling SetRelationHasSubclass() updates pg_class for this parent
relation.  We would need CommandCounterIncrement() if the tuple gets
updated, but that's not the case as far as I checked for all code paths
where this gets called.  This would be seen immediately, by the way.
--
Michael


signature.asc
Description: PGP signature


Re: found xmin x from before relfrozenxid y

2018-10-21 Thread Andres Freund
Hi,

On 2018-10-21 10:24:16 -0400, Tom Lane wrote:
> Johannes Graën  writes:
> > after upgrading to version 11, I see the error pattern "found xmin x
> > from before relfrozenxid y" in different databases on different hosts.
> > From https://www.postgresql.org/docs/10/static/release-10-3.html, I
> > learned that this was an error caused by pg_upgrade, which apparently
> > had been fixed in 10.3. This page also states that refreshing the
> > affected materialized view non-concurrently would fix the problem.
> > My question is now how to infer the affected materialized view from the
> > error message. Is there a way to tell which one to refresh from the xmin
> > or relfrozenxid value?
> 
> No :-(.  I wonder why in the world we didn't make that error message
> include the relation and block number the tuple was found in.

Because it was a really complicated bugfix already, I don't think the
answer is more complicated than that.


> (Well, I see the short answer: the code layer throwing the error
> doesn't know.  But that could be fixed easily enough.)

I wonder if the better approach wouldn't be to add an errcontext for
vacuum, where we continually update the block number etc.  There's plenty
of different sources of corruption that'd potentially cause debug
messages or errors, and that should get most of them.

Greetings,

Andres Freund



Re: More issues with pg_verify_checksums and checksum verification in base backups

2018-10-21 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Sun, Oct 21, 2018 at 11:03:30AM -0400, Stephen Frost wrote:
> > This doesn't change my opinion of the bigger question though, which is
> > to what extent we should be implicitly supporting extensions and
> > whatever else putting files into the database and tablespace
> > directories.
> 
> Well, the whole point is that I have never seen it stated anywhere that
> it is forbidden for extensions to drop files into global/ and/or base/.
> I am pretty sure that I'd actually want to be able to do that myself, by
> the way.  If I had pluggable storage APIs and the possibility to write a
> storage engine myself out of the box, I would like to be able to rely
> on the default tablespace path and use other tablespace paths.

All of this pie-in-the-sky about what pluggable storage might have is
just hand-waving, in my opinion, and not worth much more than that.  I
hope (and suspect..) that the actual pluggable storage that's being
worked on doesn't do any of this "just drop a file somewhere" because
there's a lot of downsides to it- and if it did, it wouldn't be much
more than what we can do with an FDW, so why go through and add it?

Considering the example given doesn't today do that (maybe it once did?)
as you mentioned up-thread, it seems like maybe there was a realization
that it's not a good idea even without this issue around pg_basebackup
and pg_verify_checksums.

> > Even if we go with this proposed change to look at the relation
> > filename, I'd be happier with some kind of warning being thrown when we
> > come across files we don't recognize in directories that aren't intended
> > to have random files showing up.
> 
> Yes, that could be something we could do, as an option I would guess as
> this does not match with what v10 does.  I'll wait for more people to
> provide input on this thread before answering more, but if possible I
> think that we should focus on fixing v11 and HEAD first, then try to
> figure out what we'd want to do later on.

pg_basebackup works the way it does in v10 because we've accepted
letting users drop files into the root of PGDATA and even encouraged it
to some extent.  I don't think there was ever a serious intent that it
would be used to back up an extension's self-created files on the
filesystem in tablespaces, since there's no way for it to know how to do
so in a way that would ensure that they're consistent or useful or
sensible to backup online.

What are you thinking the 'option' would look like?  Everything I
come up with seems awful confusing and not very sensible.

Thanks!

Stephen


signature.asc
Description: PGP signature


RE: WAL archive (archive_mode = always) ?

2018-10-21 Thread Tsunakawa, Takayuki
From: Adelino Silva [mailto:adelino.j.si...@googlemail.com]
> What is the advantage to use archive_mode = always in a slave server compared
> to archive_mode = on (shared WAL archive) ?
> 
> I only see duplication of Wal files, what is the purpose of this feature ?

This also saves network bandwidth by not sending the WAL archive from
the primary to the standby(s).  Network bandwidth can be costly between
remote regions used for disaster recovery.


Regards
Takayuki Tsunakawa





Re: More issues with pg_verify_checksums and checksum verification in base backups

2018-10-21 Thread Michael Paquier
On Sun, Oct 21, 2018 at 11:03:30AM -0400, Stephen Frost wrote:
> This doesn't change my opinion of the bigger question though, which is
> to what extent we should be implicitly supporting extensions and
> whatever else putting files into the database and tablespace
> directories.

Well, the whole point is that I have never seen it stated anywhere that
it is forbidden for extensions to drop files into global/ and/or base/.
I am pretty sure that I'd actually want to be able to do that myself, by
the way.  If I had pluggable storage APIs and the possibility to write a
storage engine myself out of the box, I would like to be able to rely
on the default tablespace path and use other tablespace paths.

> Even if we go with this proposed change to look at the relation
> filename, I'd be happier with some kind of warning being thrown when we
> come across files we don't recognize in directories that aren't intended
> to have random files showing up.

Yes, that could be something we could do, as an option I would guess as
this does not match with what v10 does.  I'll wait for more people to
provide input on this thread before answering more, but if possible I
think that we should focus on fixing v11 and HEAD first, then try to
figure out what we'd want to do later on.
--
Michael


signature.asc
Description: PGP signature


Re: Patch to avoid SIGQUIT accident

2018-10-21 Thread Renato dos Santos
I agree with your arguments.  What if, instead, we put in an option to
disable this at compile time, or a setting in the psql CLI?

On Sun, Oct 21, 2018, 20:20 Tom Lane  wrote:

> Renato dos Santos  writes:
> > I have been using psql for quite a few days. And I have accidentally
> pressed the CTRL + \ keys that sends the signal QUIT+Coredump (CTRL+4 also
> sends the same signal).
> > I hope it's relevant to more people. (This has bothered me.)
>
> > this patch avoids the exit of the CLI by making ctrl + \ the same as
> > ctrl + c
>
> I'm fairly confused about why this would be a good idea, for several
> reasons:
>
> * If you have a tendency to make that typo, why would you want a fix that
> only affects psql and not any of your other applications?  (If you do
> want the latter, there are already ways to do it, eg you could remap
> SIGQUIT to some other key via stty, or disable core dumps via ulimit.)
>
> * If we put this in, what becomes of people who actually want a core dump,
> eg for debugging?
>
> * SIGQUIT is a fairly well-known way to get out of an application when all
> else fails.  People who aren't familiar with psql's exit commands might
> find it pretty unfriendly of us to block this off.
>
> regards, tom lane
>


Re: Patch to avoid SIGQUIT accident

2018-10-21 Thread Tom Lane
Renato dos Santos  writes:
> I have been using psql for quite a few days. And I have accidentally pressed 
> the CTRL + \ keys that sends the signal QUIT+Coredump (CTRL+4 also sends the 
> same signal).
> I hope it's relevant to more people. (This has bothered me.)

> this patch avoids the exit of the CLI by making ctrl + \ the same as ctrl + c

I'm fairly confused about why this would be a good idea, for several
reasons:

* If you have a tendency to make that typo, why would you want a fix that
only affects psql and not any of your other applications?  (If you do
want the latter, there are already ways to do it, eg you could remap
SIGQUIT to some other key via stty, or disable core dumps via ulimit.)

* If we put this in, what becomes of people who actually want a core dump,
eg for debugging?

* SIGQUIT is a fairly well-known way to get out of an application when all
else fails.  People who aren't familiar with psql's exit commands might
find it pretty unfriendly of us to block this off.

regards, tom lane



Re: Pluggable Storage - Andres's take

2018-10-21 Thread Alexander Korotkov
On Thu, Oct 18, 2018 at 6:28 AM Haribabu Kommi  wrote:
> On Tue, Oct 16, 2018 at 6:06 AM Alexander Korotkov 
>  wrote:
>>  * 0002-add-costing-function-to-API.patch – adds function for costing
>> sequential and table sample scan to tableam API.  zheap costing
>> function are now copies of heap costing function.  This should be
>> adjusted in future.
>
> This patch misses the new *_cost.c files that add the specific cost
> functions.

Thank you for noticing.  Revised patchset is attached.

>>   Estimation for heap lookup during index scans
>> should be also pluggable, but not yet implemented (TODO).
>
> Yes, Is it possible to use the same API that is added by above
> patch?

I'm not yet sure.  I'll elaborate more on that.  I'd like to keep
number of costing functions small.  Handling of costing of index scan
heap fetches will probably require function signature change.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-remove-extra-snapshot-functions-2.patch
Description: Binary data


0002-add-costing-function-to-API-2.patch
Description: Binary data


Re: [PATCH] XLogReadRecord returns pointer to currently read page

2018-10-21 Thread Heikki Linnakangas

On 17/08/2018 06:47, Andrey Lepikhov wrote:

I propose a patch to fix one small code defect.
The XLogReadRecord() function reads the pages of a WAL segment that
contain a WAL record.  Then it creates a readRecordBuf buffer in the
backend's private memory and copies the record from the pages to the
readRecordBuf buffer.  The 'record' pointer is set to the beginning of
the readRecordBuf buffer.

But if the WAL record fits entirely on one page, the XLogReadRecord()
function forgets to rebind the "record" pointer to the beginning of the
readRecordBuf buffer.  In this case, XLogReadRecord() returns a pointer
to an internal xlog page.  This patch fixes the defect.


Hmm. I agree it looks weird the way it is. But I wonder, why do we even 
copy the record to readRecordBuf, if it fits on the WAL page? Returning 
a pointer to the xlog page buffer seems OK in that case. What if we 
remove the memcpy(), instead?
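
For context, the defect and the proposed fix presumably amount to
something like this in XLogReadRecord()'s single-page branch (sketched
from the description above, not quoted from the patch):

	/* Record does not cross a page boundary */
	memcpy(state->readRecordBuf, record, total_len);
	record = (XLogRecord *) state->readRecordBuf;	/* the missing rebinding */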


- Heikki



Re: remove deprecated @@@ operator ?

2018-10-21 Thread Tom Lane
Oleg Bartunov  writes:
> The  commit 9b5c8d45f62bd3d243a40cc84deb93893f2f5122 is now 10+ years
> old, may be we could remove deprecated @@@ operator ?

Is it actually causing any problem?  AFAICS it's just a couple extra
pg_operator entries, so why not leave it?

I'd be +1 for removing it from the docs, though ...

regards, tom lane



remove deprecated @@@ operator ?

2018-10-21 Thread Oleg Bartunov
Hello,

The  commit 9b5c8d45f62bd3d243a40cc84deb93893f2f5122 is now 10+ years
old, may be we could remove deprecated @@@ operator ?


Author: Tom Lane 
Date:   Mon Apr 14 17:05:34 2008 +

Push index operator lossiness determination down to GIST/GIN opclass
"consistent" functions, and remove pg_amop.opreqcheck, as per recent
discussion.  The main immediate benefit of this is that we no longer need
8.3's ugly hack of requiring @@@ rather than @@ to test weight-using tsquery
searches on GIN indexes.  In future it should be possible to optimize some
other queries better than is done now, by detecting at runtime whether the
index match is exact or not.

Tom Lane, after an idea of Heikki's, and with some help from Teodor.

Best Regards,
Oleg
-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-10-21 Thread Paul A Jungwirth
On Sun, Oct 21, 2018 at 12:11 PM Heikki Linnakangas  wrote:
> On 21/10/2018 21:17, Paul A Jungwirth wrote:
> > 3. Build our own abstractions on top of ranges, and then use those to
> > implement PERIOD-based features.
> +1 on this approach. I think [7] got the model right. If we can
> implement SQL-standard PERIODs on top of it, then that's a bonus, but
> having sane, flexible, coherent set of range operators is more important
> to me.

Okay, I'm surprised to hear from you and Isaac that following the
standard isn't as important as I thought, but I'm certainly pleased
not to make it the focus. I just thought that Postgres's reputation
was to be pretty careful about sticking to it. (I think we could still
add a standard-compliant layer, but like you I don't feel a duty to
suffer from it.) It sounds like I should work out some proposed
function signatures and write up how to use them, and see what people
think. Is that a useful approach?

> What are we missing?

Here are a few big ones:

1. Define temporal primary keys and foreign keys that are known to the
database catalog and controlled as higher-level objects. For instance
I wrote an extension at https://github.com/pjungwir/time_for_keys to
create temporal foreign keys, but the database isn't "aware" of them.
That means they are more cluttered in `\d foo` than necessary (you see
the trigger constraints instead of something about a foreign key),
they don't automatically disappear if you drop the column, it is hard
to make them "polymorphic" (My extension supports only
int+tstzrange.), they don't validate that the referenced table has a
declared temporal PK, they probably have slightly different
locking/transaction semantics than the real RI code, etc. This is what
I'd like to implement right now.

2. System time: automatically track DML changes to the table, and let
you query "as of" a given time.

3. Temporal joins. I don't want to tackle this myself, because there
is already an amazing proposed patch that does everything we could ask
for at 
https://www.postgresql-archive.org/PROPOSAL-Temporal-query-processing-with-range-types-tt5913058.html
(recently updated btw, so I hope someone will look at it!).

4. Temporal UPDATE/DELETE: these should be converted to instead change
the end time of old rows and insert new rows with the changed
attributes. I'm interested in implementing this too, but one thing at
a time. . . .

I really appreciate your sharing your thoughts!

Paul



Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-10-21 Thread Heikki Linnakangas

On 21/10/2018 21:17, Paul A Jungwirth wrote:

3. Build our own abstractions on top of ranges, and then use those to
implement PERIOD-based features. This is the least clear option, and I
imagine it would require a lot more design effort. Our range types are
already a step in this direction. Does anyone think this approach has
promise? If so I can start thinking about how we'd do it. I imagine we
could use a lot of the ideas in [7].
...
[7] C. J. Date, Hugh Darwen, Nikos Lorentzos. Time and Relational
Theory, Second Edition: Temporal Databases in the Relational Model and
SQL. 2nd edition, 2014.


+1 on this approach. I think [7] got the model right. If we can 
implement SQL-standard PERIODs on top of it, then that's a bonus, but 
having sane, flexible, coherent set of range operators is more important 
to me.


What are we missing? It's been years since I read that book, but IIRC 
temporal joins is one thing, at least. What features do you have in mind?


- Heikki



Re: SQL:2011 PERIODS vs Postgres Ranges?

2018-10-21 Thread Isaac Morland
On Sun, 21 Oct 2018 at 14:18, Paul A Jungwirth 
wrote:

> Also, just how strictly do we have to follow the standard? Requiring
> sentinels like '01 JAN 3000` just seems so silly. Could Postgres
> permit nullable start/end PERIOD columns, and give them the same
> meaning as ranges (unbounded)? Even if I forgot about ranges
> altogether, I'd sure love to avoid these sentinels.
>

We have "infinity" and "-infinity" values in our date and timestamp types:

https://www.postgresql.org/docs/current/static/datatype-datetime.html

I think this avoids the silliness with sentinel values.

For myself, I don't care about PERIOD etc. one bit. The "every new
capability gets its own syntax" model that SQL follows is very
old-fashioned, and for good reason. I'm happy with ranges and exclusion
constraints. But if we can provide an implementation of PERIOD that makes
it easier to port applications written for legacy database systems, it
might be worthwhile.


SQL:2011 PERIODS vs Postgres Ranges?

2018-10-21 Thread Paul A Jungwirth
Hello,

I'm interested in contributing some temporal database functionality to
Postgres, starting with temporal primary and foreign keys. I know some
other folks nearby interested in helping out, too. But before we begin
I'd like to ask the community about complying with the SQL:2011
standard [1] for these things.

In SQL:2011, temporal features all build upon PERIODs, which are a new
concept you can attach to tables. Each PERIOD is composed of a start
column and an end column (both of some date/time type). You define
PERIODs when you CREATE TABLE or ALTER TABLE. Then you refer to the
periods when you create primary keys or foreign keys to make them
temporal. There are also a handful of new operators for testing two
ranges for overlap/succession/etc.[2] Most PERIODs are for tracking
the history of a *thing* over time, but if the PERIOD is named
SYSTEM_TIME it instead tracks the history of changes to *your
database*.[3] (Google for "bitemporal" to read more about this.)

Personally I think PERIODs are quite disappointing. They are not part
of relational theory. They are not a column, but something else. If
you say `SELECT * FROM t` you don't get `PERIODs` (as far as I can
tell). But you can mention PERIODs approximately wherever you can
mention columns [4], so now we have to support them when projecting,
selecting, joining, aggregating, etc. (Or if we are permitted to not
support them in some of those places, isn't that even worse?)

You can see that PERIODs share a lot with Postgres's own range types.
But ranges are a real column, requiring no special-case behavior,
either for RDBMS implementers or SQL users. They have a richer set of
operators.[5] They don't require any special declarations to put them
in a table. They aren't limited to just date/time types. You can even
define new range types yourself (e.g. I've found it helpful before to
define inetrange and floatrange). Also the start/end columns of a
PERIOD must be non-nullable,[6] so that unbounded ranges must use
sentinels like `01 JAN ` or `01 JAN 3000` instead. Also there is
no way (as far as I can tell) to define and use a period within a
subquery or CTE or view. Many of these criticisms of PERIODs you can
find in [7], pages 403 - 410 (where "interval" means basically our own
range types), plus others: for example PERIODs are always closed/open,
you can only have a single application PERIOD per table, they are
wordy, etc.

I expect that any Postgres implementation of the standard would wind
up using ranges internally. For example a temporal primary key would
use an exclusion constraint based on a range expression, so if you had
a PERIOD defined on columns named `valid_start` and `valid_end`, the
PK would use something like `EXCLUDE USING gist (id WITH =,
tstzrange(valid_start, valid_end) WITH &&)`. Also the new SQL:2011
operators would be easy to implement on top of our range operators.
And then a temporal foreign key implementation would use either those
or raw range operators.

So is there any way for Postgres to offer the same temporal features,
but give users the choice of using either PERIODs or ranges? If we
built that, would the community be interested in it? I think there are
several possible ways to go about it:

1. Permit defining PERIODs on either a start/end column pair, or an
existing range column. Then everything else continues to use PERIODs.
This seems tidy to implement, although since it acquiesces to the
PERIOD-based approach for temporal functionality, it doesn't solve all
the problems above. Also as [9] points out, it would lead to
incompatibilities in the new `information_schema` views. E.g.
`periods` is supposed to have `start_column_name` and
`end_column_name` columns.[8]

2. Permit either ranges or PERIODs in the new syntax, e.g. `PRIMARY
KEY (id, valid_at WITHOUT OVERLAPS)` where `valid_at` is either a
PERIOD or a range column. Similarly with foreign keys. There is
probably some `information_schema` messiness here too, but perhaps
less than with #1. This seems like a great alternative to
application-time PERIODs, but I'm not sure how you'd tell Postgres to
use a range column for the system-time dimension.[3] Perhaps just a
function, and then the PERIOD of `SYSTEM_TIME` would call that
function (with a range expression).

3. Build our own abstractions on top of ranges, and then use those to
implement PERIOD-based features. This is the least clear option, and I
imagine it would require a lot more design effort. Our range types are
already a step in this direction. Does anyone think this approach has
promise? If so I can start thinking about how we'd do it. I imagine we
could use a lot of the ideas in [7].

4. Just give up and follow the standard to the letter. I'm not
enthusiastic about this, but I also really want temporal features, so
I might still do the work if that's what folks preferred.

Left to my own devices I would probably go with a mix of #2 & #3,
where temporal functionality is exposed by a layer of public 

Patch to avoid SIGQUIT accident

2018-10-21 Thread Renato dos Santos
Hello,

I have been using psql for quite a few days, and I have accidentally
pressed the CTRL + \ keys, which send the signal QUIT+Coredump (CTRL+4
also sends the same signal).  I hope it's relevant to more people.
(This has bothered me.)

This patch avoids the exit of the CLI by making ctrl + \ the same as
ctrl + c.

Renato dos Santos
shaz...@gmail.com
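
The patch itself is attached as binary data, but presumably it amounts
to something like the following in psql's signal setup (a guess; the
handler name is an assumption based on psql's existing SIGINT handling):

/* Route SIGQUIT (Ctrl+\) to the handler psql already installs for
 * SIGINT (Ctrl+C), so it cancels the query instead of dumping core. */
pqsignal(SIGINT, handle_sigint);
pqsignal(SIGQUIT, handle_sigint);	/* the proposed addition */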

psql.patch
Description: Binary data


smime.p7s
Description: S/MIME cryptographic signature


Re: More issues with pg_verify_checksums and checksum verification in base backups

2018-10-21 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> This is a follow-up of the following thread:
> https://www.postgresql.org/message-id/20181012010411.re53cwcistcpi...@alap3.anarazel.de

Thanks for starting this thread Michael.

> pg_verify_checksums used first a blacklist to decide if files have
> checksums or not.  So with this approach all files should have checksums
> except files like pg_control, pg_filenode.map, PG_VERSION, etc.

So, this is a great example: pg_control actually *does* have a CRC and
we really should be checking it in tools like pg_verify_checksums and
pg_basebackup.  Similarly, we should be trying to get to a point where
we have checksums for anything else that we actually care about.
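
For reference, pg_control keeps its CRC in the last field of
ControlFileData, so verifying it is cheap; roughly (a sketch using the
server's CRC macros, assuming a ControlFileData already read from
disk):

pg_crc32c	crc;

INIT_CRC32C(crc);
COMP_CRC32C(crc, (char *) ControlFile, offsetof(ControlFileData, crc));
FIN_CRC32C(crc);

if (!EQ_CRC32C(crc, ControlFile->crc))
	fprintf(stderr, "pg_control CRC mismatch\n");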

> d55241a has added a first fix to silence the buildfarm for EXEC_BACKEND
> builds.  After discussion, Andres has pointed out that some extensions
> may want to drop files within global/ or base/ as part of their logic.

> cstore_fdw was mentioned but if you look at their README that's not the
> case.  

So the one example that's been put forward doesn't actually do this..?

> However, I think that Andres argument is pretty good as with
> pluggable storage we should allow extensions to put custom files for
> different tablespaces.  

Andres wasn't arguing, that I saw at least, that pluggable storage
would result in random files showing up in tablespace directories that
the core code has no knowledge of.  In fact, he seemed to be saying in
20181019205724.ewnuhvrsanaci...@alap3.anarazel.de that pluggable storage
results in the core code being aware of the files that those other
storage mechanisms create, so this entire line of argument seems to be
without merit.

> So this commit has changed the logic of
> pg_verify_checksums to use a whitelist, which assumes that only normal
> relation files have checksums:
> 
> <digits>
> <digits>.<segment>
> <digits>_<forkname>
> <digits>_<forkname>.<segment>
>
> After more discussion on the thread mentioned above, Stephen has pointed
> out that base backups use the same blacklist logic when verifying
> checksums.  This has the same problem with EXEC_BACKEND-specific files,
> resulting in spurious warnings (that's an example!) which are confusing
> for the user:
> $ pg_basebackup -D popo
> WARNING:  cannot verify checksum in file "./global/config_exec_params",
> block 0: read buffer size 5 and page size 8192 differ

Stephen also extensively argued that extensions shouldn't be dropping
random files into PG's database and tablespace directories and that we
shouldn't be writing code which attempts to work around that (and,
indeed, ultimately can't since technically extension authors could drop
files into those directories which match these "whitelist patterns").
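
For concreteness, the whitelist check at issue boils down to a small
filename matcher along these lines (a sketch, not the committed code;
the helper name is made up):

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Accept <digits>, <digits>.<segment>, <digits>_<forkname>, and
 * <digits>_<forkname>.<segment>. */
static bool
looks_like_rel_file_name(const char *fn)
{
	int			pos = 0;

	/* Mandatory relfilenode digits. */
	while (isdigit((unsigned char) fn[pos]))
		pos++;
	if (pos == 0)
		return false;

	/* Optional fork suffix. */
	if (fn[pos] == '_')
	{
		if (strncmp(fn + pos, "_fsm", 4) == 0)
			pos += 4;
		else if (strncmp(fn + pos, "_vm", 3) == 0)
			pos += 3;
		else if (strncmp(fn + pos, "_init", 5) == 0)
			pos += 5;
		else
			return false;
	}

	/* Optional segment number. */
	if (fn[pos] == '.')
	{
		int			segstart = ++pos;

		while (isdigit((unsigned char) fn[pos]))
			pos++;
		if (pos == segstart)
			return false;
	}

	return fn[pos] == '\0';
}

Note that nothing stops an extension from creating a file whose name
happens to match these patterns, which is exactly the hole described
above.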

Further, using a whitelist risks possibly missing files that should be
included, unlike having an exclude list which ensures that we actually
check everything and complain about anything found that's out of the
ordinary.  Additionally, having these checks would hopefully make it
clear that people shouldn't be dropping random files into our database
and tablespace directories- something we didn't try to do for the root
of the data directory, resulting in, frankly, a big mess.  Allowing
random files to exist in the data directory has lead to quite a few
issues including things like pg_basebackup breaking because of a
root-owned file or similar that can't be read, and this extends that.

I thought the point of this new thread was to encourage discussion of
that point and the pros and cons seen for each side, not to only include
one side of it, so I'm rather disappointed.

> Folks on this thread agreed that both pg_verify_checksums and checksum
> verification for base backups should use the same logic.  It seems to me
> that using a whitelist-based approach for both is easier to maintain as
> the patterns of files supporting checksums is more limited than files
> which do not support checksums.  So this way we allow extensions
> bundling custom files to still work with pg_verify_checksums and
> checksum verification in base backups.

This is not an accurate statement: those random files which some
extension drops into these directories don't "work" with
pg_verify_checksums; this just makes pg_verify_checksums ignore them.
That is not the same thing at all.  Worse, we end up with things like
pg_basebackup copying/backing up those files, but skipping them for
validation checking, when we have no reason to expect that those files
will be at all valid when they're copied and no way to see if they are
valid in the resulting restore.

Other parts of the system will continue to similairly do things that
people might not expect- DROP DATABASE will happily nuke any file it
finds, no matter if it matches these patterns or not.

Basically, the way I see it, at least, is that we should either maintain
that PG's database and tablespace directories are under the purview of
PG and other things shouldn't be putting files there without the core
code's knowledge, or we decide that it's ok for ran

Re: found xmin x from before relfrozenxid y

2018-10-21 Thread Tom Lane
Johannes Graën  writes:
> after upgrading to version 11, I see the error pattern "found xmin x
> from before relfrozenxid y" in different databases on different hosts.
> From https://www.postgresql.org/docs/10/static/release-10-3.html, I
> learned that this was an error caused by pg_upgrade, which apparently
> had been fixed in 10.3. This page also states that refreshing the
> affected materialized view non-concurrently would fix the problem.
> My question is now how to infer the affected materialized view from the
> error message. Is there a way to tell which one to refresh from the xmin
> or relfrozenxid value?

No :-(.  I wonder why in the world we didn't make that error message
include the relation and block number the tuple was found in.

(Well, I see the short answer: the code layer throwing the error
doesn't know.  But that could be fixed easily enough.)

In the meantime, the only answer I can think of offhand is to manually
do VACUUM FREEZE on each of your MVs, and then refresh anything that
shows up with an error.

regards, tom lane



More issues with pg_verify_checksums and checksum verification in base backups

2018-10-21 Thread Michael Paquier
Hi all,

This is a follow-up of the following thread:
https://www.postgresql.org/message-id/20181012010411.re53cwcistcpi...@alap3.anarazel.de

In a nutshell, b34e84f1 has added TAP tests for pg_verify_checksums, and
the buildfarm has immediately complained about files generated by
EXEC_BACKEND missing, causing pg_verify_checksums to fail for such
builds.  An extra issue has been noticed by Andres in the tool: it
fails to check for temporary files, hence files like
base/pgsql_tmp/pgsql_tmp$PID.NN.sharedfileset/1.1 can also cause the
tool to fail.  After a crash, those files would not be removed, and
stopping the instance would still let them around.

pg_verify_checksums used first a blacklist to decide if files have
checksums or not.  So with this approach all files should have checksums
except files like pg_control, pg_filenode.map, PG_VERSION, etc.

d55241a has added a first fix to silence the buildfarm for EXEC_BACKEND
builds.  After discussion, Andres has pointed out that some extensions
may want to drop files within global/ or base/ as part of their logic.
cstore_fdw was mentioned but if you look at their README that's not the
case.  However, I think that Andres' argument is pretty good, as with
pluggable storage we should allow extensions to put custom files for
different tablespaces.  So this commit has changed the logic of
pg_verify_checksums to use a whitelist, which assumes that only normal
relation files have checksums:

<digits>
<digits>.<segment>
<digits>_<forkname>
<digits>_<forkname>.<segment>

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b20f6c379c..207e2facb8 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -72,7 +72,6 @@ static void parse_basebackup_options(List *options, basebackup_options *opt);
 static void SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli);
 static int	compareWalFileNames(const void *a, const void *b);
 static void throttle(size_t increment);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
 
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
@@ -187,17 +186,6 @@ static const char *excludeFiles[] =
 	NULL
 };
 
-/*
- * List of files excluded from checksum validation.
- */
-static const char *const noChecksumFiles[] = {
-	"pg_control",
-	"pg_filenode.map",
-	"pg_internal.init",
-	"PG_VERSION",
-	NULL,
-};
-
 
 /*
  * Called when ERROR or FATAL happens in perform_base_backup() after
@@ -1101,8 +1089,7 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
 
 		/* Exclude all forks for unlogged tables except the init fork */
 		if (isDbDir &&
-			parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
-&relForkNum))
+			ParseRelationFileName(de->d_name, &relOidChars, &relForkNum))
 		{
 			/* Never exclude init forks */
 			if (relForkNum != INIT_FORKNUM)
@@ -1312,32 +1299,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
 	return size;
 }
 
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
-	const char *const *f;
-
-	/* Check that the file is in a tablespace */
-	if (strncmp(fullpath, "./global/", 9) == 0 ||
-		strncmp(fullpath, "./base/", 7) == 0 ||
-		strncmp(fullpath, "/", 1) == 0)
-	{
-		/* Compare file against noChecksumFiles skiplist */
-		for (f = noChecksumFiles; *f; f++)
-			if (strcmp(*f, filename) == 0)
-return false;
-
-		return true;
-	}
-	else
-		return false;
-}
 
 /*
  * Functions for handling tar file format
@@ -1391,13 +1352,15 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		char	   *filename;
 
 		/*
-		 * Get the filename (excluding path).  As last_dir_separator()
-		 * includes the last directory separator, we chop that off by
-		 * incrementing the pointer.
+		 * Check if a given file should have its checksums verified or not.
+		 * Only relation files should have this property now.  First get the
+		 * filename (excluding path).  As last_dir_separator() includes the
+		 * last directory separator, we chop that off by incrementing the
+		 * pointer.
 		 */
 		filename = last_dir_separator(readfilename) + 1;
 
-		if (is_checksummed_file(readfilename, filename))
+		if (ParseRelationFileName(filename, NULL, NULL))
 		{
 			verify_checksum = true;
 
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 74ff6c359b..eae32e9a5d 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -186,8 +186,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 			unlogged_relation_entry ent;
 
 			/* Skip anything that doesn't look like a relation data file. */
-			if (!parse_fil