Re: Cache lookup errors with functions manipulating object addresses

2018-09-17 Thread Michael Paquier
On Fri, Sep 14, 2018 at 12:07:23PM -0300, Alvaro Herrera wrote:
> On 2018-Sep-14, Alvaro Herrera wrote:
>> I haven't looked at 0003 yet.

Thanks for the review.  I have pushed 0002 for now.  I had one doubt
about 0001 though.  So as to avoid redesigning the APIs for FDWs and
servers again in the future, wouldn't we want to give them a uint32
flags argument into which more than one option can be OR'd?  There
would be only one option for now, which is what I have done in the
attached.
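
For what it's worth, usage on the caller side would then look roughly like
this (a minimal sketch based on the attached patch, not an excerpt from it):

#include "postgres.h"
#include "foreign/foreign.h"

/*
 * Sketch of a caller using the flags-based API from the attached patch:
 * with FDW_MISSING_OK the lookup returns NULL for an undefined object
 * instead of raising an error, so the caller decides what to do.
 */
static bool
fdw_exists(Oid fdwid)
{
	ForeignDataWrapper *fdw;

	fdw = GetForeignDataWrapperExtended(fdwid, FDW_MISSING_OK);
	return fdw != NULL;
}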

> It's strange that pg_identify_object returns empty type in only some
> cases (as seen in the regression test output) ...

Are you referring to something in particular?  All the routines used in
objectaddress.c have been part of core for some time now; in particular
the handling of operators, functions and types has been sort of messy.
It is of course possible to reach a state in the code where we have
options to get NULL results for all those things.  For example,
format_procedure_internal could be refactored in a way similar to what
happens for format_type_extended.  And format_type_extended could gain a
FORCE_NULL flag which makes sure that on an undefined object
objectaddress.c sees a NULL object.  That was one of the points raised
upthread that I still wanted to discuss.  One thing is that I am not
sure if it is worth complicating the existing interface even more just
to deal with undefined objects as pure NULLs.  Thoughts are welcome.

> And this one definitely does not make sense:
> 
> +SELECT * FROM pg_identify_object('pg_class'::regclass, 'pg_class'::regclass, -8); -- no column for relation
> +     type     |   schema   |   name   |      identity
> +--------------+------------+----------+---------------------
> + table column | pg_catalog | pg_class | pg_catalog.pg_class
> +(1 row)

Indeed, this one is not intentional, and can be easily addressed by
first checking for the attribute's existence.  This is corrected in the
attached version.
--
Michael
From f4dd9741833e4499c1d7698dc869e2f8b9bfb5ce Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 18 Sep 2018 14:17:09 +0900
Subject: [PATCH 1/2] Extend lookup routines for FDW and foreign server with
 NULL handling

The cache lookup routines for foreign-data wrappers and foreign servers
are extended with an extra argument to pass flags.  The only value which
can be used now indicates whether a missing object should result in an
error or not.  Those new routines are added into the existing set and
documented accordingly.
---
 doc/src/sgml/fdwhandler.sgml  | 34 +
 src/backend/foreign/foreign.c | 36 +--
 src/include/foreign/foreign.h | 16 
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 4ce88dd77c..6fe6122568 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1408,6 +1408,23 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
 
 
 ForeignDataWrapper *
+GetForeignDataWrapperExtended(Oid fdwid, uint32 flags);
+
+
+ This function returns a ForeignDataWrapper
+ object for the foreign-data wrapper with the given OID.  A
+ ForeignDataWrapper object contains properties
+ of the FDW (see foreign/foreign.h for details).
+ flags is a bitwise-or'd bit mask indicating
+ an extra set of options.  It can take the value
+ FDW_MISSING_OK, in which case a NULL
+ result is returned to the caller instead of an error for an undefined
+ object.
+
+
+
+
+ForeignDataWrapper *
 GetForeignDataWrapper(Oid fdwid);
 
 
@@ -1420,6 +1437,23 @@ GetForeignDataWrapper(Oid fdwid);
 
 
 ForeignServer *
+GetForeignServerExtended(Oid serverid, uint32 flags);
+
+
+ This function returns a ForeignServer object
+ for the foreign server with the given OID.  A
+ ForeignServer object contains properties
+ of the server (see foreign/foreign.h for details).
+ flags is a bitwise-or'd bit mask indicating
+ an extra set of options.  It can take the value
+ FSV_MISSING_OK, in which case a NULL
+ result is returned to the caller instead of an error for an undefined
+ object.
+
+
+
+
+ForeignServer *
 GetForeignServer(Oid serverid);
 
 
diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c
index eac78a5d31..88726d9d70 100644
--- a/src/backend/foreign/foreign.c
+++ b/src/backend/foreign/foreign.c
@@ -33,6 +33,18 @@
  */
 ForeignDataWrapper *
 GetForeignDataWrapper(Oid fdwid)
+{
+	return GetForeignDataWrapperExtended(fdwid, 0);
+}
+
+
+/*
+ * GetForeignDataWrapperExtended -	look up the foreign-data wrapper
+ * by OID. If flags uses FDW_MISSING_OK, return NULL if the object cannot
+ * be found instead of raising an error.
+ */
+ForeignDataWrapper *
+GetForeignDataWrapperExtended(Oid fdwid, uint32 flags)
 {
 	Form_pg_foreign_data_wrapper fdwform;
 	ForeignDataWrapper *fdw;
@@ -43,7 

Re: Multiple primary key on partition table?

2018-09-17 Thread amul sul
On Mon, Sep 17, 2018 at 9:06 PM amul sul  wrote:
>
> Nice catch Rajkumar.
>
> In index_check_primary_key(), relationHasPrimaryKey() is called only for an
> alter command, but I think we need to call it in this case as well, like this:
>
> diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
> index 7eb3e35166..c8714395fe 100644
> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -223,7 +223,7 @@ index_check_primary_key(Relation heapRel,
>  * and CREATE INDEX doesn't have a way to say PRIMARY KEY, so it's no
>  * problem either.
>  */
> -   if (is_alter_table &&
> +   if ((is_alter_table || heapRel->rd_rel->relispartition) &&
> relationHasPrimaryKey(heapRel))
> {
> ereport(ERROR,
>
> Thoughts?
>

Here is the complete patch that proposes the aforesaid fix, along with a regression test.

Regards,
Amul
From e936138f4befce29f5dfda1df37f1ab9acf0c8de Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Tue, 18 Sep 2018 01:40:38 -0400
Subject: [PATCH] multiple primary key restriction for a partition table

---
 src/backend/catalog/index.c| 10 +-
 src/test/regress/expected/indexing.out |  3 +++
 src/test/regress/sql/indexing.sql  |  2 ++
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7eb3e35166..a18816f312 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -218,12 +218,12 @@ index_check_primary_key(Relation heapRel,
 	int			i;
 
 	/*
-	 * If ALTER TABLE, check that there isn't already a PRIMARY KEY. In CREATE
-	 * TABLE, we have faith that the parser rejected multiple pkey clauses;
-	 * and CREATE INDEX doesn't have a way to say PRIMARY KEY, so it's no
-	 * problem either.
+	 * If ALTER TABLE or CREATE TABLE .. PARTITION OF, check that there isn't
+	 * already a PRIMARY KEY.  In CREATE TABLE for an ordinary relation, we
+	 * have faith that the parser rejected multiple pkey clauses; and CREATE
+	 * INDEX doesn't have a way to say PRIMARY KEY, so it's no problem either.
 	 */
-	if (is_alter_table &&
+	if ((is_alter_table || heapRel->rd_rel->relispartition) &&
 		relationHasPrimaryKey(heapRel))
 	{
 		ereport(ERROR,
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index b9297c98d2..b9d11bf397 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -781,6 +781,9 @@ Indexes:
 "idxpart_pkey" PRIMARY KEY, btree (a)
 Number of partitions: 0
 
+-- multiple primary key on child should fail
+create table failpart partition of idxpart (b primary key) for values from (0) to (100);
+ERROR:  multiple primary keys for table "failpart" are not allowed
 drop table idxpart;
 -- but not if you fail to use the full partition key
 create table idxpart (a int unique, b int) partition by range (a, b);
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 2091a87ff5..25224b6924 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -399,6 +399,8 @@ drop table idxpart;
 -- Verify that it works to add primary key / unique to partitioned tables
 create table idxpart (a int primary key, b int) partition by range (a);
 \d idxpart
+-- multiple primary key on child should fail
+create table failpart partition of idxpart (b primary key) for values from (0) to (100);
 drop table idxpart;
 
 -- but not if you fail to use the full partition key
-- 
2.18.0



Difference in TO_TIMESTAMP results.

2018-09-17 Thread Prabhat Sahu
Hi All,

I have found below difference in TO_TIMESTAMP results.

postgres[114552]=# select to_timestamp('15-07-1984 23:30:32','dd-mm-yyyy hh24:mi:ss');
       to_timestamp
----------------------------
 1984-07-15 23:30:32+05:30
(1 row)

postgres[114552]=# select to_timestamp('15-07-84 23:30:32','dd-mm-yyyy hh24:mi:ss');
         to_timestamp
------------------------------
 0084-07-15 23:30:32+05:53:28
(1 row)

My doubt is about the *":28"* in the timezone part of the 2nd result; is it expected?

-- 


With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company


Re: Getting ERROR: could not open file "base/13164/t3_16388" with partition table with ON COMMIT

2018-09-17 Thread Rajkumar Raghuwanshi
On Fri, Sep 14, 2018 at 7:23 AM, Amit Langote  wrote:

> On 2018/09/13 23:13, Tom Lane wrote:
> > Amit Langote  writes:
> >> On 2018/09/13 1:14, Tom Lane wrote:
> >>> That seems excessively restrictive.  Anything that has storage (e.g.
> >>> matviews) ought to be truncatable, no?
> >
> >> Not by heap_truncate it seems.  The header comment of heap_truncate says
> >> that it concerns itself only with ON COMMIT truncation of temporary
> tables:
> >
> > Ah.  Well, in that case I'm OK with just a simple test for
> > RELKIND_RELATION, but I definitely feel that it should be inside
> > heap_truncate.  Callers don't need to know about the limited scope
> > of what that does.
>
> I guess you meant inside heap_truncate_one_rel.  I updated the patch that
> way, but I wonder if we shouldn't also allow other relkinds that have
> storage which RelationTruncate et al can technically deal with.
>
Verified. This patch fixed the issue.


Re: Query is over 2x slower with jit=on

2018-09-17 Thread Amit Khandekar
On 18 September 2018 at 03:20, Tom Lane  wrote:
> Amit Khandekar  writes:
> I think we better show per-worker jit info also.
>
> Just to throw a contrarian opinion into this: I find the current EXPLAIN
> output for JIT to be insanely verbose already.

One option is to display the jit info on a single line, with the
optimize and inlining counters shown only when the corresponding
optimization was attempted. Currently they are shown even if they are
disabled.

In the patch, I have kept the per-worker output verbose-only.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Query is over 2x slower with jit=on

2018-09-17 Thread Amit Khandekar
On 14 September 2018 at 16:48, Amit Khandekar  wrote:
> On 11 September 2018 at 14:50, Amit Khandekar  wrote:
>> On 10 September 2018 at 21:39, Andres Freund  wrote:
  /*
 + * Add up the workers' JIT instrumentation from dynamic shared memory.
 + */
 +static void
 +ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
 +                                       SharedJitInstrumentation *shared_jit)
 +{
 + int n;
 + JitContext *jit = planstate->state->es_jit;
 +
 + /* If the leader hasn't yet created a jit context, allocate one now. */
 + if (!jit)
 + {
 + planstate->state->es_jit = jit =
 + jit_create_context(planstate->state->es_jit_flags);
 + }
>>>
>>> Wouldn't it be better to move the jit instrumentation to outside of the
>>> context, to avoid having to do this?  Or just cope with not having
>>> instrumentation for the leader in this case?  We'd kinda need to deal
>>> with failure to create one anyway?
>>
>> Yeah, I think taking the instrumentation out of the context looks
>> better. Will work on that.
>
> Not yet done this. Will do now.

Attached v3 patch that does the above change.

The jit instrumentation counters are used by the provider code
directly. So I think even if we keep the instrumentation outside of
the context, we should let the resource manager handle deallocation of
the instrumentation, similar to how it deallocates the whole jit
context. So we should allocate the instrumentation in
TopMemoryContext. Otherwise, if the instrumentation is allocated in
the per-query context and the provider functions access it after the
Portal resource is cleaned up, the instrumentation would already have
been de-allocated.

So for this, I thought we had better have two separate contexts: a base
context (i.e. the usual JitContext) and a provider-specific context.
JitContext has a new field provider_context pointing to the LLVMJitContext,
and LLVMJitContext has a 'base' field pointing to the base context
(i.e. the JitContext).
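
Roughly, the linkage I have in mind looks like the following (only a sketch;
apart from provider_context and base, the members shown are illustrative and
not taken from the patch):

typedef struct JitContext
{
	int			flags;				/* illustrative: per-context JIT flags */
	void	   *provider_context;	/* points to the provider's LLVMJitContext */
} JitContext;

typedef struct LLVMJitContext
{
	JitContext *base;				/* back-pointer to the base JitContext */
	/* ... LLVM-specific state (modules, compiled functions, ...) ... */
} LLVMJitContext;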

So in ExecParallelRetrieveJitInstrumentation(), where the jit context
is created if it's NULL, I now create just the base context. Later,
on-demand the JitContext->provider_context will be allocated in
llvm_compile().

This way, we can release both the base context and provider context
during ResourceOwnerRelease(), and at the same time continue to be
able to access the jit instrumentation from the provider.

---

BTW There is this code in llvm_compile_expr() on HEAD that does not
handle the !parent case correctly :

/* get or create JIT context */
if (parent && parent->state->es_jit)
{
   context = (LLVMJitContext *) parent->state->es_jit;
}
else
{
   context = llvm_create_context(parent->state->es_jit_flags);

   if (parent)
   {
      parent->state->es_jit = &context->base;
   }
}

Here, llvm_create_context() will crash if parent is NULL. Is it that
parent can never be NULL?  I think so, because in jit_compile_expr()
we skip everything if there is no plan state. So probably there should
be an Assert(parent != NULL) rather than a check on parent. Or
else, we need to decide what JIT flags to supply to
llvm_create_context() if we let the provider create the context
without a plan state. For now, in the changes that I did to the above
snippet, I have kept a TODO.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


jit_instr_account_workers_v3.patch
Description: Binary data


Re: Problem while setting the fpw with SIGHUP

2018-09-17 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 18 Sep 2018 11:38:50 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180918.113850.164570138.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 6 Sep 2018 16:37:28 -0700, Michael Paquier  
> wrote in <20180906233728.gr2...@paquier.xyz>
> > I am finally coming back to this patch set, and that's one of the first
> > things I am going to help moving on for this CF.  And this bit from the
> > last patch series is not acceptable as there are some parameters which
> > are used by the startup process which can be reloaded.  One of them is
> > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> 
> The third patch actually is not mandatory in the patch set. The
> only motive for it is that it does nothing. The handler for SIGHUP
> just sets got_SIGHUP and then wakes up the process, and the variable
> is not looked up by anyone. If you mind the change, it can be
> removed with no side effect.

I was wrong here. It was handled in HandleStartupProcInterrupts
called from StartupXLOG. So, it should be just removed from the
set. Sorry for the bogus patch.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2018-09-17 Thread Pavel Stehule
On Mon, Sep 17, 2018 at 23:15, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:

> On Mon, Sep 17, 2018 at 5:36 PM Pavel Stehule 
> wrote:
> > On Mon, Sep 17, 2018 at 2:05, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> >> On Fri, Aug 10, 2018 at 6:26 AM Andrew Dunstan
> >>  wrote:
> >> > On 01/24/2018 04:30 AM, Pavel Stehule wrote:
> >> > >
> >> > > I am sending updated version.
> >> > >
> >> > > Very much thanks for very precious review
> >> >
> >> > Thomas,
> >> >
> >> > I am unable to replicate the Linux failure seen in the cfbot on my
> >> > Fedora machine. Both when building with libxml2 and without, after
> >> > applying the latest patch the tests pass without error. Can you please
> >> > investigate what's going on here?
> >>
> >> Well this is strange...  I can't reproduce the problem either with or
> >> without --with-libxml on a Debian box (was trying to get fairly close
> >> to the OS that Travis runs on).  But I see the same failure when I
> >> apply the patch on my FreeBSD 12 laptop and test without
> >> --with-libxml.  Note that when cfbot runs it, the patch is applied
> >> with FreeBSD patch, and then it's tested without --with-libxml on
> >> Ubuntu (Travis's default OS).  [Side note: I should change it to build
> >> --with-libxml, but that's not the point.]  So the common factor is a
> >> different patch implementation.  I wonder if a hunk is being
> >> misinterpreted.
> >
> >
> > This patch is not too large. Please, can me send a related files, I can
> check it manually.
>
> I confirmed that xml_1.out is different depending on which 'patch' you
> use.  I've attached the output from FreeBSD patch 2.0-12u11 and GNU
> patch 2.5.8.  It's an interesting phenomenon, probably due to having a
> huge long file with a lot of repeated text and slightly different
> algorithms or parameters, but I don't think you need to worry about it
> for this.  Sorry for the distraction.
>

ok. Thank you for info

Regards

Pavel

>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: Changing the setting of wal_sender_timeout per standby

2018-09-17 Thread Michael Paquier
On Tue, Sep 18, 2018 at 11:20:03AM +0900, Kyotaro HORIGUCHI wrote:
> +1, and we need a means to see the actual value, in
> pg_stat_replication?

Well, being able to see what another session is using as settings is
not a trivial problem, perhaps not worth solving, and orthogonal to
what's discussed here...
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-17 Thread Michael Paquier
On Tue, Sep 18, 2018 at 09:11:43AM +0900, Michael Paquier wrote:
> What I think I broke is that CreateFile ignores what _fmode uses, which
> has caused the breakage, while calling directly open() or fopen() does
> the work.  There are also other things assuming that binary mode is
> used, you can grep for "setmode" and see how miscinit.c or pg_basebackup
> do the job.

I have been playing with this stuff, and hacked the attached.  Now, while
the TAP tests of initdb and pgbench are happy (I can actually see the past
failure as well), pg_dump complains at authentication time when using
plain-text mode with databases with all-ASCII characters.  That's
not something I expected at first, but _get_fmode also influences
operations like pipe(), which is something that pg_dump uses, and
setmode is enforced to binary mode only when adapted.

I am getting to wonder if what's present on HEAD actually represents the
best deal we could get.  Attached is the patch I used for reference.
Thoughts?
--
Michael
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index b53d6eb9cc..cb8c7450d9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -491,15 +491,7 @@ readfile(const char *path)
 	char	   *buffer;
 	int			c;
 
-#ifdef WIN32
-	/*
-	 * On Windows, we have to open the file in text mode so that carriage
-	 * returns are stripped.
-	 */
-	if ((infile = fopen(path, "rt")) == NULL)
-#else
 	if ((infile = fopen(path, "r")) == NULL)
-#endif
 	{
 		fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
 progname, path, strerror(errno));
diff --git a/src/port/open.c b/src/port/open.c
index a3ad946a60..f7a296ef28 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -71,6 +71,28 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 		 _O_SHORT_LIVED | O_DSYNC | O_DIRECT |
 		 (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) == fileFlags);
 
+	/*
+	 * If caller has set neither O_BINARY nor O_TEXT, then look for what
+	 * the default mode is for this process, then enforce it.
+	 */
+	if ((fileFlags & (O_BINARY | O_TEXT)) == 0)
+	{
+		int		pmode = 0;
+
+		/* only MSVC 2015 and newer support _get_fmode */
+#if (_MSC_VER >= 1900)
+		if (_get_fmode(&pmode) < 0)
+		{
+			/* get_fmode sets errno */
+			return -1;
+		}
+#else
+		pmode = _fmode;
+#endif
+
+		fileFlags |= pmode;
+	}
+
 	sa.nLength = sizeof(sa);
 	sa.bInheritHandle = TRUE;
 	sa.lpSecurityDescriptor = NULL;


signature.asc
Description: PGP signature


Re: Problem while setting the fpw with SIGHUP

2018-09-17 Thread Kyotaro HORIGUCHI
At Thu, 6 Sep 2018 16:37:28 -0700, Michael Paquier  wrote 
in <20180906233728.gr2...@paquier.xyz>
> On Tue, Aug 28, 2018 at 07:34:36PM +0900, Kyotaro HORIGUCHI wrote:
> > Thanks for prompting. The difference is in a comment and I'm fine
> > with the change.
> 
> /*
>  * Properly accept or ignore signals the postmaster might send us.
>  */
> -   pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> +   pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> 
> I am finally coming back to this patch set, and that's one of the first
> things I am going to help moving on for this CF.  And this bit from the
> last patch series is not acceptable as there are some parameters which
> are used by the startup process which can be reloaded.  One of them is
> wal_retrieve_retry_interval for tuning when fetching WAL at recovery.

The third patch actually is not mandatory in the patch set. The
only motive for it is that it does nothing. The handler for SIGHUP
just sets got_SIGHUP and then wakes up the process, and the variable
is not looked up by anyone. If you mind the change, it can be
removed with no side effect.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Changing the setting of wal_sender_timeout per standby

2018-09-17 Thread Kyotaro HORIGUCHI
At Fri, 14 Sep 2018 18:22:33 +0900, Masahiko Sawada  
wrote in 
> On Thu, Sep 13, 2018 at 12:32 PM, Michael Paquier  wrote:
> > On Thu, Sep 13, 2018 at 01:14:12AM +, Tsunakawa, Takayuki wrote:
> >> Some customer wants to change the setting per standby, i.e., a shorter
> >> timeout for a standby in the same region to enable faster failure
> >> detection and failover, and a longer timeout for a standby in the remote
> >> region (for disaster recovery) to avoid mis-judging its health.
> >
> > This argument is sensible.
> >
> >> The current PGC_HUP allows to change the setting by editing
> >> postgresql.conf or ALTER SYSTEM and then sending SIGHUP to a specific
> >> walsender.  But that's not easy to use.  The user has to do it upon
> >> every switchover and failover.
> >>
> >> With PGC_BACKEND, the user would be able to tune the timeout as follows:
> >>
> >> [recovery.conf]
> >> primary_conninfo = '... options=''-c wal_sender_timeout=6'' ...'
> >>
> >> With PGC_USERSET, the user would be able to use different user
> >> accounts for each standby, and tune the setting as follows:
> >>
> >> ALTER USER repluser_remote SET wal_sender_timeout = 6;
> >
> > It seems to me that switching to PGC_BACKEND would already cover all the
> > use-cases you are mentioning, as at the end one would just want to
> > adjust the WAL sender timeout on a connection basis depending on the
> > geographical location of the receiver and the latency between primary
> > and standby.
> 
> +1 for PGC_BACKEND. It looks enough for most use cases.

+1, and we need a means to see the actual value, in
pg_stat_replication?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-17 Thread Thomas Munro
On Tue, Sep 18, 2018 at 1:15 AM Chris Travers  wrote:
> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin  wrote:
>> With the patch applied, the posix_fallocate loop terminated right away
>> (because of the QueryCancelPending flag being set to true) and the backend
>> went through the cleanup, showing an ERROR about cancelling due to the
>> conflict with recovery.  Without the patch, it looped indefinitely in
>> dsm_impl_posix_resize, while the startup process was looping forever,
>> trying to send SIGUSR1.

Thanks for testing!

>> One thing I’m wondering is whether we could do the same by just blocking
>> SIGUSR1 for the duration of posix_fallocate?
>
> If we were to do that, I would say we should mask all signals we can mask 
> during the call.
>
> I don't have a problem going down that road instead if people think it is 
> better.

We discussed that when adding posix_fallocate() and decided that
retrying is better:

https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de
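
For reference, the retry approach boils down to something like this (a
sketch of the idea only, not necessarily the exact hunk in the attached
patch):

	/*
	 * Retry posix_fallocate() on EINTR, but give up once an interrupt such
	 * as a query cancel is pending, so that repeated SIGUSR1s from another
	 * process cannot pin us in this loop forever.
	 */
	int			rc;

	do
	{
		rc = posix_fallocate(fd, 0, size);
	} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));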

Here is a patch that I propose to commit and back-patch to 9.4.  I
just wrote a suitable commit message, edited the comments lightly and
fixed some whitespace.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Allow-DSM-allocation-to-be-interrupted.patch
Description: Binary data


Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 09/18/2018 12:01 AM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> >> But the trick is that if the read sees the effect of the write somewhere
> >> in the middle of the page, the next read is guaranteed to see all the
> >> preceding new data.
> > 
> > If that's guaranteed then we can just check the LSN and be done.
> 
> What do you mean by "check the LSN"? Compare it to LSN from the first
> read? You don't know if the first read already saw the new LSN or not
> (see the next example).

Hmm, ok, I can see your point there.  I've been going back and forth
between checking against what the prior LSN was on the page and checking
it against an independent source (like the last checkpoint's LSN), but..

[...]

> Comparing the page LSN to the last checkpoint LSN solves this, because
> if the LSN is older than the checkpoint LSN, that write must have been
> completed by now, and so we're not in danger of seeing only incomplete
> effects of it. And newer write will update the LSN.

Yeah, that makes sense- we need to be looking at something which only
gets updated once the write has actually completed, and the last
checkpoint's LSN gives us that guarantee.

> > The problem that we aren't solving for is if, somehow, we do a read(8K)
> > and get the first half/second half mixup and then on a subsequent
> > read(8K) we see that *again*, implying that somehow the kernel's copy
> > has the latter-half of the page updated consistently but not the first
> > half.  That's a problem that I haven't got a solution to today.  I'd
> > love to have a guarantee that it's not possible- we've certainly never
> > seen it but it's been a concern and I thought Michael was suggesting
> > he'd seen that, but it sounds like there wasn't a check on the LSN in
> > the first read, in which case it could have just been a 'regular' torn
> > page case.
> 
> Well, yeah. If that would be possible, we'd be in serious trouble. I've
> done quite a bit of experimentation with concurrent reads and writes and
> I have not observed such behavior. Of course, that's hardly a proof it
> can't happen, and it wouldn't be the first surprise with respect to
> kernel I/O this year ...

I'm glad to hear that you've done a lot of experimentation in this area
and haven't seen such strange behavior happen- we've got quite a few
people running pgBackRest with checksum-checking and haven't seen it
either, but it's always been a bit of a concern.

> You're right it's not about the fsync, sorry for the confusion. My point
> is that using the checkpoint LSN gives us a guarantee that write is no
> longer in progress, and so we can't see a page torn because of it. And
> if we see a partial write due to a new write, it's guaranteed to update
> the page LSN (and we'll notice it).

Right, no worries about the confusion, I hadn't been fully thinking
through the LSN bit either and that what we really need is some external
confirmation of a write having *completed* (not just started) and that
makes a definite difference.

> > Right, I'm in agreement with doing that and it's what is done in
> > pgbasebackup and pgBackRest.
> 
> OK. All I'm saying is pg_verify_checksums should probably do the same
> thing, i.e. grab checkpoint LSN and roll with that.

Agreed.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra


On 09/18/2018 12:01 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> >>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra wrote:
>>> Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>>>
>>> How? Either it’s possible for the latter half of a page to be updated
>>> before the first half (where the LSN lives), or it isn’t. If it’s
>>> possible then that LSN could be ancient and it wouldn’t matter. 
>>
>> I'm not sure I understand what you're saying here.
>>
>> It is not about the latter half of the page being updated before the first
>> half. I don't think that's quite possible, because write() into page cache
>> does in fact write the data sequentially.
> 
> Well, maybe 'updated before' wasn't quite the right way to talk about
> it, but consider if a read(8K) gets only half-way through the copy
> before having to go do something else and by the time it gets back, a
> write has come in and rewritten the page, such that the read(8K)
> returns half-old and half-new data.
> 
>> The problem is that the write is not atomic, and AFAIK it happens in
>> sectors (which are either 512B or 4K these days). And it may arbitrarily
>> interleave with reads.
> 
> Yes, of course the write isn't atomic, that's clear.
> 
>> So you may do write(8k), but it actually happens in 512B chunks and a
>> concurrent read may observe some mix of those.
> 
> Right, I'm not sure that we really need to worry about sub-4K writes
> though I suppose they're technically possible, but it doesn't much
> matter in this case since the LSN is early on in the page, of course.
> 
>> But the trick is that if the read sees the effect of the write somewhere
>> in the middle of the page, the next read is guaranteed to see all the
>> preceding new data.
> 
> If that's guaranteed then we can just check the LSN and be done.
> 

What do you mean by "check the LSN"? Compare it to LSN from the first
read? You don't know if the first read already saw the new LSN or not
(see the next example).

>> Without the checkpoint we risk seeing the same write() both in read and
>> re-read, just in a different stage - so the LSN would not change, making
>> the check futile.
> 
> This is the part that isn't making much sense to me.  If we are
> guaranteed that writes into the kernel cache are always in order and
> always at least 512B in size, then if we check the LSN first and
> discover it's "old", and then read the rest of the page and calculate
> the checksum, discover it's a bad checksum, and then go back and re-read
> the page then we *must* see that the LSN has changed OR conclude that
> the checksum is invalidated.
> 

Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.

Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.

So let's say a read sees the first 2kB of data like this:

[new, new, new, old, new, old, new, old]

OK, the page is obviously torn, checksum fails, and we try reading it
again. We should see new data at least until the last 'new' chunk in the
first read, so let's say we got this:

[new, new, new, new, new, new, new, old]

Obviously, this page is also torn (there are old data at the end), but
we've read the new data in both cases, which includes the LSN. So the
LSN is the same in both cases, and your detection fails.

Comparing the page LSN to the last checkpoint LSN solves this, because
if the LSN is older than the checkpoint LSN, that write must have been
completed by now, and so we're not in danger of seeing only incomplete
effects of it. And newer write will update the LSN.
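
As a sketch, the check this enables looks something like the following
(illustrative only, with made-up helper names, not the actual
pg_verify_checksums code):

	/*
	 * Only report a checksum failure when the page LSN is older than the
	 * checkpoint LSN captured at the start of the run, i.e. when no write
	 * that could have torn the page can still be in progress.
	 */
	if (!checksum_ok(buf, blkno))
	{
		reread_page(fd, blkno, buf);
		if (!checksum_ok(buf, blkno) &&
			PageGetLSN((Page) buf) < start_checkpoint_lsn)
			report_corruption(relpath, blkno);
	}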

> The reason this can happen in the first place is that our 8K read might
> only get half-way done before getting scheduled off and a 8K write
> happened on the page before our read(8K) gets back to finishing the
> read, but if what you're saying is true, then we can't ever have a case
> where such a thing would happen and a re-read would still see the "old"
> LSN.
> 
> If we check the LSN first and discover it's "new" (as in, more recent
> than our last checkpoint, or the checkpoint where the backup started)
> then, sure, there's going to be a risk that the page is currently being
> written right that moment and isn't yet completely valid.
> 

Right.

> The problem that we aren't solving for is if, somehow, we do a read(8K)
> and get the first half/second half mixup and then on a subsequent
> read(8K) we see that *again*, implying that somehow the kernel's copy
> has the latter-half of the page updated consistently but not the first
> half.  That's a problem that I haven't got a solution to today.  I'd
> love to have a guarantee that it's not possible- we've certainly never
> seen it but it's been a concern and I thought Michael was suggesting
> he'd seen that, 

Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-17 Thread Michael Paquier
On Mon, Sep 17, 2018 at 07:38:24PM -0400, Tom Lane wrote:
> So we seem to be out of the woods in terms of 0ba06e0bf breaking the
> regression tests, but I'm not very happy about the whole thing, because
> that patch wasn't supposed to change the behavior of open/fopen in any
> way other than allowing concurrent file access.  Obviously, it did.

Thanks!  I can see that as well.

> After looking at src/port/open.c a bit, it seems like the problem is
> that our fopen() requires a nonstandard "t" option to select text mode.
> I'd always supposed that you got binary mode with "b" and text mode
> otherwise; is there some third possibility on Windows?

There is a sort of automatic mode...  Please see below.

> Anyway, I'm inclined to propose that we ought to do something like
> the attached in HEAD and v11 in order to reduce the risk that there
> are other unexpected behavioral changes.  This should also let us
> revert 0ba06e0bf's change in initdb.c.
> 
> I wonder whether pgwin32_open() also ought to enforce that either
> O_BINARY or O_TEXT mode gets selected.

There is no need to go down that road I think, a lot of code paths
already call setmode to make sure that binary or text modes are used.

> diff --git a/src/port/open.c b/src/port/open.c
> index a3ad946..85ab06e 100644
> --- a/src/port/open.c
> +++ b/src/port/open.c
> @@ -154,7 +154,7 @@ pgwin32_fopen(const char *fileName, const char *mode)
>  
>   if (strchr(mode, 'b'))
>   openmode |= O_BINARY;
> - if (strchr(mode, 't'))
> + else
>   openmode |= O_TEXT;

Hm.  I don't think that this is correct either.  The whole logic behind
the mode selected depends on what setmode() is being set to.  Please see
here:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode?view=vs-2017

And the point is that if neither 'b' nor 't' are set, then open() would
use the value set by the process, which is 't' by default if not set.
And initdb does not set that, so it would use 't'.  miscinit.c sets it
to 'b', and your patch would likely cause some backend code to be
broken.  So it seems to me that if 'b' or 't' are not defined by the
caller, then we ought to use what get_fmode() returns:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/get-fmode?view=vs-2017

What I think I broke is that CreateFile ignores what _fmode uses, which
has caused the breakage, while calling directly open() or fopen() does
the work.  There are also other things assuming that binary mode is
used, you can grep for "setmode" and see how miscinit.c or pg_basebackup
do the job.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-17 Thread Tom Lane
So we seem to be out of the woods in terms of 0ba06e0bf breaking the
regression tests, but I'm not very happy about the whole thing, because
that patch wasn't supposed to change the behavior of open/fopen in any
way other than allowing concurrent file access.  Obviously, it did.

After looking at src/port/open.c a bit, it seems like the problem is
that our fopen() requires a nonstandard "t" option to select text mode.
I'd always supposed that you got binary mode with "b" and text mode
otherwise; is there some third possibility on Windows?

Anyway, I'm inclined to propose that we ought to do something like
the attached in HEAD and v11 in order to reduce the risk that there
are other unexpected behavioral changes.  This should also let us
revert 0ba06e0bf's change in initdb.c.

I wonder whether pgwin32_open() also ought to enforce that either
O_BINARY or O_TEXT mode gets selected.

regards, tom lane

diff --git a/src/port/open.c b/src/port/open.c
index a3ad946..85ab06e 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -154,7 +154,7 @@ pgwin32_fopen(const char *fileName, const char *mode)
 
 	if (strchr(mode, 'b'))
 		openmode |= O_BINARY;
-	if (strchr(mode, 't'))
+	else
 		openmode |= O_TEXT;
 
 	fd = pgwin32_open(fileName, openmode);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index b53d6eb..cb8c745 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -491,15 +491,7 @@ readfile(const char *path)
 	char	   *buffer;
 	int			c;
 
-#ifdef WIN32
-	/*
-	 * On Windows, we have to open the file in text mode so that carriage
-	 * returns are stripped.
-	 */
-	if ((infile = fopen(path, "rt")) == NULL)
-#else
 	if ((infile = fopen(path, "r")) == NULL)
-#endif
 	{
 		fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
 progname, path, strerror(errno));


Re: infinite loop in parallel hash joins / DSA / get_best_segment

2018-09-17 Thread Thomas Munro
On Mon, Sep 17, 2018 at 9:12 PM Thomas Munro
 wrote:
> On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro
>  wrote:
> > On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
> >  wrote:
> > > While performing some benchmarks on REL_11_STABLE (at 55c2d9), I've
> > > repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
> > > what exactly are the triggering conditions, but the symptoms are these:
> > >
> > > ...
>
> With the attached draft patch, Tomas's benchmark script runs happily
> for long periods.  A bit more study required with fresh eyes,
> tomorrow.

Here is a tidied up version that I propose to back-patch to 10 & 11.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Fix-segment_bins-corruption-in-dsa.c.patch
Description: Binary data


Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 09/17/2018 07:35 PM, Stephen Frost wrote:
> > On Mon, Sep 17, 2018 at 13:20 Tomas Vondra wrote:
> > Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> > 
> > How? Either it’s possible for the latter half of a page to be updated
> > before the first half (where the LSN lives), or it isn’t. If it’s
> > possible then that LSN could be ancient and it wouldn’t matter. 
> 
> I'm not sure I understand what you're saying here.
> 
> It is not about the latter half of the page being updated before the first
> half. I don't think that's quite possible, because write() into page cache
> does in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

> The problem is that the write is not atomic, and AFAIK it happens in
> sectors (which are either 512B or 4K these days). And it may arbitrarily
> interleave with reads.

Yes, of course the write isn't atomic, that's clear.

> So you may do write(8k), but it actually happens in 512B chunks and a
> concurrent read may observe some mix of those.

Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.

> But the trick is that if the read sees the effect of the write somewhere
> in the middle of the page, the next read is guaranteed to see all the
> preceding new data.

If that's guaranteed then we can just check the LSN and be done.

> Without the checkpoint we risk seeing the same write() both in read and
> re-read, just in a different stage - so the LSN would not change, making
> the check futile.

This is the part that isn't making much sense to me.  If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and a 8K write
happened on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half.  That's a problem that I haven't got a solution to today.  I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

> But by waiting for the checkpoint we know that the original write is no
> longer in progress, so if we saw a partial write we're guaranteed to see
> a new LSN on re-read.
> 
> This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).

> > The question is if it’s possible to catch a torn page where the second
> > half is updated *before* the first half of the page in a read (and then
> > in subsequent reads having that state be maintained).  I have some
> > skepticism that it’s really possible to happen in the first place but
> > having an interrupted system call be stalled across two more system
> > calls just seems terribly unlikely, and this is all based on the
> > assumption that the kernel might write the second half of a write before
> > the first to the kernel cache in the first place.
> 
> Yes, if that was possible, the explanation about the checkpoint fsync
> guarantee 

Re: Query is over 2x slower with jit=on

2018-09-17 Thread Tom Lane
Amit Khandekar  writes:
> On 11 September 2018 at 14:50, Amit Khandekar  wrote:
>> On 10 September 2018 at 21:39, Andres Freund  wrote:
>>> On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
 I think we better show per-worker jit info also.

Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.

regards, tom lane



Re: Strange OSX make check-world failure

2018-09-17 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Sep 18, 2018 at 2:14 AM Tom Lane  wrote:
>> "make check" generally won't work on OSX unless you've disabled SIP:
>> https://www.howtogeek.com/230424/how-to-disable-system-integrity-protection-on-a-mac-and-why-you-shouldnt/

> Aha!  It looks like it was important to run "make install" before
> running those tests.

Right.  If you don't want to disable SIP, you can work around it by always
doing "make install" before "make check".  Kind of a PITA though.

regards, tom lane



Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra
On 09/17/2018 07:35 PM, Stephen Frost wrote:
> Greetings,
> 
> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra wrote:
> 
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> >>> Ok, good, though I'm not sure what you mean by 'eliminates the
> >>> consistency guarantees provided by the checkpoint'.  The point
> is that
> >>> the page will be in the WAL and the WAL will be replayed during the
> >>> restore of the backup.
> >>
> >> The checkpoint guarantees that the whole page was written and
> flushed to
> >> disk with an LSN before the ckeckpoint LSN. So when you read a
> page with
> >> that LSN, you know the whole write already completed and a read won't
> >> return data from before the LSN.
> >
> > Well, you know that the first part was written out at some prior
> point,
> > but you could end up reading the first part of a page with an
> older LSN
> > while also reading the second part with new data.
> 
> 
> 
> Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> 
> 
> How? Either it’s possible for the latter half of a page to be updated
> before the first half (where the LSN lives), or it isn’t. If it’s
> possible then that LSN could be ancient and it wouldn’t matter. 
> 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the first
half. I don't think that's quite possible, because write() into page cache
does in fact write the data sequentially.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.


> >> Without the checkpoint that's not guaranteed, and simply
> re-reading the
> >> page and rechecking it vs. the first read does not help:
> >>
> >> 1) write the first 512B of the page (sector), which includes the LSN
> >>
> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> >>
> >> 3) the checksum verification fails
> >>
> >> 4) read the page again (possibly reading a bit more new data)
> >>
> >> 5) the LSN did not change compared to the first read, yet the
> checksum
> >> still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way
> through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
> 
> 
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
> 
> 
> The question is if it’s possible to catch a torn page where the second
> half is updated *before* the first half of the page in a read (and then
> in subsequent reads having that state be maintained).  I have some
> skepticism that it’s really possible to happen in the first place but
> having an interrupted system call be stalled across two more system
> calls just seems terribly unlikely, and this is all based on the
> assumption that the kernel might write the second half of a write before
> the first to the kernel cache in the first place.
> 

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

[old,new,old]

but then the re-read you should be guaranteed to see new data up until
the last "new" chunk:

[new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing 

Re: Strange OSX make check-world failure

2018-09-17 Thread Thomas Munro
On Tue, Sep 18, 2018 at 2:14 AM Tom Lane  wrote:
> Chris Travers  writes:
> > Logs are below.  This happens on master, and on 10.  I suspect it is an
> > issue with something regarding ecpg.  Wondering what I am doing wrong.
>
> "make check" generally won't work on OSX unless you've disabled SIP:
>
> https://www.howtogeek.com/230424/how-to-disable-system-integrity-protection-on-a-mac-and-why-you-shouldnt/
>
> That might not be the issue --- I'd have rather expected a failure
> sooner --- but it's worth checking.

$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.4
BuildVersion: 17E199
$ csrutil status
System Integrity Protection status: enabled.
$ make -s -C src/interfaces/ecpg check
... snip ...
==
 All 58 tests passed.
==

Hmm... why does this work for me... let's see where it gets libraries from:

$ otool -L src/interfaces/ecpg/test/pgtypeslib/dt_test
src/interfaces/ecpg/test/pgtypeslib/dt_test:
/Users/munro/install/postgres2/lib/libecpg.6.dylib (compatibility
version 6.0.0, current version 6.12.0)
/Users/munro/install/postgres2/lib/libpgtypes.3.dylib (compatibility
version 3.0.0, current version 3.12.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1252.50.4)

Aha!  It looks like it was important to run "make install" before
running those tests.  Let's see what happens if I remove those
libraries and try again:

...
test pgtypeslib/dt_test   ... stdout stderr FAILED (test
process was terminated by signal 6: Abort trap)
...

Same result as Chris.

--
Thomas Munro
http://www.enterprisedb.com



Re: Collation versioning

2018-09-17 Thread Thomas Munro
On Mon, Sep 17, 2018 at 9:02 AM Stephen Frost  wrote:
> * Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> > On Mon, Sep 17, 2018 at 6:13 AM Douglas Doole  wrote:
> > > On Sun, Sep 16, 2018 at 1:20 AM Thomas Munro 
> > >  wrote:
> > >> 3.  Fix the tracking of when reindexes need to be rebuilt, so that you
> > >> can't get it wrong (as you're alluding to above).
> > >
> > > I've mentioned this in the past, but didn't seem to get any traction, so 
> > > I'll try it again ;-)
> >
> > Probably because we agree with you, but don't have all the answers :-)
>
> Agreed.
>
> > > The focus on indexes when a collation changes is, in my opinion, the 
> > > least of the problems. You definitely have to worry about indexes, but 
> > > they can be easily rebuilt. What about other places where collation is 
> > > hardened into the system, such as constraints?
> >
> > We have to start somewhere and indexes are the first thing that people
> > notice, and are much likely to actually be a problem (personally I've
> > encountered many cases of index corruption due to collation changes in
> > the wild, but never a constraint corruption, though I fully understand
> > the theoretical concern).  Several of us have observed specifically
> > that the same problems apply to CHECK constraints and PARTITION
> > boundaries, and there may be other things like that.  You could
> > imagine tracking collation dependencies on those, requiring a RECHECK
> > or REPARTITION operation to update them after a depended-on collation
> > version changes.
> >
> > Perhaps that suggests that there should be a more general way to store
> > collation dependencies -- something more like pg_depend, rather than
> > bolting something like indcollversion onto indexes and every other
> > kind of catalog that might need it.  I don't know.
>
> Agreed.  If we start thinking about pg_depend then maybe we realize
> that this all comes back to pg_attribute as the holder of the
> column-level information and maybe what we should be thinking about is a
> way to encode version information into the typmod for text-based
> types...

So to be more concrete: pg_depend could have a new column
"refobjversion".  Whenever indexes are created or rebuilt, we'd
capture the current version string in the pg_depend rows that link
index attributes and collations.  Then we'd compare those against the
current value when we first open an index and complain if they don't
match.  (In this model there would be no "collversion" column in the
pg_collation catalog.)

That'd leave a place for other kinds of database objects (CHECKs,
PARTITIONS, ...) to store their version dependency, if someone later
wants to add support for that.

I'm not sure if my idea about updating the default collation row in
newly created databases has legs though.  Any thoughts on that?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Collation versioning

2018-09-17 Thread Douglas Doole
On Mon, Sep 17, 2018 at 12:32 PM Greg Stark  wrote:

> This seems like a terrible idea in the open source world. Surely collation
> versioning means new ICU libraries can still provide the old collation
> rules so even if you update the library you can request the old version? We
> shouldn't need that actual old code with all its security holes and bugs
> just to get the old collation version.
>

We asked long and hard for this feature from the ICU team but they kept
arguing it was too hard to do. There are apparently some tight couplings
between the code and each version of CLDR. So the only way to support old
collations is to ship the entire old library. (They even added make rules
to allow the entire API to be version extended to accommodate this
requirement.)

Even bug fixes are potentially problematic because the fix may alter how
some code points collate. The ICU team won't (or at least wouldn't - been a
few years since I dealt with them) guarantee any sort of backwards
compatibility between code drops.

As an aside, they did look at making the CLDR data a separate data file
that could be read by any version of the code (before finding there were
too many dependencies). One thing that they discovered is that this
approach didn't save much disk since the CLDR data is something like 90-95%
of the library. So while it would have made the calling code a lot cleaner,
it wasn't the huge footprint win we'd been hoping for.


Re: [HACKERS] proposal: schema variables

2018-09-17 Thread Pavel Stehule
On Sat, Sep 15, 2018 at 18:06, Pavel Stehule wrote:

>
>
>
>
>> The code is cleaner now, there are more tests, and the documentation is
>> mostly complete. I am sorry - my English is not good.
>> New features:
>>
>> o ON COMMIT DROP and ON TRANSACTION END RESET -- remove temp variable on
>> commit, reset variable on transaction end (commit, rollback)
>> o LET var = DEFAULT -- reset specified variable
>>
>>
> fix some forgotten warnings and dependency issue
> few more tests
>
>
new update:

o support for NOT NULL check
o implementation of limited transaction variables - these variables don't
respect subtransactions (this is much more complex); DROP VARIABLE drops the
content although the drop can be reverted (maybe this limit will be
removed).

CREATE TRANSACTION VARIABLE fx AS int;

LET fx = 10;
BEGIN;
  LET fx = 20;
ROLLBACK;

SELECT fx;

Regards

Pavel


> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>>
>>> Regards,
>>> Dean
>>>
>>


schema-variables-20180917-01.patch.gz
Description: application/gzip


Re: Collation versioning

2018-09-17 Thread Greg Stark
On Mon 17 Sep 2018, 13:02 Douglas Doole,  wrote:

> On Sun, Sep 16, 2018 at 1:14 PM Thomas Munro <
> thomas.mu...@enterprisedb.com> wrote:
>
>> We have to start somewhere and indexes are the first thing that people
>> notice, and are much more likely to actually be a problem (personally I've
>> encountered many cases of index corruption due to collation changes in
>> the wild, but never a constraint corruption,
>
>
> Problems that people notice are good - it's the ones that they don't
> notice that cause real grief ;-)
> ...
>
>> Once you get into downstream effects of changes (whether they are
>> recorded in the database or elsewhere), I think it's basically beyond
>> our event horizon.
>
> ...
>
>> I think we should make a clear distinction between
>> things that invalidate the correct working of the database, and more
>> nebulous effects that we can't possibly track in general.
>>
>
> I agree that PostgreSQL can't resolve the downstream changes, but that's a
> very subtle distinction. As a user, if an upgrade caused my data to no
> longer comply with my carefully architected and database enforced BI rules,
> I would definitely argue that the correct working of the database has been
> invalidated. (You can make technical arguments about the OS upgrading the
> library, but the fundamental issue is that my PostgreSQL database is
> broken.)
>

Well, things like partition exclusion and even join elimination depend on
constraints being consistent, so I don't think it's as easy to write off as
that either.

But it's true that we have to start somewhere and collation changes are
much more likely to be spotted in indexes and cause much more visible
issues.


> You can probably argue that PostgreSQL and DB2 users look at the world
> differently, but that's why DB2 ended up shipping its own copies of the ICU
> library. Once a user creates an object using ICU vX, we made sure that
> version of the library was always available to avoid these problems.
>

This seems like a terrible idea in the open source world. Surely collation
versioning means new ICU libraries can still provide the old collation
rules so even if you update the library you can request the old version? We
shouldn't need that actual old code with all its security holes and bugs
just to get the old collation version.


Re: Stored procedures and out parameters

2018-09-17 Thread Pavel Stehule
po 17. 9. 2018 v 18:24 odesílatel Tom Lane  napsal:

> "Jonathan S. Katz"  writes:
> > Just to chime in real quick: from the perspective of the RMT we did not
> > look at these as a series of "right/wrong" options but what would make
> > the most sense for the v11 release so the community could continue to
> > improve support for stored procedures and make it easier for users to
> > work with them going forward.
>
> Yeah.  I think the real options we had here were:
>
> 1. Ship v11 with CALL as it stands, try to improve going forward.
> 2. Delay v11 for months while we figure out something better.
> 3. Remove CALL from v11, try again in v12.
>
> Neither #2 nor #3 are at all attractive.
>
> People will need to realize that CALL behavior is still a work in
> progress.  That may mean that JDBC etc shouldn't try to support it
> yet, which isn't great, but none of the above options would have
> led to near-term support of CALL in JDBC ...
>

+1

Pavel


> regards, tom lane
>
>


Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

On Mon, Sep 17, 2018 at 13:38 Michael Banck 
wrote:

> so, trying some intermediate summary here, sorry for (also) top-posting:
>
> 1. the basebackup checksum verification logic only checks pages not
> changed since the checkpoint, which makes sense for the basebackup.


Right. I’m tending towards the idea that this also be adopted for
pg_verify_checksums.

> 2. However, it would be desirable to go further for pg_verify_checksums
> and (re-)check all pages.


Maybe.  I’m not entirely convinced that it’s all that useful.

> 3. pg_verify_checksums should read the checkpoint LSN on startup and
> compare the page LSN against it on re-read, and discard pages which have
> checksum failures but are new. (Maybe it should read new checkpoint LSNs
> as they come in during its runtime as well? See below).


I’m not sure that we really need to, but I’m not against it either.  In that
case, though, you’re definitely going to see checksum failures on torn pages.

> 4. The pg_sleep should go.


I know that pgbackrest does not have a sleep currently and we’ve not yet
seen or been able to reproduce this case where, on a reread, we still see
an older LSN, but we check the LSN first also.  If it’s possible that the
LSN still hasn’t changed on the reread then maybe we do need to have a
sleep to force ourselves off CPU to allow the other process to finish
writing, or maybe finish the file and come back around to these pages
later, but we have yet to see this behavior in the wild anywhere, nor have
we been able to reproduce it.

> 5. There seems to be no consensus on whether the number of skipped pages
> should be summarized at the end.


I agree with printing the number of skipped pages; that does seem like a
nice-to-have.  I don’t know that actually printing the pages themselves is
all that useful, though.

> Further comments:
>
> Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> > On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > > page and rechecking it vs. the first read does not help:
> > > >
> > > > 1) write the first 512B of the page (sector), which includes the LSN
> > > >
> > > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > > >
> > > > 3) the checksum verification fails
> > > >
> > > > 4) read the page again (possibly reading a bit more new data)
> > > >
> > > > 5) the LSN did not change compared to the first read, yet the checksum
> > > > still fails
> > >
> > > So, I agree with all of the above though I've found it to be extremely
> > > rare to get a single read which you've managed to catch part-way through
> > > a write, getting multiple of them over a period of time strikes me as
> > > even more unlikely.  Still, if we can come up with a solution to solve
> > > all of this, great, but I'm not sure that I'm hearing one.
> >
> > I don't recall claiming catching many such torn pages - I'm sure it's
> > not very common in most workloads. But I suspect constructing workloads
> > hitting them regularly is not very difficult either (something with a
> > lot of churn in shared buffers should do the trick).
> >
> > > > > Sure, because we don't care about it any longer- that page isn't
> > > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > > something like: check the checksum, if it failed then check if the LSN
> > > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > > LSN is the same then throw an error.
> > > >
> > > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > > >
> > > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> > >
> > > That seems like it's leaving something on the table, but, to be fair, we
> > > know that all of those pages should be rewritten by WAL anyway so they
> > > aren't all that interesting to us, particularly in the basebackup case.
> >
> > Yep.
>
> Right, see point 1 above.
>
> > > > > I actually tend to disagree with you that, for this purpose, it's
> > > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > > changed and everything is operating correctly then the new LSN must be
> > > > > more recent than the last checkpoint location or things are broken
> > > > > badly.
> > > >
> > > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > > >
> > > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > > change between the two reads, but that's not a guarantee the page was
> > > > not torn. The example I posted earlier in this message illustrates that.
> > >
> > > I agree that there's some risk there, but it's certainly much less
> > > likely.
> >
> > Well. If 

Re: Online verification of checksums

2018-09-17 Thread Michael Banck
Hi,

so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

4. The pg_sleep should go.

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:
> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > > Without the checkpoint that's not guaranteed, and simply re-reading the
> > > page and rechecking it vs. the first read does not help:
> > > 
> > > 1) write the first 512B of the page (sector), which includes the LSN
> > > 
> > > 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> > > 
> > > 3) the checksum verification fails
> > > 
> > > 4) read the page again (possibly reading a bit more new data)
> > > 
> > > 5) the LSN did not change compared to the first read, yet the checksum
> > > still fails
> > 
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.
> 
> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).
> 
> > > > Sure, because we don't care about it any longer- that page isn't
> > > > interesting because the WAL will replay over it.  IIRC it actually goes
> > > > something like: check the checksum, if it failed then check if the LSN
> > > > is greater than the checkpoint (of the backup start..), if not, then
> > > > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > > > LSN is the same then throw an error.
> > > 
> > > Nope, we only verify the checksum if its LSN precedes the checkpoint:
> > > 
> > > https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> > 
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
> 
> Yep.

Right, see point 1 above.

> > > > I actually tend to disagree with you that, for this purpose, it's
> > > > actually necessary to check against the checkpoint LSN- if the LSN
> > > > changed and everything is operating correctly then the new LSN must be
> > > > more recent than the last checkpoint location or things are broken
> > > > badly.
> > > 
> > > I don't follow. Are you suggesting we don't need the checkpoint LSN?
> > > 
> > > I'm pretty sure that's not the case. The thing is - the LSN may not
> > > change between the two reads, but that's not a guarantee the page was
> > > not torn. The example I posted earlier in this message illustrates that.
> > 
> > I agree that there's some risk there, but it's certainly much less
> > likely.
> 
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.

I agree.

> > > > Now, that said, I do think it's a good *idea* to check against the
> > > > checkpoint LSN (presuming this is for online checking of checksums- for
> > > > basebackup, we could just check against the backup-start LSN as anything
> > > > after that point will be rewritten by WAL anyway).  The reason that I
> > > > think it's a good idea to check against the checkpoint LSN is that we'd
> > > > want to throw a big warning if the kernel is just feeding us random
> > > > garbage on reads and only finding a difference between two reads isn't
> > > > really doing any kind of validation, whereas checking against the
> > > > checkpoint-LSN would at least give us some idea that the value being
> > > > read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than it is worth, so I would prefer not to do this.

> > > > When it comes to if the 

Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra 
wrote:

> On 09/17/2018 07:11 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> >>> Ok, good, though I'm not sure what you mean by 'eliminates the
> >>> consistency guarantees provided by the checkpoint'.  The point is that
> >>> the page will be in the WAL and the WAL will be replayed during the
> >>> restore of the backup.
> >>
> >> The checkpoint guarantees that the whole page was written and flushed to
> >> disk with an LSN before the checkpoint LSN. So when you read a page with
> >> that LSN, you know the whole write already completed and a read won't
> >> return data from before the LSN.
> >
> > Well, you know that the first part was written out at some prior point,
> > but you could end up reading the first part of a page with an older LSN
> > while also reading the second part with new data.



> Doesn't the checkpoint fsync pretty much guarantee this can't happen?


How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s possible
then that LSN could be ancient and it wouldn’t matter.

> >> Without the checkpoint that's not guaranteed, and simply re-reading the
> >> page and rechecking it vs. the first read does not help:
> >>
> >> 1) write the first 512B of the page (sector), which includes the LSN
> >>
> >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> >>
> >> 3) the checksum verification fails
> >>
> >> 4) read the page again (possibly reading a bit more new data)
> >>
> >> 5) the LSN did not change compared to the first read, yet the checksum
> >> still fails
> >
> > So, I agree with all of the above though I've found it to be extremely
> > rare to get a single read which you've managed to catch part-way through
> > a write, getting multiple of them over a period of time strikes me as
> > even more unlikely.  Still, if we can come up with a solution to solve
> > all of this, great, but I'm not sure that I'm hearing one.


> I don't recall claiming catching many such torn pages - I'm sure it's
> not very common in most workloads. But I suspect constructing workloads
> hitting them regularly is not very difficult either (something with a
> lot of churn in shared buffers should do the trick).


The question is if it’s possible to catch a torn page where the second half
is updated *before* the first half of the page in a read (and then in
subsequent reads having that state be maintained).  I have some skepticism
that it’s really possible to happen in the first place but having an
interrupted system call be stalled across two more system calls just seems
terribly unlikely, and this is all based on the assumption that the kernel
might write the second half of a write before the first to the kernel cache
in the first place.

> >>> Sure, because we don't care about it any longer- that page isn't
> >>> interesting because the WAL will replay over it.  IIRC it actually goes
> >>> something like: check the checksum, if it failed then check if the LSN
> >>> is greater than the checkpoint (of the backup start..), if not, then
> >>> re-read, if the LSN is now newer than the checkpoint then skip, if the
> >>> LSN is the same then throw an error.
> >>
> >> Nope, we only verify the checksum if its LSN precedes the checkpoint:
> >>
> >> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> >
> > That seems like it's leaving something on the table, but, to be fair, we
> > know that all of those pages should be rewritten by WAL anyway so they
> > aren't all that interesting to us, particularly in the basebackup case.
> >
>
> Yep.
>
> >>> I actually tend to disagree with you that, for this purpose, it's
> >>> actually necessary to check against the checkpoint LSN- if the LSN
> >>> changed and everything is operating correctly then the new LSN must be
> >>> more recent than the last checkpoint location or things are broken
> >>> badly.
> >>
> >> I don't follow. Are you suggesting we don't need the checkpoint LSN?
> >>
> >> I'm pretty sure that's not the case. The thing is - the LSN may not
> >> change between the two reads, but that's not a guarantee the page was
> >> not torn. The example I posted earlier in this message illustrates that.
> >
> > I agree that there's some risk there, but it's certainly much less
> > likely.
> >
>
> Well. If we're going to report a checksum failure, we better be sure it
> actually is a broken page. I don't want users to start chasing bogus
> data corruption issues.


Yes, I definitely agree that we don’t want to mis-report checksum failures
if we can avoid it.

> >>> Now, that said, I do think it's a good *idea* to check against the
> >>> checkpoint LSN (presuming this is for online checking of checksums- for
> >>> basebackup, we could just check against the backup-start LSN as anything
> 

Re: Code of Conduct plan

2018-09-17 Thread Dimitri Maziuk
On 09/17/2018 10:39 AM, Chris Travers wrote:
> On Mon, Sep 17, 2018 at 5:28 PM Joshua D. Drake 
> wrote:
...
>> My feedback is that those two sentences provide an overarching authority
>> that .Org does not have the right to enforce 
...
> Fascinating that this would, on its face, not apply to a harassment
> campaign carried out over twitter, but it would apply to a few comments
> made over drinks at a bar.

There is a flip side: if you have written standards, you can be held
liable for not enforcing them. Potentially including enforcement of
twitbook AUP on the list subscribers who also have a slackogger account.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra
On 09/17/2018 07:11 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>>> Ok, good, though I'm not sure what you mean by 'eliminates the
>>> consistency guarantees provided by the checkpoint'.  The point is that
>>> the page will be in the WAL and the WAL will be replayed during the
>>> restore of the backup.
>>
>> The checkpoint guarantees that the whole page was written and flushed to
>> disk with an LSN before the checkpoint LSN. So when you read a page with
>> that LSN, you know the whole write already completed and a read won't
>> return data from before the LSN.
> 
> Well, you know that the first part was written out at some prior point,
> but you could end up reading the first part of a page with an older LSN
> while also reading the second part with new data.
> 

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

>> Without the checkpoint that's not guaranteed, and simply re-reading the
>> page and rechecking it vs. the first read does not help:
>>
>> 1) write the first 512B of the page (sector), which includes the LSN
>>
>> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>>
>> 3) the checksum verification fails
>>
>> 4) read the page again (possibly reading a bit more new data)
>>
>> 5) the LSN did not change compared to the first read, yet the checksum
>> still fails
> 
> So, I agree with all of the above though I've found it to be extremely
> rare to get a single read which you've managed to catch part-way through
> a write, getting multiple of them over a period of time strikes me as
> even more unlikely.  Still, if we can come up with a solution to solve
> all of this, great, but I'm not sure that I'm hearing one.
> 

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

>>> Sure, because we don't care about it any longer- that page isn't
>>> interesting because the WAL will replay over it.  IIRC it actually goes
>>> something like: check the checksum, if it failed then check if the LSN
>>> is greater than the checkpoint (of the backup start..), if not, then
>>> re-read, if the LSN is now newer than the checkpoint then skip, if the
>>> LSN is the same then throw an error.
>>
>> Nope, we only verify the checksum if its LSN precedes the checkpoint:
>>
>> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
> 
> That seems like it's leaving something on the table, but, to be fair, we
> know that all of those pages should be rewritten by WAL anyway so they
> aren't all that interesting to us, particularly in the basebackup case.
> 

Yep.

>>> I actually tend to disagree with you that, for this purpose, it's
>>> actually necessary to check against the checkpoint LSN- if the LSN
>>> changed and everything is operating correctly then the new LSN must be
>>> more recent than the last checkpoint location or things are broken
>>> badly.
>>
>> I don't follow. Are you suggesting we don't need the checkpoint LSN?
>>
>> I'm pretty sure that's not the case. The thing is - the LSN may not
>> change between the two reads, but that's not a guarantee the page was
>> not torn. The example I posted earlier in this message illustrates that.
> 
> I agree that there's some risk there, but it's certainly much less
> likely.
> 

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

>>> Now, that said, I do think it's a good *idea* to check against the
>>> checkpoint LSN (presuming this is for online checking of checksums- for
>>> basebackup, we could just check against the backup-start LSN as anything
>>> after that point will be rewritten by WAL anyway).  The reason that I
>>> think it's a good idea to check against the checkpoint LSN is that we'd
>>> want to throw a big warning if the kernel is just feeding us random
>>> garbage on reads and only finding a difference between two reads isn't
>>> really doing any kind of validation, whereas checking against the
>>> checkpoint-LSN would at least give us some idea that the value being
>>> read isn't completely ridiculous.
>>>
>>> When it comes to if the pg_sleep() is necessary or not, I have to admit
>>> to being unsure about that..  I could see how it might be but it seems a
>>> bit surprising- I'd probably want to see exactly what the page was at
>>> the time of the failure and at the time of the second (no-sleep) re-read
>>> and then after a delay and convince myself that it was just an unlucky
>>> case of being scheduled in twice to read that page before the process
>>> writing it out got a chance to finish the write.
>>
>> I think the pg_sleep() is a pretty strong sign 

Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 09/17/2018 06:42 PM, Stephen Frost wrote:
> > Ok, good, though I'm not sure what you mean by 'eliminates the
> > consistency guarantees provided by the checkpoint'.  The point is that
> > the page will be in the WAL and the WAL will be replayed during the
> > restore of the backup.
> 
> The checkpoint guarantees that the whole page was written and flushed to
> disk with an LSN before the checkpoint LSN. So when you read a page with
> that LSN, you know the whole write already completed and a read won't
> return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

> Without the checkpoint that's not guaranteed, and simply re-reading the
> page and rechecking it vs. the first read does not help:
> 
> 1) write the first 512B of the page (sector), which includes the LSN
> 
> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
> 
> 3) the checksum verification fails
> 
> 4) read the page again (possibly reading a bit more new data)
> 
> 5) the LSN did not change compared to the first read, yet the checksum
> still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely.  Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

> > Sure, because we don't care about it any longer- that page isn't
> > interesting because the WAL will replay over it.  IIRC it actually goes
> > something like: check the checksum, if it failed then check if the LSN
> > is greater than the checkpoint (of the backup start..), if not, then
> > re-read, if the LSN is now newer than the checkpoint then skip, if the
> > LSN is the same then throw an error.
> 
> Nope, we only verify the checksum if its LSN precedes the checkpoint:
> 
> https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

> > I actually tend to disagree with you that, for this purpose, it's
> > actually necessary to check against the checkpoint LSN- if the LSN
> > changed and everything is operating correctly then the new LSN must be
> > more recent than the last checkpoint location or things are broken
> > badly.
> 
> I don't follow. Are you suggesting we don't need the checkpoint LSN?
> 
> I'm pretty sure that's not the case. The thing is - the LSN may not
> change between the two reads, but that's not a guarantee the page was
> not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

> > Now, that said, I do think it's a good *idea* to check against the
> > checkpoint LSN (presuming this is for online checking of checksums- for
> > basebackup, we could just check against the backup-start LSN as anything
> > after that point will be rewritten by WAL anyway).  The reason that I
> > think it's a good idea to check against the checkpoint LSN is that we'd
> > want to throw a big warning if the kernel is just feeding us random
> > garbage on reads and only finding a difference between two reads isn't
> > really doing any kind of validation, whereas checking against the
> > checkpoint-LSN would at least give us some idea that the value being
> > read isn't completely ridiculous.
> > 
> > When it comes to if the pg_sleep() is necessary or not, I have to admit
> > to being unsure about that..  I could see how it might be but it seems a
> > bit surprising- I'd probably want to see exactly what the page was at
> > the time of the failure and at the time of the second (no-sleep) re-read
> > and then after a delay and convince myself that it was just an unlucky
> > case of being scheduled in twice to read that page before the process
> > writing it out got a chance to finish the write.
> 
> I think the pg_sleep() is a pretty strong sign there's something broken.
> At the very least, it's likely to misbehave on machines with different
> timings, machines under memory and/or memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that 

Re: Collation versioning

2018-09-17 Thread Douglas Doole
On Sun, Sep 16, 2018 at 1:14 PM Thomas Munro 
wrote:

> We have to start somewhere and indexes are the first thing that people
> notice, and are much more likely to actually be a problem (personally I've
> encountered many cases of index corruption due to collation changes in
> the wild, but never a constraint corruption,


Problems that people notice are good - it's the ones that they don't notice
that cause real grief ;-)

Given that PostgreSQL requires equal values to be binary identical I think
that you'll avoid a lot of the problems that caused me so much trouble in
DB2. Even if someone creates a range constraint or partitioned table, the
boundary values aren't typically going to be impacted (my French example
was a little contrived). That said, there have been a few changes that
would have much more obvious impacts. One language stopped treating 'ch' as
a single collating element that came somewhere after 'c'+'z' and treated it
as 'c'+'h' instead. And one of the North Germanic languages moved an
accented character from after 'z' to before 'z' (or maybe it was vice versa
- I miss my library of Unicode presentations).

> Once you get into downstream effects of changes (whether they are
> recorded in the database or elsewhere), I think it's basically beyond
> our event horizon.

...

> I think we should make a clear distinction between
> things that invalidate the correct working of the database, and more
> nebulous effects that we can't possibly track in general.
>

I agree that PostgreSQL can't resolve the downstream changes, but that's a
very subtle distinction. As a user, if an upgrade caused my data to no
longer comply with my carefully architected and database enforced BI rules,
I would definitely argue that the correct working of the database has been
invalidated. (You can make technical arguments about the OS upgrading the
library, but the fundamental issue is that my PostgreSQL database is
broken.)

You can probably argue that PostgreSQL and DB2 users look at the world
differently, but that's why DB2 ended up shipping its own copies of the ICU
library. Once a user creates an object using ICU vX, we made sure that
version of the library was always available to avoid these problems. (The
libraries were on a private path with non-standard names so there was no
collision with the OS library. Fortunately I'd moved on before anyone
really started complaining about why 5 copies of ICU were being installed
when they weren't even using Unicode.)


> [1]
> https://www.postgresql.org/message-id/CAEepm%3D30SQpEUjau%3DdScuNeVZaK2kJ6QQDCHF75u5W%3DCz%3D3Scw%40mail.gmail.com


I'd missed this post. In it you asked:

I wonder how it manages to deal with fr_CA's reversed secondary weighting
rule which requires you to consider diacritics in reverse order --
apparently abandoned in France but still used in Canada -- using a fixed
size space for state between calls.

ucol_nextSortKeyPart() only keeps track of the current position in the
generated sort key as its state. So, if you call it multiple times to
generate the sort key piecemeal, it recomputes the entire sort key until it
has enough bytes to satisfy your request. (That is, if you're doing 4 bytes
at a time, on the first call it generates and returns bytes 1-4. On the
second call it generates 1-8 and returns 5-8. Next it generates 1-12 and
returns 9-12.) Needless to say, this gets very expensive very fast.
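
For anyone who has not used that API, the call pattern is roughly this (a
standalone sketch based on the description above and the ICU C API as I
recall it; the locale, input string and 4-byte chunk size are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uiter.h>

int
main(void)
{
    UErrorCode      status = U_ZERO_ERROR;
    UCollator      *coll = ucol_open("fr-CA", &status);
    UCharIterator   iter;
    uint32_t        state[2] = {0, 0};
    uint8_t         chunk[4];
    int32_t         n;
    int             calls = 0;

    if (U_FAILURE(status))
        return 1;

    uiter_setUTF8(&iter, "cote, coté, côte, côté", -1);

    /*
     * Each call returns the next (up to) 4 bytes of the sort key.  Per the
     * explanation above, the saved state is essentially just a position,
     * so every call recomputes the key from the start of the string.
     */
    do
    {
        n = ucol_nextSortKeyPart(coll, &iter, state, chunk,
                                 (int32_t) sizeof(chunk), &status);
        calls++;
    } while (U_SUCCESS(status) && n == (int32_t) sizeof(chunk));

    printf("sort key produced in %d partial calls\n", calls);
    ucol_close(coll);
    return 0;
}

Generating the whole key in one go with ucol_getSortKey() avoids that
quadratic behaviour, at the cost of buffering the complete key.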


Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra
Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck (michael.ba...@credativ.de) wrote:
 On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance. 

 It reads pg_control anyway, so couldn't we just take
 ControlFile->checkPoint?

 Other than that, basebackup.c seems to only look at pages which haven't
 been changed since the backup starting checkpoint (see above if
 statement). That's reasonable for backups, but is it just as reasonable
 for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

 Well, ok.
>>>
>>> The newly read page will have an updated LSN though then on the re-read,
>>> in which case basebackup can know that what happened was a rewrite of
>>> the page and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
> 
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'.  The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
> 

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails


 So how about we do check every page, but if one fails on retry, and the
 LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but comparing it to the checkpoint LSN. If it's
>> greater, the page may be torn or broken, and there's no way to know
>> which case it is - so basebackup simply skips it.
> 
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it.  IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
> 

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

 In any case, if we decide we really should skip the page if it is newer
 than the checkpoint, I think it makes sense to track those skipped pages
 and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is.  If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there.  That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
> 
> I'm still not sure I'm seeing the point of that.  They're still going to
> almost certainly be in the kernel cache.  The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
> 
> Maybe to put it another way- there's very little 

Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 09/17/2018 04:46 PM, Stephen Frost wrote:
> > * Michael Banck (michael.ba...@credativ.de) wrote:
> >> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> >>> Obviously, pg_verify_checksums can't do that easily because it's
> >>> supposed to run from outside the database instance. 
> >>
> >> It reads pg_control anyway, so couldn't we just take
> >> ControlFile->checkPoint?
> >>
> >> Other than that, basebackup.c seems to only look at pages which haven't
> >> been changed since the backup starting checkpoint (see above if
> >> statement). That's reasonable for backups, but is it just as reasonable
> >> for online verification?
> > 
> > Right, basebackup doesn't need to look at other pages.
> > 
> >>> The pg_verify_checksum apparently ignores this skip logic, because on
> >>> the retry it simply re-reads the page again, verifies the checksum and
> >>> reports an error. Which is broken, because the newly read page might be
> >>> torn again due to a concurrent write.
> >>
> >> Well, ok.
> > 
> > The newly read page will have an updated LSN though then on the re-read,
> > in which case basebackup can know that what happened was a rewrite of
> > the page and it no longer has to care about the page and can skip it.
> > 
> > I haven't looked, but if basebackup isn't checking the LSN again for the
> > newly read page then that'd be broken, but I believe it does (at least,
> > that's the algorithm we came up with for pgBackRest, and I know David
> > shared that when the basebackup code was being written).
> 
> Yes, basebackup does check the LSN on re-read, and skips the page if it
> changed on re-read (because it eliminates the consistency guarantees
> provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'.  The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

> >> So how about we do check every page, but if one fails on retry, and the
> >> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> > 
> > I thought that's what basebackup did- if it doesn't do that today, then
> > it really should.
> 
> The crucial distinction here is that the trick is not in comparing LSNs
> from the two page reads, but comparing it to the checkpoint LSN. If it's
> greater, the page may be torn or broken, and there's no way to know
> which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it.  IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.
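
To spell that out, here is a rough sketch of what that retry could look like
in a frontend checker.  This is not the committed basebackup.c or
pg_verify_checksums code: check_block(), read_block() and the two counters
are hypothetical names, and I am assuming the bufpage.h macros and
pg_checksum_page() are usable from frontend code the way the existing tool
uses them.

#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum.h"

/* hypothetical helpers and counters, not part of the real tool */
extern bool read_block(int fd, BlockNumber blkno, char *buf);
static long skipped_blocks = 0;
static long failed_blocks = 0;

static void
check_block(int fd, const char *path, BlockNumber blkno,
            XLogRecPtr checkpoint_lsn)
{
    char    buf[BLCKSZ];
    uint16  csum;

    if (!read_block(fd, blkno, buf))
        return;
    if (PageIsNew(buf))
        return;                 /* all-zero page, no checksum to verify */

    csum = pg_checksum_page(buf, blkno);
    if (csum == ((PageHeader) buf)->pd_checksum)
        return;                 /* checksum OK */

    /* Newer than the checkpoint?  WAL replay rewrites it anyway: skip. */
    if (PageGetLSN(buf) > checkpoint_lsn)
    {
        skipped_blocks++;
        return;
    }

    /* Otherwise assume a torn read and try once more. */
    if (!read_block(fd, blkno, buf))
        return;
    if (PageGetLSN(buf) > checkpoint_lsn)
    {
        skipped_blocks++;       /* rewritten concurrently, skip it */
        return;
    }

    csum = pg_checksum_page(buf, blkno);
    if (csum != ((PageHeader) buf)->pd_checksum)
    {
        fprintf(stderr, "checksum failure in %s, block %u\n", path, blkno);
        failed_blocks++;
    }
}

The ordering here follows the description above (verify the checksum first,
then compare LSNs), which is not quite what basebackup.c does today, since
it only verifies pages whose LSN already precedes the checkpoint.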

> >> In any case, if we decide we really should skip the page if it is newer
> >> than the checkpoint, I think it makes sense to track those skipped pages
> >> and print their number out at the end, if there are any.
> > 
> > Not sure what the point of this is.  If we wanted to really do something
> > to cross-check here, we'd track the pages that were skipped and then
> > look through the WAL to make sure that they're there.  That's something
> > we've talked about doing with pgBackRest, but don't currently.
> 
> I agree simply printing the page numbers seems rather useless. What we
> could do is remember which pages we skipped and then try checking them
> after another checkpoint. Or something like that.

I'm still not sure I'm seeing the point of that.  They're still going to
almost certainly be in the kernel cache.  The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point.  I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it.  I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

> >>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> >>> to protect against anything, really.
> >>
> >> Well, I've noticed that without it I get sporadic checksum failures on
> >> reread, so I've added it to make them go away. It was certainly a
> >> phenomenological decision that I am happy to trade 

Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra



On 09/17/2018 04:04 PM, Michael Banck wrote:
> Hi,
> 
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>> The patch is mostly copying the verification / retry logic from
>> basebackup.c, but I think it omitted a rather important detail that
>> makes it incorrect in the presence of concurrent writes.
>>
>> The very first thing basebackup does is this:
>>
>> startptr = do_pg_start_backup(...);
>>
>> i.e. it waits for a checkpoint, remembering the LSN. And then when
>> checking a page it does this:
>>
>>if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>>{
>>... verify the page checksum
>>}
>>
>> Obviously, pg_verify_checksums can't do that easily because it's
>> supposed to run from outside the database instance. 
> 
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
> 
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?
> 

I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking a
larger part of the database.
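
Something along those lines seems doable with the frontend control-file
reader, e.g. (a sketch only -- the exact get_controlfile() signature is from
memory, whether ControlFile->checkPoint or the redo pointer is the right
field to compare against would need to be settled, and the function name is
mine):

#include "postgres_fe.h"

#include "access/xlogdefs.h"
#include "catalog/pg_control.h"
#include "common/controldata_utils.h"

static XLogRecPtr
current_checkpoint_lsn(const char *datadir, const char *progname)
{
    bool             crc_ok;
    ControlFileData *control = get_controlfile(datadir, progname, &crc_ok);
    XLogRecPtr       lsn;

    if (!crc_ok)
    {
        fprintf(stderr, "%s: pg_control CRC mismatch\n", progname);
        exit(1);
    }

    lsn = control->checkPoint;
    pfree(control);             /* get_controlfile() allocates the struct */
    return lsn;
}

Reading it once at startup covers the basic case; calling it again between
files (or every N relations) would give the "refresh regularly" behaviour
without re-reading pg_control for every block.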

>> But the startptr detail is pretty important because it supports this
>> retry reasoning:
>>
>> /*
>>  * Retry the block on the first failure.  It's
>>  * possible that we read the first 4K page of the
>>  * block just before postgres updated the entire block
>>  * so it ends up looking torn to us.  We only need to
>>  * retry once because the LSN should be updated to
>>  * something we can ignore on the next pass.  If the
>>  * error happens again then it is a true validation
>>  * failure.
>>  */
>>
>> Imagine the 8kB page as two 4kB pages, with the initial state being
>> [A1,A2] and another process over-writing it with [B1,B2]. If you read
>> the 8kB page, what states can you see?
>>
>> I don't think POSIX provides any guarantees about atomicity of the write
>> calls (and even if it does, the filesystems on Linux don't seem to). So
>> you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes
>> of the two versions, depending on timing. Well, torn pages ...
>>
>> Pretty much the only thing you can rely on is that when one process does
>>
>> write([B1,B2])
>>
>> the other process may first read [A1,B2], but the next read will return
>> [B1,B2] (or possibly newer data, if there was another write). It will
>> not read the "stale" A1 again.
>>
>> The basebackup relies on this kinda implicitly - on the retry it'll
>> notice the LSN changed (thanks to the startptr check), and the page will
>> be skipped entirely. This is pretty important, because the new page
>> might be torn in some other way.
>>
>> The pg_verify_checksum apparently ignores this skip logic, because on
>> the retry it simply re-reads the page again, verifies the checksum and
>> reports an error. Which is broken, because the newly read page might be
>> torn again due to a concurrent write.
> 
> Well, ok.
>  
>> So IMHO this should do something similar to basebackup - check the page
>> LSN, and if it changed then skip the page.
>>
>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>> is used in basebackup. In particular we can't simply remember LSN from
>> the first read, because we might actually read [B1,A2] on the first try,
>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>> torn in various other ways, not necessarily at the 4kB boundary - it
>> might be torn right after the LSN, for example).
> 
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.
> 

But in that case the checksums are verified when reading the buffer into
shared buffers; it's not like we don't notice the checksum error at all.
We are interested in the pages that have not been read/written for an
extended period of time. So I think this is not a problem.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> 

Hmmm, maybe.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.
> 

I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.

>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>> to protect against anything, really.
> 
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological 

Re: Stored procedures and out parameters

2018-09-17 Thread Tom Lane
"Jonathan S. Katz"  writes:
> Just to chime in real quick: from the perspective of the RMT we did not
> look at these as a series of "right/wrong" options but what would make
> the most sense for the v11 release so the community could continue to
> improve support for stored procedures and make it easier for users to
> work with them going forward.

Yeah.  I think the real options we had here were:

1. Ship v11 with CALL as it stands, try to improve going forward.
2. Delay v11 for months while we figure out something better.
3. Remove CALL from v11, try again in v12.

Neither #2 nor #3 are at all attractive.

People will need to realize that CALL behavior is still a work in
progress.  That may mean that JDBC etc shouldn't try to support it
yet, which isn't great, but none of the above options would have
led to near-term support of CALL in JDBC ...

regards, tom lane



Re: Stored procedures and out parameters

2018-09-17 Thread Merlin Moncure
On Mon, Sep 17, 2018 at 7:45 AM Jonathan S. Katz  wrote:
>
> Hi,
>
> On 9/2/18 4:32 PM, Robert Haas wrote:
> > On Thu, Aug 30, 2018 at 4:14 PM, Dave Cramer  wrote:
> >> Reading this from the (JDBC) drivers perspective, which is probably a 
> >> fairly
> >> popular one,
> >> We now have a standard that we can't really support. Either the driver will
> >> have to support
> >> the new PROCEDURE with the {call } mechanism or stay with the existing
> >> FUNCTIONS.
> >> This puts the drivers in a no win situation.
> >
> > I understand and I agree.
> >
> >> Undoubtedly.. surely the opportunity to do something about this has not
> >> passed as this has not been
> >> officially released ?
> >
> > I agree with that, too, but I can't make other people do things they
> > don't want to do, and then there's the question of time.  On the basis
> > of previous experience, there is going to be little appetite for
> > holding up the release to address this issue no matter how badly
> > busted it is.  Ultimately, it's the RMT that must decide what to do in
> > cases like this.  I have confidence that they are watching this
> > thread, but I don't know what they will decide to do about it.
> >
>
> First, I want everyone to know that the RMT took this very seriously and
> took time collect feedback and consult with as many people as we could
> in order to make the best possible decision for v11 and ensure that any
> decision we made did not hinder any future implementation for stored
> procedures nor introduced something that would break backwards
> compatibility.
>
> Ultimately, the decision came down to one of four options:
>
> #1: Do nothing and remove the open item
> #2: Introduce nonstandard syntax for calling functions / procedures
> #3: Have CALL support both functions & procedures (or SELECT support
> calling functions)
> #4: Disable CALL
>
> The RMT has decided to go with option #1 and will be removing the open
> item for the PostgreSQL 11 release.

Hooray for making the right choice.   This is exhibit "Z" as to why
abstracting the SQL language behind a programming API can lead to
problems.  The workaround is to simply not do that and you can get
precise control of behavior.  It's a little unfortunate that API
{call} maps to a select but such is life.

merlin



Re: Online verification of checksums

2018-09-17 Thread Tomas Vondra


On 09/17/2018 04:46 PM, Stephen Frost wrote:
> Greetings,
> 
> * Michael Banck (michael.ba...@credativ.de) wrote:
>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>> Obviously, pg_verify_checksums can't do that easily because it's
>>> supposed to run from outside the database instance. 
>>
>> It reads pg_control anyway, so couldn't we just take
>> ControlFile->checkPoint?
>>
>> Other than that, basebackup.c seems to only look at pages which haven't
>> been changed since the backup starting checkpoint (see above if
>> statement). That's reasonable for backups, but is it just as reasonable
>> for online verification?
> 
> Right, basebackup doesn't need to look at other pages.
> 
>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>> the retry it simply re-reads the page again, verifies the checksum and
>>> reports an error. Which is broken, because the newly read page might be
>>> torn again due to a concurrent write.
>>
>> Well, ok.
> 
> The newly read page will have an updated LSN though then on the re-read,
> in which case basebackup can know that what happened was a rewrite of
> the page and it no longer has to care about the page and can skip it.
> 
> I haven't looked, but if basebackup isn't checking the LSN again for the
> newly read page then that'd be broken, but I believe it does (at least,
> that's the algorithm we came up with for pgBackRest, and I know David
> shared that when the basebackup code was being written).
> 

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

>>> So IMHO this should do something similar to basebackup - check the page
>>> LSN, and if it changed then skip the page.
>>>
>>> I'm afraid this requires using the last checkpoint LSN, the way startptr
>>> is used in basebackup. In particular we can't simply remember LSN from
>>> the first read, because we might actually read [B1,A2] on the first try,
>>> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
>>> torn in various other ways, not necessarily at the 4kB boundary - it
>>> might be torn right after the LSN, for example).
>>
>> I'd prefer to come up with a plan where we don't just give up once we
>> see a new LSN, if possible. If I run a modified pg_verify_checksums
>> which skips on newer pages in a tight benchmark, basically everything
>> gets skipped as checkpoints don't happen often enough.
> 
> I'm really not sure how you expect to be able to do something different
> here.  Even if we started poking into shared buffers, all you'd be able
> to see is that there's a bunch of dirty pages- and we don't maintain the
> checksums in shared buffers, so it's not like you could verify them
> there.
> 
> You could possibly have an option that says "force a checkpoint" but,
> honestly, that's really not all that interesting either- all you'd be
> doing is forcing all the pages to be written out from shared buffers
> into the kernel cache and then reading them from there instead, it's not
> like you'd actually be able to tell if there was a disk/storage error
> because you'll only be looking at the kernel cache.
> 

Yeah.

>> So how about we do check every page, but if one fails on retry, and the
>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
> 
> I thought that's what basebackup did- if it doesn't do that today, then
> it really should.
> 

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

>> In any case, if we decide we really should skip the page if it is newer
>> than the checkpoint, I think it makes sense to track those skipped pages
>> and print their number out at the end, if there are any.
> 
> Not sure what the point of this is.  If we wanted to really do something
> to cross-check here, we'd track the pages that were skipped and then
> look through the WAL to make sure that they're there.  That's something
> we've talked about doing with pgBackRest, but don't currently.
> 

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>> to protect against anything, really.
>>
>> Well, I've noticed that without it I get sporadic checksum failures on
>> reread, so I've added it to make them go away. It was certainly a
>> phenomenological decision that I am happy to trade for a better one.
> 
> That then sounds like we really aren't re-checking the LSN, and we
> really should be, to avoid getting these sporadic checksum failures on
> reread..
> 

Again, it's not enough to check the LSN against the 

Re: Stored procedures and out parameters

2018-09-17 Thread Jonathan S. Katz
On 9/17/18 11:47 AM, Vladimir Sitnikov wrote:

> Merlin>The workaround is to simply not do that and you can get
> Merlin>precise control of behavior
> 
> You are absolutely right.
> On top of that, the whole concept of DB-drivers and libpq is useless.
> Users should just simply exchange wire messages for precise control of
> behavior.

..

Merlin>> Hooray for making the right choice.

Just to chime in real quick: from the perspective of the RMT, we did not
look at these as a series of "right/wrong" options, but at what would make
the most sense for the v11 release, so the community could continue to
improve support for stored procedures and make it easier for users to
work with them going forward.

I would suggest we move forward and figure out what needs to be
implemented so people who use JDBC, ODBC, and other drivers are able to
fully take advantage of stored procedures the way they are used to.

Thanks,

Jonathan





Re: Stored procedures and out parameters

2018-09-17 Thread Vladimir Sitnikov
Merlin>The workaround is to simply not do that and you can get
Merlin>precise control of behavior

You are absolutely right.
On top of that, the whole concept of DB-drivers and libpq is useless.
Users should just simply exchange wire messages for precise control of
behavior.

Vladimir


Re: Code of Conduct plan

2018-09-17 Thread Chris Travers
On Mon, Sep 17, 2018 at 5:28 PM Joshua D. Drake 
wrote:

> On 09/17/2018 08:11 AM, Dmitri Maziuk wrote:
>
> On Sun, 16 Sep 2018 12:52:34 +
> Martin Mueller  
>  wrote:
>
>
> ... The overreach is dubious on both practical and theoretical grounds. 
> "Stick to your knitting " or the KISS principle seem good advice in this 
> context.
>
> Moderated mailing lists ain't been broken all these years, therefore they 
> need fixing. Obviously.
>
>
> Folks,
>
> At this point it is important to accept that the CoC is happening. We
> aren't going to stop that. The goal now is to ensure a CoC that is
> equitable for all community members and that has appropriate
> accountability. At hand, it appears the major concern is the CoC trying to
> be authoritative outside of community channels, as well as wording that is
> a bit far-reaching. Specifically, I think people's main concern is these two
> sentences:
>
> "To that end, we have established this Code of Conduct for community
> interaction and participation in the project’s work and the community at
> large. This Code is meant to cover all interaction between community
> members, whether or not it takes place within postgresql.org
> infrastructure, so long as there is not another Code of Conduct that takes
> precedence (such as a conference's Code of Conduct)."
>

Exactly.  And actually the first sentence is not new.  The second one is a
real problem though.  I am going to try one last time at an additional
alternative.

" To that end, we have established this Code of Conduct for community
interaction and participation in the project’s work and the community at
large.   This code of conduct covers all interaction between community
members on the postgresql.org infrastructure.  Conduct outside the
postgresql.org infrastructure may call the Code of Conduct committee to act
as long as the interaction (or interaction pattern) is community-related,
other parties are unable to act, and the Code of Conduct committee
determines that it is in the best interest of the community to apply this
Code of Conduct."

This solves a number of important problems.

1.  It provides a backstop (as Tom Lane suggested was needed) against a
conference refusing to enforce their own code of conduct in a way the
community finds acceptable, while the current wording does not provide any
backstop as long as there is a code of conduct for a conference.
2.  It provides a significant barrier to applying the code of conduct to,
say, political posts on, say, Twitter.
3.  It preserves the ability of the Code of Conduct Committee to act in the
case where someone takes a pattern of harassment off-list and
off-infrastructure.  And it avoids arguing whether Facebook's Community
Standards constitute "another Code of Conduct that takes precedence."

>
> If we can constructively provide feedback about those two sentences, great
> (or constructive feedback on other areas of the CoC). If we can't then this
> thread needs to stop. It has become unproductive.
>
> My feedback is that those two sentences provide an overarching authority
> that .Org does not have the right to enforce, and that they are also largely
> redundant because we accept the idea that if another CoC exists, then
> ours doesn't apply. Well, every single major collaboration channel we would
> be concerned with (including something like Blogger) has its own CoC within
> its Terms of use. That effectively neuters the PostgreSQL CoC within places
> like Slack, Facebook, Twitter etc...
>

Fascinating that this would, on its face, not apply to a harassment
campaign carried out over twitter, but it would apply to a few comments
made over drinks at a bar.

>
> JD
>
> --
> Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
> ***  A fault and talent of mine is to tell it exactly how it is.  ***
> PostgreSQL centered full stack support, consulting and development.
> Advocate: @amplifypostgres || Learn: https://postgresconf.org
> * Unless otherwise stated, opinions are my own.   *
>
>

-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: Multiple primary key on partition table?

2018-09-17 Thread amul sul
Nice catch, Rajkumar.

In index_check_primary_key(), relationHasPrimaryKey() is called only for an
ALTER TABLE command, but I think we need to call it in this case as well, like this:

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7eb3e35166..c8714395fe 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -223,7 +223,7 @@ index_check_primary_key(Relation heapRel,
 * and CREATE INDEX doesn't have a way to say PRIMARY KEY, so it's no
 * problem either.
 */
-   if (is_alter_table &&
+   if ((is_alter_table || heapRel->rd_rel->relispartition) &&
relationHasPrimaryKey(heapRel))
{
ereport(ERROR,

Thoughts?

Regards,
Amul
On Mon, Sep 17, 2018 at 4:51 PM Rajkumar Raghuwanshi
 wrote:
>
> Hi,
>
> I am able to create multiple primary keys on a partition table by executing
> the below statement.
>
> [edb@localhost bin]$ ./psql postgres
> psql (11beta3)
> Type "help" for help.
>
> postgres=# CREATE TABLE t1 (a int PRIMARY KEY,b int) PARTITION BY RANGE (a);
> CREATE TABLE
> postgres=# CREATE TABLE t1_p1 PARTITION OF t1 (b PRIMARY KEY) FOR VALUES FROM 
> (MINVALUE) TO (MAXVALUE);
> CREATE TABLE
> postgres=# \d+ t1_p1
>Table "public.t1_p1"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | 
> Description
> +-+---+--+-+-+--+-
>  a  | integer |   | not null | | plain   |  |
>  b  | integer |   | not null | | plain   |  |
> Partition of: t1 FOR VALUES FROM (MINVALUE) TO (MAXVALUE)
> Partition constraint: (a IS NOT NULL)
> Indexes:
> "t1_p1_pkey" PRIMARY KEY, btree (a)
> "t1_p1_pkey1" PRIMARY KEY, btree (b)
>
> Is it fine to allow this?
> Eventually it causes pg_upgrade to fail with the below error.
>
> pg_restore: creating TABLE "public.t1"
> pg_restore: creating TABLE "public.t1_p1"
> pg_restore: INFO:  partition constraint for table "t1_p1" is implied by 
> existing constraints
> pg_restore: creating CONSTRAINT "public.t1 t1_pkey"
> pg_restore: creating CONSTRAINT "public.t1_p1 t1_p1_pkey"
> pg_restore: creating CONSTRAINT "public.t1_p1 t1_p1_pkey1"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 2920; 2606 16395 CONSTRAINT 
> t1_p1 t1_p1_pkey1 edb
> pg_restore: [archiver (db)] could not execute query: ERROR:  multiple primary 
> keys for table "t1_p1" are not allowed
> Command was:
> -- For binary upgrade, must preserve pg_class oids
> SELECT 
> pg_catalog.binary_upgrade_set_next_index_pg_class_oid('16394'::pg_catalog.oid);
>
> ALTER TABLE ONLY "public"."t1_p1"
> ADD CONSTRAINT "t1_p1_pkey1" PRIMARY KEY ("b");
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation



Re: Code of Conduct plan

2018-09-17 Thread Joshua D. Drake

On 09/17/2018 08:11 AM, Dmitri Maziuk wrote:

On Sun, 16 Sep 2018 12:52:34 +
Martin Mueller  wrote:


... The overreach is dubious on both practical and theoretical grounds. "Stick to 
your knitting " or the KISS principle seem good advice in this context.

Moderated mailing lists ain't been broken all these years, therefore they need 
fixing. Obviously.


Folks,

At this point it is important to accept that the CoC is happening. We 
aren't going to stop that. The goal now is to ensure a CoC that is
equitable for all community members and that has appropriate
accountability. At hand, it appears the major concern is the CoC trying
to be authoritative outside of community channels, as well as wording
that is a bit far-reaching. Specifically, I think people's main concern
is these two sentences:


"To that end, we have established this Code of Conduct for community 
interaction and participation in the project’s work and the community at 
large. This Code is meant to cover all interaction between community 
members, whether or not it takes place within postgresql.org 
infrastructure, so long as there is not another Code of Conduct that 
takes precedence (such as a conference's Code of Conduct)."


If we can constructively provide feedback about those two sentences, 
great (or constructive feedback on other areas of the CoC). If we can't 
then this thread needs to stop. It has become unproductive.


My feedback is that those two sentences provide an overarching authority
that .Org does not have the right to enforce, and that they are also largely
redundant because we accept the idea that if another CoC exists,
then ours doesn't apply. Well, every single major collaboration channel
we would be concerned with (including something like Blogger) has its
own CoC within its Terms of use. That effectively neuters the PostgreSQL
CoC within places like Slack, Facebook, Twitter etc...


JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: Code of Conduct plan

2018-09-17 Thread Dmitri Maziuk
On Sun, 16 Sep 2018 12:52:34 +
Martin Mueller  wrote:

> ... The overreach is dubious on both practical and theoretical grounds. 
> "Stick to your knitting " or the KISS principle seem good advice in this 
> context. 

Moderated mailing lists ain't been broken all these years, therefore they need 
fixing. Obviously.

-- 
Dmitri Maziuk 



Re: Online verification of checksums

2018-09-17 Thread Stephen Frost
Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> > Obviously, pg_verify_checksums can't do that easily because it's
> > supposed to run from outside the database instance. 
> 
> It reads pg_control anyway, so couldn't we just take
> ControlFile->checkPoint?
> 
> Other than that, basebackup.c seems to only look at pages which haven't
> been changed since the backup starting checkpoint (see above if
> statement). That's reasonable for backups, but is it just as reasonable
> for online verification?

Right, basebackup doesn't need to look at other pages.

> > The pg_verify_checksum apparently ignores this skip logic, because on
> > the retry it simply re-reads the page again, verifies the checksum and
> > reports an error. Which is broken, because the newly read page might be
> > torn again due to a concurrent write.
> 
> Well, ok.

The newly read page will then have an updated LSN on the re-read, in
which case basebackup knows that what happened was a rewrite of the
page, so it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

> > So IMHO this should do something similar to basebackup - check the page
> > LSN, and if it changed then skip the page.
> > 
> > I'm afraid this requires using the last checkpoint LSN, the way startptr
> > is used in basebackup. In particular we can't simply remember LSN from
> > the first read, because we might actually read [B1,A2] on the first try,
> > and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> > torn in various other ways, not necessarily at the 4kB boundary - it
> > might be torn right after the LSN, for example).
> 
> I'd prefer to come up with a plan where we don't just give up once we
> see a new LSN, if possible. If I run a modified pg_verify_checksums
> which skips on newer pages in a tight benchmark, basically everything
> gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here.  Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.

> So how about we do check every page, but if one fails on retry, and the
> LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

> In any case, if we decide we really should skip the page if it is newer
> than the checkpoint, I think it makes sense to track those skipped pages
> and print their number out at the end, if there are any.

Not sure what the point of this is.  If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there.  That's something
we've talked about doing with pgBackRest, but don't currently.

> > FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> > to protect against anything, really.
> 
> Well, I've noticed that without it I get sporadic checksum failures on
> reread, so I've added it to make them go away. It was certainly a
> phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

> Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
> or some such file which pg_verify_checksums gets confused on, I guess we
> should skip that as well.  Can we assume that all files that start with
> the ones in skip[] are safe to skip or should we have an exception for
> files starting with pg_internal.init?

Everything listed in skip is safe to skip on a restore..  I've not
really thought too much about if they're all safe to skip when checking
checksums for an online system, but I would generally think so..

Thanks!

Stephen




Re: Strange OSX make check-world failure

2018-09-17 Thread Tom Lane
Chris Travers  writes:
> Logs are below.  This happens on master, and on 10.  I suspect it is an
> issue with something regarding ecpg.  Wondering what I am doing wrong.

"make check" generally won't work on OSX unless you've disabled SIP:

https://www.howtogeek.com/230424/how-to-disable-system-integrity-protection-on-a-mac-and-why-you-shouldnt/

That might not be the issue --- I'd have rather expected a failure
sooner --- but it's worth checking.

The reason why it doesn't work is basically that Apple sabotages
the DYLD_LIBRARY_PATH mechanism, causing the tests to load whatever
version of libpq.dylib (and the ecpg libraries) might exist in your
system library directories or the install target directory, rather
than the files in the build tree.  Possibly the reason you got this
far is that your install target is already reasonably up to date
for libpq, but not so much for ecpg.

regards, tom lane



Re: Online verification of checksums

2018-09-17 Thread Michael Banck
Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
> 
> The very first thing basebackup does is this:
> 
> startptr = do_pg_start_backup(...);
> 
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
> 
>if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>{
>... verify the page checksum
>}
> 
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance. 

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

> But the startptr detail is pretty important because it supports this
> retry reasoning:
> 
> /*
>  * Retry the block on the first failure.  It's
>  * possible that we read the first 4K page of the
>  * block just before postgres updated the entire block
>  * so it ends up looking torn to us.  We only need to
>  * retry once because the LSN should be updated to
>  * something we can ignore on the next pass.  If the
>  * error happens again then it is a true validation
>  * failure.
>  */
> 
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
> 
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
> 
> Pretty much the only thing you can rely on is that when one process does
> 
> write([B1,B2])
> 
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
> 
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

Well, ok.
 
> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
> 
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well.  Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data

Strange OSX make check-world failure

2018-09-17 Thread Chris Travers
Logs are below.  This happens on master, and on 10.  I suspect it is an
issue with something regarding ecpg.  Wondering what I am doing wrong.

== creating temporary instance==

== initializing database system   ==

== starting postmaster==

running on port 50853 with PID 20314

== creating database "ecpg1_regression"   ==

CREATE DATABASE

ALTER DATABASE

== creating database "ecpg2_regression"   ==

CREATE DATABASE

ALTER DATABASE

== creating role "regress_ecpg_user1" ==

CREATE ROLE

GRANT

GRANT

== creating role "regress_ecpg_user2" ==

CREATE ROLE

GRANT

GRANT

== running regression test queries==

test compat_informix/dec_test ... ok

test compat_informix/charfuncs ... ok

test compat_informix/rfmtdate ... ok

test compat_informix/rfmtlong ... ok

test compat_informix/rnull... ok

test compat_informix/sqlda... ok

test compat_informix/describe ... ok

test compat_informix/test_informix ... ok

test compat_informix/test_informix2 ... ok

test connect/test2... ok

test connect/test3... ok

test connect/test4... ok

test connect/test5... ok

test pgtypeslib/dt_test   ... stdout stderr FAILED (test process was
terminated by signal 6: Abort trap)

test pgtypeslib/dt_test2  ... stdout stderr FAILED (test process was
terminated by signal 6: Abort trap)

test pgtypeslib/num_test  ... stdout stderr FAILED (test process was
terminated by signal 6: Abort trap)

test pgtypeslib/num_test2 ... stdout stderr FAILED (test process was
terminated by signal 6: Abort trap)

test pgtypeslib/nan_test  ... ok

test preproc/array_of_struct  ... ok

test preproc/pointer_to_struct ... ok

test preproc/autoprep ... ok

test preproc/comment  ... ok

test preproc/cursor   ... ok

test preproc/define   ... ok

test preproc/init ... ok

test preproc/strings  ... ok

test preproc/type ... ok

test preproc/variable ... ok

test preproc/outofscope   ... ok

test preproc/whenever ... ok

test sql/array... ok

test sql/binary   ... ok

test sql/code100  ... ok

test sql/copystdout   ... ok

test sql/define   ... ok

test sql/desc ... ok

test sql/sqlda... stdout stderr FAILED (test process was
terminated by signal 6: Abort trap)

test sql/describe ... ok

test sql/dynalloc ... ok

test sql/dynalloc2... ok

test sql/dyntest  ... ok

test sql/execute  ... ok

test sql/fetch... ok

test sql/func ... ok

test sql/indicators   ... ok

test sql/oldexec  ... ok

test sql/quote... ok

test sql/show ... ok

test sql/insupd   ... ok

test sql/parser   ... ok

test thread/thread... ok

test thread/thread_implicit   ... ok

test thread/prep  ... ok

test thread/alloc ... ok

test thread/descriptor... ok

test sql/twophase ... stderr FAILED

== shutting down postmaster   ==


===

 6 of 56 tests failed.

===

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-17 Thread Chris Travers
On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin  wrote:

>
>
> > On 7. Sep 2018, at 17:57, Chris Travers 
> wrote:
> >
> > Hi;
> >
> > Attached is the patch we are fully testing at Adjust.  There are a few
> non-obvious aspects of the code around where the patch hits.  I have run
> make check on Linux and MacOS, and make check-world on Linux (check-world
> fails on MacOS on all versions and all branches due to ecpg failures).
> Automatic tests are difficult because it is due to a rare race condition which
> is difficult (or possibly impossible) to automatically create.  Our current
> approach to testing is to force the issue using a debugger.  Any ideas on how
> to reproduce automatically are appreciated but even on our heavily loaded
> systems this seems to be a very small portion of queries that hit this case
> (meaning the issue happens about once a week for us).
>
>
> I did some manual testing on it, putting breakpoints in the
> ResolveRecoveryConflictWithLock in the startup process on a streaming
> replica
> (configured with a very low max_standby_streaming_delay, i.e. 100ms) and
> to the
> posix_fallocate call on the normal backend on the same replica. At this
> point I
> also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1
> nonstop)
> and resumed execution on both the backend and the startup process.
>
> Then I simulated a conflict by creating a rather big (3GB) table on the
> master,
> doing some updates on it and then running an aggregate on the replica
> backend
> (i.e. 'select count(1) from test' with 'force_parallel_mode = true')
> where I
> set the breakpoint. The aggregate and force_parallel_mode ensured that
> the query was executed as a parallel one, leading to allocation of the DSM
>
> Once the backend reached the posix_fallocate call and was waiting on a
> breakpoint, I called 'vacuum full test' on the master that lead to a
> conflict
> on the replica running 'select from test' (in a vast majority of cases),
> triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
> process, since the startup process tried to cancel the conflicting
> backend. At
> that point I continued execution of the startup process (which would loop
> in
> CancelVirtualTransaction sending SIGUSR1 to the backend while the backend
> waited to be resumed from the breakpoint). Right after that, I changed the
> size
> parameter on the backend to something that would make posix_fallocate run
> for a
> bit longer, typically ten to a hundred MB. Once the backend process was
> resumed,
> it started receiving SIGUSR1 from the startup process, resulting in
> posix_fallocate exiting with EINTR.
>
> With the patch applied, the posix_fallocate loop terminated right away
> (because
> of QueryCancelPending flag set to true) and the backend went through the
> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
> Without the patch, it looped indefinitely in the dsm_impl_posix_resize,
> while
> the startup process was looping forever, trying to send SIGUSR1.
>
> One thing I’m wondering is whether we could do the same by just blocking
> SIGUSR1
> for the duration of posix_fallocate?
>

Many thanks!

If we were to do that, I would say we should mask all signals we can mask
during the call.

I don't have a problem going down that road instead if people think it is
better.

As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to
a system affected by this and expect that to be done this week.


>
> Cheers,
> Oleksii Kliukin



-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


SSL tests failing with "ee key too small" error on Debian SID

2018-09-17 Thread Michael Paquier
Hi all,

On a rather freshly-updated Debian SID server, I am able to see failures
for the SSL TAP tests:
2018-09-17 22:00:27.389 JST [13072] LOG:  database system is shut down
2018-09-17 22:00:27.506 JST [13082] FATAL:  could not load server
certificate file "server-cn-only.crt": ee key too small
2018-09-17 22:00:27.506 JST [13082] LOG:  database system is shut down
2018-09-17 22:00:27.720 JST [13084] FATAL:  could not load server
certificate file "server-cn-only.crt": ee key too small

Wouldn't it be better to rework the rules used to generate the different
certificates and reissue them in the tree?  It seems to me that this is
just waiting to fail in other platforms as well..

Thanks,
--
Michael




Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-17 Thread Chris Travers
First, I have attached a cleaned-up revision (pgindent run, removal of a
dangling comment, etc.).

On Fri, Sep 14, 2018 at 1:16 PM Thomas Munro 
wrote:

> On Sat, Sep 8, 2018 at 3:57 AM Chris Travers 
> wrote:
> > Attached is the patch we are fully testing at Adjust.
>
> Thanks!
>
> > I have run make check on Linux and MacOS, and make check-world on Linux
> (check-world fails on MacOS on all versions and all branches due to ecpg
> failures).
>
> FWIW it's entirely possible to get make check-world passing on a Mac.
> Maybe post the problem you're seeing to a new thread?
>

Will do.

>
> > ...
>
> > In the past it had been suggested we do PG_TRY(); and PG_CATCH(), but
> given that it is not consistent whether we can raise an error or whether we
> MUST raise an error, I don't see how this approach can work.  As far as I
> can see, we MUST raise an error in the appropriate spot if and only if
> elevel is set to a sufficient level.
>
> Yeah, your way looks a bit nicer than something involving PG_TRY().
>
> > Is there any feedback on this approach before I add it to the next
> commitfest?
>
> Please go ahead and add it.  Being a bug fix, we'll commit it sooner
> than the open commitfest anyway, but it's useful to have it in there.
>
> + if (errno == EINTR && elevel >= ERROR)
> + CHECK_FOR_INTERRUPTS();
>
> I think we might as well just CHECK_FOR_INTERRUPTS() unconditionally.
> In this branch elevel is always ERROR as you noted, and the code
> around there is confusing enough already.
>

The reason I didn't do that is future safety, and the off chance that
someone could do something strange with extensions today that might use
this in an unsafe way.  While I cannot think of any use case for calling
this in an unsafe way, I know for a fact that one might write extensions,
background workers, etc. to do things with this API that are not in our
current code right now.  For something that could be back-ported, I really
want to work as much as possible in a way that does not possibly break even
exotic extensions.

Longer term, I would like to see if I can help refactor this code so that
responsibility for error handling is clearer and such problems cannot
exist.  But that's not something to back-port.

>
> + } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>
> There seems to be a precedent for checking QueryCancelPending directly
> to break out of a loop in regcomp.c and syncrep.c.  So this seems OK.
>

Yeah, I checked.
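
For reference, here is a minimal sketch of the retry pattern being discussed
for dsm_impl_posix_resize(); fd and size stand for the arguments already
available in that code, and the error reporting done by the caller is elided,
so this is not the exact patch:

    int     rc;

    /* Retry on EINTR unless an interrupt is pending, then bail out. */
    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

    /*
     * If we gave up because of a pending interrupt, let the usual interrupt
     * handling report the cancellation or termination error before the
     * caller turns rc into an ereport().
     */
    if (rc == EINTR)
        CHECK_FOR_INTERRUPTS();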


> Hmm, I wonder if there should be an INTERRUPT_PENDING() macro that
> hides those details, but allows you to break out of a loop and then do
> some cleanup before CHECK_FOR_INTERRUPTS().


That is a good idea.

>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


racecond.patch
Description: Binary data


Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-17 Thread Oleksii Kliukin



> On 7. Sep 2018, at 17:57, Chris Travers  wrote:
> 
> Hi;
> 
> Attached is the patch we are fully testing at Adjust.  There are a few 
> non-obvious aspects of the code around where the patch hits.  I have run
> make check on Linux and MacOS, and make check-world on Linux (check-world
> fails on MacOS on all versions and all branches due to ecpg failures).
> Automatic tests are difficult because it is due to a rare race condition which is
> difficult (or possibly impossible) to automatically create.  Our current
> approach to testing is to force the issue using a debugger.  Any ideas on how to
> reproduce automatically are appreciated but even on our heavily loaded 
> systems this seems to be a very small portion of queries that hit this case 
> (meaning the issue happens about once a week for us).


I did some manual testing on it, putting breakpoints in the
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and to the
posix_fallocate call on the normal backend on the same replica. At this point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop)
and resumed execution on both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true')  where I
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM

Once the backend reached the posix_fallocate call and was waiting on a
breakpoint, I called 'vacuum full test' on the master that lead to a conflict
on the replica running 'select from test' (in a vast majority of cases),
triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
process, since the startup process tried to cancel the conflicting backend. At
that point I continued execution of the startup process (which would loop in
CancelVirtualTransaction sending SIGUSR1 to the backend while the backend
waited to be resumed from the breakpoint). Right after that, I changed the size
parameter on the backend to something that would make posix_fallocate run for a
bit longer, typically ten to a hundred MB. Once the backend process was resumed,
it started receiving SIGUSR1 from the startup process, resulting in
posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
of QueryCancelPending flag set to true) and the backend went through the
cleanup, showing an ERROR of cancelling due to the conflict with recovery.
Without the patch, it looped indefinitely in the dsm_impl_posix_resize, while
the startup process was looping forever, trying to send SIGUSR1.

One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
for the duration of posix_fallocate?

Cheers,
Oleksii Kliukin


Re: Stored procedures and out parameters

2018-09-17 Thread Jonathan S. Katz
Hi,

On 9/2/18 4:32 PM, Robert Haas wrote:
> On Thu, Aug 30, 2018 at 4:14 PM, Dave Cramer  wrote:
>> Reading this from the (JDBC) drivers perspective, which is probably a fairly
>> popular one,
>> We now have a standard that we can't really support. Either the driver will
>> have to support
>> the new PROCEDURE with the {call } mechanism or stay with the existing
>> FUNCTIONS.
>> This puts the drivers in a no win situation.
> 
> I understand and I agree.
> 
>> Undoubtedly.. surely the opportunity to do something about this has not
>> passed as this has not been
>> officially released ?
> 
> I agree with that, too, but I can't make other people do things they
> don't want to do, and then there's the question of time.  On the basis
> of previous experience, there is going to be little appetite for
> holding up the release to address this issue no matter how badly
> busted it is.  Ultimately, it's the RMT that must decide what to do in
> cases like this.  I have confidence that they are watching this
> thread, but I don't know what they will decide to do about it.
> 

First, I want everyone to know that the RMT took this very seriously and
took time to collect feedback and consult with as many people as we could
in order to make the best possible decision for v11, and to ensure that any
decision we made did not hinder any future implementation for stored
procedures nor introduce something that would break backwards
compatibility.

Ultimately, the decision came down to one of four options:

#1: Do nothing and remove the open item
#2: Introduce nonstandard syntax for calling functions / procedures
#3: Have CALL support both functions & procedures (or SELECT support
calling functions)
#4: Disable CALL

The RMT has decided to go with option #1 and will be removing the open
item for the PostgreSQL 11 release.

We understand that this will impact how drivers such as JDBC & ODBC will
support stored procedures for v11, and this was a detail we took into
great consideration. Through our discussions, we also know that there
are other ways that users can call stored procedures, and understand
that for people who are used to programming with the JDBC/ODBC
interfaces this is considered a "workaround."

We hope that the community can continue to improve the stored procedure
functionality for v12 and that there will be continued work on CALL such
that we can make it easier for our driver maintainers to make stored
procedures more easily available for our users. Personally, I know this
is a feature that many people are very excited for, and I look forward
to future work that will continue to improve upon what we are releasing
in v11.

While some of you may be disappointed that we are removing the open
item, we do hope this frees the group up to discuss, plan, and implement
a solution for v12 without the pressure of a release deadline.

Sincerely,

Jonathan





Re: XMLNAMESPACES (was Re: Clarification of nodeToString() use cases)

2018-09-17 Thread Andrew Dunstan




On 09/16/2018 11:11 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 09/16/2018 02:05 PM, Tom Lane wrote:

The change the attached patch makes is to represent a DEFAULT namespace
as a NULL list entry, rather than a T_String Value node containing a
null.

Seems related to this CF item that's been around for a year:

?

Hm, seems like that is operating at the next level down; it starts with
XmlTableSetNamespace's response to a null "name" argument, whereas what
I'm on about is what happens before we get to that function.

There's quite a bit I don't like about that patch now that I look at it
:-(, but I don't think it's relevant to this thread.





OK, good. Please do add your comments to the appropriate thread and 
change the CF status if required.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Connection slots reserved for replication

2018-09-17 Thread Alexander Kukushkin
Hi,

2018-09-14 12:23 GMT+02:00 Masahiko Sawada :

>> 2. If we know that this is neither superuser nor replication connection, we
>> should check that there are at least (superuser_reserved_connections +
>> NumWalSenders() - max_wal_senders) connection slots are available.
>
> You wanted to mean (superuser_reserved_connections + max_wal_senders -
> NumWalSenders()) in the second point?

Sure, my bad. I made a mistake when writing the email, but in the attached
file it looks good.
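
To illustrate the corrected formula, here is a rough sketch of the check as
it could look in InitPostgres() - not the exact patch; ReservedBackends and
HaveNFreeProcs() already exist in core, while NumWalSenders() and
replication_reserved_connections come from the attached patch:

    /*
     * Reject an ordinary connection if accepting it would eat into the
     * slots reserved for superusers or for WAL senders that have not
     * started yet.
     */
    if (!am_superuser && !am_walsender &&
        (ReservedBackends > 0 || replication_reserved_connections > 0) &&
        !HaveNFreeProcs(ReservedBackends +
                        Max(replication_reserved_connections -
                            (int) NumWalSenders(), 0)))
        ereport(FATAL,
                (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
                 errmsg("remaining connection slots are reserved for "
                        "replication and superuser connections")));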

>
> One argrable point of the second option could be that it breaks
> backward compatibility of the parameter configurations. That is, the
> existing systems need to re-configure the max_connections. So it might
> be better to take the first option with
> replication_reservd_connections = 0 by default.

Please find attached the new version of the patch, which introduces the
replication_reserved_connections GUC.

Regards,
--
Alexander Kukushkin
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e1073ac6d3..80e6ef9f67 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3059,6 +3059,32 @@ include_dir 'conf.d'

   
 
+ 
+  replication_reserved_connections
+  (integer)
+  
+   replication_reserved_connections configuration parameter
+  
+  
+  
+   
+Determines the number of connection slots that
+are reserved for replication connections. Whenever the number
+of active concurrent connections is at least
+max_connections minus
+replication_reserved_connections plus
+number of active wal senders, new
+non-superuser and non-replication connections will not be accepted.
+   
+
+   
+The default value is zero. The value should not exceed max_wal_senders.
+This parameter can only be set at server start.
+   
+  
+ 
+
   
max_replication_slots (integer)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 305ff36258..a5a95ee92c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -903,6 +903,10 @@ PostmasterMain(int argc, char *argv[])
 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
 (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+	if (replication_reserved_connections > max_wal_senders)
+		ereport(WARNING,
+(errmsg("Value of replication_reserved_connections (%d) exceeds value of max_wal_senders (%d)",
+		replication_reserved_connections, max_wal_senders)));
 
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 370429d746..e64d5ed44d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -122,6 +122,8 @@ int			max_wal_senders = 0;	/* the maximum number of concurrent
 int			wal_sender_timeout = 60 * 1000; /* maximum time to send one WAL
 			 * data message */
 bool		log_replication_commands = false;
+int			replication_reserved_connections = 0; /* the number of connection slots
+   * reserved for replication connections */
 
 /*
  * State for WalSndWakeupRequest
@@ -2284,6 +2286,10 @@ InitWalSenderSlot(void)
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
 			walsnd->latch = &MyProc->procLatch;
+
+			/* increment the number of allocated wal sender slots */
+			pg_atomic_fetch_add_u32(&WalSndCtl->num_wal_senders, 1);
+
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -2317,6 +2323,10 @@ WalSndKill(int code, Datum arg)
 	walsnd->latch = NULL;
 	/* Mark WalSnd struct as no longer being in use. */
 	walsnd->pid = 0;
+
+	/* decrement the number of allocated wal sender slots */
+	pg_atomic_fetch_sub_u32(&WalSndCtl->num_wal_senders, 1);
+
 	SpinLockRelease(&walsnd->mutex);
 }
 
@@ -3033,6 +3043,7 @@ WalSndShmemInit(void)
 	{
 		/* First time through, so initialize */
 		MemSet(WalSndCtl, 0, WalSndShmemSize());
+		pg_atomic_init_u32(&WalSndCtl->num_wal_senders, 0);
 
 		for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
 			SHMQueueInit(&(WalSndCtl->SyncRepQueue[i]));
@@ -3587,3 +3598,9 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/* Return the amount of allocated wal_sender slots */
+uint32 NumWalSenders(void)
+{
+	return pg_atomic_read_u32(&WalSndCtl->num_wal_senders);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 5ef6315d20..436574e85d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -789,17 +789,28 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	}
 
 	/*
-	 * The last few connection slots are reserved for superusers.  Although
-	 * replication connections currently require superuser privileges, we
-	 * don't allow them to consume 

Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-09-17 Thread Michael Paquier
On Fri, Sep 14, 2018 at 12:35:54PM -0300, Alvaro Herrera wrote:
> On 2018-Sep-13, Michael Paquier wrote:
>> Improve autovacuum logging for aggressive and anti-wraparound runs
>> 
>> A log message was being generated when log_min_duration is reached for
>> autovacuum on a given relation to indicate if it was an aggressive run,
>> and missed the point of mentioning if it is doing an anti-wraparound
>> run.  The log message generated is improved so that one, both, or no extra
>> details are added depending on the options set.
> 
> Hmm, can a for-wraparound vacuum really not be aggressive?  I think one
> of those four cases is really dead code.

Sure, at the same time it is a no-brainer to keep the code as is, and
seeing log messages where (!aggressive && wraparound) would be an
indication of surrounding bugs, no?  There have been issues in this area
in the past.
--
Michael




Re: Tid scan improvements

2018-09-17 Thread David Rowley
On 15 August 2018 at 11:11, Edmund Horner  wrote:

> So we'd extend that to:
>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
> ctid < ?) (either side could be optional, and we have to deal with >=
> and <= and having ctid on the rhs, etc.).
>   - Cost the range subquals by assuming they don't overlap, and
> estimating how many blocks and tuples they span.
>   - When beginning the scan, evaluate all the ?s and build an array of
> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
> and an ending tid, and might just be a single tid item.
>   - Sort and remove duplicates.
>   - Iterate over the array, using a single fetch for single-item tid
> ranges, and starting/ending a heap scan for multi-item tid ranges.
>
> I think I'll try implementing this.
>

I've set this patch as waiting on author in the commitfest app.
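
For reference, a minimal sketch of the "tid range" structure described in the
quoted plan above; the names are hypothetical and not from any posted patch:

    typedef struct TidRange
    {
        ItemPointerData first;      /* lowest TID in the range (inclusive) */
        ItemPointerData last;       /* highest TID in the range (inclusive) */
    } TidRange;

    /*
     * After evaluating the quals, the ranges would be sorted and
     * de-duplicated, then scanned: a single tuple fetch when first == last,
     * otherwise a bounded heap scan from first to last.
     */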


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Multiple primary key on partition table?

2018-09-17 Thread Rajkumar Raghuwanshi
Hi,

I am able to create multiple primary keys on a partition table by executing
the below statement.

[edb@localhost bin]$ ./psql postgres
psql (11beta3)
Type "help" for help.

postgres=# CREATE TABLE t1 (a int PRIMARY KEY,b int) PARTITION BY RANGE (a);
CREATE TABLE
postgres=# CREATE TABLE t1_p1 PARTITION OF t1 (b PRIMARY KEY) FOR VALUES
FROM (MINVALUE) TO (MAXVALUE);
CREATE TABLE
postgres=# \d+ t1_p1
   Table "public.t1_p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target
| Description
+-+---+--+-+-+--+-
 a  | integer |   | not null | | plain   |
|
 b  | integer |   | not null | | plain   |
|
Partition of: t1 FOR VALUES FROM (MINVALUE) TO (MAXVALUE)
Partition constraint: (a IS NOT NULL)
Indexes:
"t1_p1_pkey" PRIMARY KEY, btree (a)
"t1_p1_pkey1" PRIMARY KEY, btree (b)

Is it fine to allow this?
Eventually it causes pg_upgrade to fail with the below error.

pg_restore: creating TABLE "public.t1"
pg_restore: creating TABLE "public.t1_p1"
pg_restore: INFO:  partition constraint for table "t1_p1" is implied by
existing constraints
pg_restore: creating CONSTRAINT "public.t1 t1_pkey"
pg_restore: creating CONSTRAINT "public.t1_p1 t1_p1_pkey"
pg_restore: creating CONSTRAINT "public.t1_p1 t1_p1_pkey1"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2920; 2606 16395
CONSTRAINT t1_p1 t1_p1_pkey1 edb
pg_restore: [archiver (db)] could not execute query: ERROR:  multiple
primary keys for table "t1_p1" are not allowed
Command was:
-- For binary upgrade, must preserve pg_class oids
SELECT
pg_catalog.binary_upgrade_set_next_index_pg_class_oid('16394'::pg_catalog.oid);

ALTER TABLE ONLY "public"."t1_p1"
ADD CONSTRAINT "t1_p1_pkey1" PRIMARY KEY ("b");

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: More deficiencies in outfuncs/readfuncs processing

2018-09-17 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Patch 0002 addresses several more-or-less-independent issues that are
> exposed by running the regression tests with patch 0001 activated:
> 
> * Query.withCheckOptions fails to propagate through write+read, because
> it's intentionally ignored by outfuncs/readfuncs on the grounds that
> it'd always be NIL anyway in stored rules.  This seems like premature
> optimization of the worst sort, since it's not even saving very much:
> if the assumption holds, what's suppressed from a stored Query is only
> " :withCheckOptions <>", hardly a big savings.  And of course if the
> assumption ever fails to hold, it's just broken, and broken in a non
> obvious way too (you'd only notice if you expected a check option
> failure and didn't get one).  So this is pretty dumb and I think we
> ought to fix it by treating that field normally in outfuncs/readfuncs.
> That'd require a catversion bump, but we only need to do it in HEAD.
> The only plausible alternative is to change _equalQuery to ignore
> withCheckOptions, which does not sound like a good idea at all.

I'm fine with this change (as I believe I've said before...).  I agree
that it's just a minor optimization and shouldn't be an issue to remove
it.
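
For the archives, treating the field normally would presumably just mean
adding the usual macros in outfuncs.c and readfuncs.c, roughly like this
(untested sketch):

    /* in _outQuery() */
    WRITE_NODE_FIELD(withCheckOptions);

    /* in _readQuery() */
    READ_NODE_FIELD(withCheckOptions);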

Thanks!

Stephen




Re: Collation versioning

2018-09-17 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> I'd like to propose the 3 more humble goals I mentioned a few messages
> back as earlier steps.  OS collation changes aren't really like Monty
> Python's Spanish Inquisition: they usually hit you when you're doing
> major operating system upgrades or setting up a streaming replica to a
> different OS version IIUC.  That is, they probably happen during
> maintenance windows when REINDEX would hopefully be plausible, and
> presumably critical systems get tested on the new OS version before
> production is upgraded.  It'd be kind to our users to make the problem
> non-silent at that time so they can plan for it (and of course also
> alert them if it happens when nobody expects it, because knowing you
> have a problem is better than not knowing).

Just to be clear, I'm all for this, but wanted to bring up the farther
out goal to make sure we're thinking about how to eventually get there
from here- and to make sure we aren't making it more difficult to get
there with the proposed catalog changes for these shorter-term goals.

Thanks!

Stephen




Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-09-17 Thread Andrey Borodin
Hi!

> On 17 Sept 2018, at 2:03, Alexander Korotkov wrote:
> 
> Also, it appears to me that it's OK to be a single patch

+1, ISTM that these 6 patches represent an atomic unit of work.

Best regards, Andrey Borodin.


Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-09-17 Thread Ashutosh Bapat
On Thu, Sep 13, 2018 at 1:45 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Fri, 31 Aug 2018 at 08:23, Ashutosh Bapat <
> ashutosh.ba...@enterprisedb.com> wrote:
> >
> > On Thu, Aug 30, 2018 at 2:23 PM, Dmitry Dolgov <9erthali...@gmail.com>
> wrote:
> > >
> > >> I won't be working on this actively in the next commitfest. I will be
> > >> glad if somebody else wants to take this up. If there's nobody,
> > >> probably we should mark this entry as "returned with feedback" in the
> > >> next commitfest.
> > >
> > > Since I'm more or less familiar with the code and I believe it's an
> interesting
> > > feature, I can try to take over it for now if you don't mind (but
> without any
> > > strong commitments to make it perfectly shining for the next CF).
> >
> > Please do. Thanks.
>
> I've noticed that the patch is outdated already, so here is the rebased
> version. I also removed the last part with the extra tests since it
> was something
> internal


I am fine with that. It was never meant to be committed. I used to run
those tests to make sure that any changes to the core logic did not break
any working scenario. Whenever I found a new failure in the extra tests
which wasn't there in the tests to be committed, I used to move that test
from the first set to the second. Over time, the number of new failures in
extra has reduced, and recently I didn't see any extra failures. So, maybe
it's time for the extra tests to be dropped. I would suggest keeping the
extra tests running from time to time, and certainly around the time the
feature gets committed.


> and merged the debug message into the implementation part. Ashutosh,
> please let me know if you're not happy with these modifications.
>

Robert Haas raised objections, and I agreed with them, about a similar debug
message I had included in the basic partition-wise join patches. I think
those reasons still apply, so you will need to remove the debug message
before the patches get committed. That said, the debug message is a good
debugging aid, so keeping it around till that time is a good idea.


>
> Other than that I haven't changed anything yet, but hope this will come
> soon.
> And this version is more to keep it updated for those people who may be
> interested.
>

Thanks.

--
Best Wishes,
Ashutosh Bapat


Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-09-17 Thread David Rowley
On 23 August 2018 at 00:30, David Rowley 
wrote:

> I've attached a v8. The only change from your v7 is in the "go making"
> comment.
>

v9 patch attached. Fixes conflict with 6b78231d.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v9-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
Description: Binary data


Re: infinite loop in parallel hash joins / DSA / get_best_segment

2018-09-17 Thread Thomas Munro
On Mon, Sep 17, 2018 at 10:42 AM Thomas Munro
 wrote:
> On Mon, Sep 17, 2018 at 10:38 AM Tomas Vondra
>  wrote:
> > While performing some benchmarks on REL_11_STABLE (at 55c2d9), I've
> > repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
> > what exactly are the triggering conditions, but the symptoms are these:
> >
> > ...
>
> Urgh.  Thanks Tomas.  I will investigate.

Thanks very much to Tomas for giving me access to his benchmarking
machine where this could be reproduced.  Tomas was doing performance
testing with no assertions, but with a cassert built I was able to hit
an assertion failure after a while and eventually figure out what was
going wrong.  The problem is that the 'segment bins' (linked lists
that group segments by the largest contiguous run of free pages) can
become corrupted when segments become completely free and are returned
to the operating system and then the same segment slot (index number)
is recycled, with the right sequence of allocations and frees and
timing.  There is an LWLock that protects segment slot and bin
manipulations, but there is a kind of ABA problem where one backend
can finish up looking at the defunct former inhabitant of a slot that
another backend has recently created a new segment in.  There is
handling for that in the form of freed_segment_counter, a kind of
generation/invalidation signalling, but there are a couple of paths
that fail to check it at the right times.

With the attached draft patch, Tomas's benchmark script runs happily
for long periods.  A bit more study required with fresh eyes,
tomorrow.
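
For those following along, the fix amounts to re-checking the freed-segment
counter in the paths that were missing it.  A minimal sketch of the pattern,
with names simplified relative to the real dsa.c code, where area and npages
come from the surrounding (hypothetical) function and search_segment_bins()
stands in for the actual bin-walking logic:

    /* The caller is assumed to hold the lock protecting the segment bins. */
    for (;;)
    {
        size_t      observed = area->control->freed_segment_counter;
        dsa_segment_map *segment_map;

        segment_map = search_segment_bins(area, npages);

        if (area->control->freed_segment_counter == observed)
            return segment_map;     /* no slot was recycled underneath us */

        /* A segment was freed and its slot may have been reused: retry. */
    }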

-- 
Thomas Munro
http://www.enterprisedb.com


fix-dsa-segment-free-bug.patch
Description: Binary data


Re: XMLNAMESPACES (was Re: Clarification of nodeToString() use cases)

2018-09-17 Thread Andrey Lepikhov




16.09.2018 23:05, Tom Lane writes:

Andrey Lepikhov  writes:

The reason is: parse tree node for XMLNAMESPACES clause has null pointer
in the case of DEFAULT namespace (the pointer will be initialized at
executor on the first call).



My immediate reaction is that somebody made a bad decision about how
to represent XMLNAMESPACES clauses.  It's not quite too late to fix
that; but could you offer a concrete example that triggers this,
so it's clear what case we're talking about?

The change the attached patch makes is to represent a DEFAULT namespace
as a NULL list entry, rather than a T_String Value node containing a
null.  This approach does work for all backend/nodes/ operations, but
it could be argued that it's still a crash hazard for unsuspecting code.
We could do something else, like use a T_Null Value to represent DEFAULT,
but I'm doubtful that that's really much better.  A NULL entry has the
advantage (or at least I'd consider it one) of being a certain crash
rather than a probabilistic crash for any uncorrected code accessing
the list item.  Thoughts?


You correctly defined the problem in a more general formulation in your
next thread [1].
Thank you for this patch. Maybe it is not a universal solution for
DEFAULT values, but now it works fine.
Note, however, that if we emphasized in the comments the DEBUG purpose of
nodeToString(), it could reduce similar mistakes and questions in the future.


[1] More deficiencies in outfuncs/readfuncs processing:
https://www.postgresql.org/message-id/17114.1537138...@sss.pgh.pa.us

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



View to get all the extension control file details

2018-09-17 Thread Haribabu Kommi
Hi Hackers,

Currently PostgreSQL provides the following views to get extension-specific
details:

pg_available_extensions - Name, default_version, installed_version, comment

pg_available_extension_versions - Name, version, installed, superuser,
relocatable, schema, requires, comment

But these miss the "directory", "module_pathname" and "encoding"
extension-specific information, which is not available even through the
extension-specific functions. There are some extensions whose library name
differs from the extension name; the pgpool_recovery extension's library
name is pgpool-recovery.so ('_' vs. '-'). While developing a tool on top of
PostgreSQL, we ran into this problem; it could be solved easily if the
server exposed the details that I got from the extension control file.

Any opinion on adding a new view like "pg_available_extension_details" to
display all extension control file columns, or would adding them to the
existing views be better?

Regards,
Haribabu Kommi
Fujitsu Australia