Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-09-13 Thread Michael Paquier
On Mon, Sep 13, 2021 at 04:09:26PM -0400, Tom Lane wrote:
> I got around to looking at this issue today, and verified that only one
> place needs to be changed, as attached.

Thanks!  This looks fine to me at quick glance.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-09-13 Thread Tom Lane
Michael Paquier  writes:
> On Sun, May 30, 2021 at 08:25:00PM -0500, Justin Pryzby wrote:
>> ..But I think it's not useful to put details into errorMessage on success,
>> unless you're going to document that.  It would never have occurred to me to
>> look there, or that it would even be safe.

> Yeah.  On the contrary, it could be confusing if one sees an error
> message but there is nothing to worry about, because things are
> working in the scope of what the user wanted at connection time.

I got around to looking at this issue today, and verified that only one
place needs to be changed, as attached.

Although I was initially thinking that maybe we should leave the code
as-is, I now agree that resetting errorMessage is a good idea, because
what tends to be in it at this point is something like

"connection to server on socket "/tmp/.s.PGSQL.5432" failed: "

(ie the string made by emitHostIdentityInfo).  Anybody who does
look at that is likely to be confused, because the connection
*didn't* fail.

There might be some value in my original idea of preserving a trace of
the servers we tried before succeeding.  But it would take additional
work to present it in a non-confusing way, and given the lack of any
field requests for that, I'm not excited about doing it right now.
(One could also argue that it ought to get tied into the PQtrace
facilities somehow, rather than being done in this ad-hoc way.)

regards, tom lane

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 49eec3e835..b288d346f9 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3654,6 +3654,13 @@ keep_going:		/* We will come back to here until there is
 /* We can release the address list now. */
 release_conn_addrinfo(conn);
 
+/*
+ * Contents of conn->errorMessage are no longer interesting
+ * (and it seems some clients expect it to be empty after a
+ * successful connection).
+ */
+resetPQExpBuffer(>errorMessage);
+
 /* We are open for business! */
 conn->status = CONNECTION_OK;
 return PGRES_POLLING_OK;


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-10 Thread Tom Lane
I wrote:
> Now I'm inclined to go back to the first-draft patch I had, which just
> dropped the first problematic test case, and added gssencmode=disable
> to the second one.

Done that way.  If we figure out why the GSS code is acting strangely
on hamerkop, maybe this can be reverted --- but it seems like we already
spent more time than is justified looking for band-aids.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-09 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jun 09, 2021 at 12:05:10PM -0400, Tom Lane wrote:
>> Here's a draft patch that renames regress_ecpg_user2 to ecpg2_regression,

> Using ecpg2_regression for the role goes a bit against the recent rule
> to not create any role not suffixed by "regress_" as part of the
> regression tests, but I am fine to live with that here.

Oh dear, I forgot to check that carefully.  I'd been thinking the rule was
that such names must *contain* "regress", but looking at user.c, it's
stricter:

#ifdef ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
if (strncmp(stmt->role, "regress_", 8) != 0)
elog(WARNING, "roles created by regression test cases should 
have names starting with \"regress_\"");
#endif

Meanwhile, the rule for database names is:

#ifdef ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS
if (IsUnderPostmaster && strstr(dbname, "regression") == NULL)
elog(WARNING, "databases created by regression test cases 
should have names including \"regression\"");
#endif

So unless we want to relax one or both of those, we can't have a user
name that matches the database name.

Now I'm inclined to go back to the first-draft patch I had, which just
dropped the first problematic test case, and added gssencmode=disable
to the second one.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-09 Thread Michael Paquier
On Wed, Jun 09, 2021 at 12:05:10PM -0400, Tom Lane wrote:
> Here's a draft patch that renames regress_ecpg_user2 to ecpg2_regression,
> which matches the name of one of the databases used, allowing the test
> cases with defaulted database name to succeed.  That gets rid of one of
> the problematic diffs.

Yeah, I agree that this does not matter much for this one, as we want
to stress the quotes and the grammar for the connections here, as
99a5619 implies.  It is good to check for the failure path as well, so
what you have here looks fine to me.

> As it stood, though, that meant that connect/test5
> wasn't exercising the connection-failure code path at all, which didn't
> seem like what we want.  So I adjusted the second place that had been
> failing to again fail on no-such-database, and stuck in gssencmode=disable
> so that we shouldn't get any test diff on hamerkop.

Using ecpg2_regression for the role goes a bit against the recent rule
to not create any role not suffixed by "regress_" as part of the
regression tests, but I am fine to live with that here.

The changes for test1 with MinGW look right, I have not been able to
test them.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-09 Thread Tom Lane
I wrote:
> ...  I'd be okay with dropping that test; or maybe we could
> fix things so that the default case succeeds?

Here's a draft patch that renames regress_ecpg_user2 to ecpg2_regression,
which matches the name of one of the databases used, allowing the test
cases with defaulted database name to succeed.  That gets rid of one of
the problematic diffs.  As it stood, though, that meant that connect/test5
wasn't exercising the connection-failure code path at all, which didn't
seem like what we want.  So I adjusted the second place that had been
failing to again fail on no-such-database, and stuck in gssencmode=disable
so that we shouldn't get any test diff on hamerkop.

regards, tom lane

diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index be53b7b94d..abea3fcc85 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -77,7 +77,7 @@ $(remaining_files_build): $(abs_builddir)/%: $(srcdir)/%
 endif
 
 # Common options for tests. Also pick up anything passed in EXTRA_REGRESS_OPTS
-REGRESS_OPTS = --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,regress_ecpg_user2 $(EXTRA_REGRESS_OPTS)
+REGRESS_OPTS = --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,ecpg2_regression $(EXTRA_REGRESS_OPTS)
 
 check: all
 	$(with_temp_install) ./pg_regress $(REGRESS_OPTS) --temp-instance=./tmp_check $(TEMP_CONF) --bindir= $(pg_regress_locale_flags) $(THREAD) --schedule=$(srcdir)/ecpg_schedule sql/twophase
diff --git a/src/interfaces/ecpg/test/connect/test1.pgc b/src/interfaces/ecpg/test/connect/test1.pgc
index 961bd72ef2..77e571ec86 100644
--- a/src/interfaces/ecpg/test/connect/test1.pgc
+++ b/src/interfaces/ecpg/test/connect/test1.pgc
@@ -26,16 +26,16 @@ exec sql end declare section;
 	exec sql connect to ecpg2_regression@localhost as main;
 	exec sql disconnect main;
 
-	exec sql connect to @localhost as main user regress_ecpg_user2;
+	exec sql connect to @localhost as main user ecpg2_regression;
 	exec sql disconnect main;
 
-	/* exec sql connect to :@TEMP_PORT@ as main user regress_ecpg_user2;
+	/* exec sql connect to :@TEMP_PORT@ as main user ecpg2_regression;
 	exec sql disconnect main; */
 
 	exec sql connect to tcp:postgresql://localhost/ecpg2_regression user regress_ecpg_user1 identified by connectpw;
 	exec sql disconnect;
 
-	exec sql connect to tcp:postgresql://localhost/ user regress_ecpg_user2;
+	exec sql connect to tcp:postgresql://localhost/ user ecpg2_regression;
 	exec sql disconnect;
 
 	strcpy(pw, "connectpw");
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index e712fa8778..96a7b3e892 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -21,7 +21,7 @@ exec sql end declare section;
 	ECPGdebug(1, stderr);
 
 	exec sql connect to ecpg2_regression as main;
-	exec sql alter user regress_ecpg_user2 ENCRYPTED PASSWORD 'insecure';
+	exec sql alter user ecpg2_regression ENCRYPTED PASSWORD 'insecure';
 	exec sql alter user regress_ecpg_user1 ENCRYPTED PASSWORD 'connectpw';
 	exec sql commit;
 	exec sql disconnect;  /* <-- "main" not specified */
@@ -40,7 +40,7 @@ exec sql end declare section;
 	exec sql connect to 'ecpg2_regression' as main;
 	exec sql disconnect main;
 
-	exec sql connect to as main user regress_ecpg_user2/insecure;
+	exec sql connect to as main user ecpg2_regression/insecure;
 	exec sql disconnect main;
 
 	exec sql connect to ecpg2_regression as main user regress_ecpg_user1/connectpw;
@@ -61,7 +61,7 @@ exec sql end declare section;
 	exec sql connect to "unix:postgresql://200.46.204.71/ecpg2_regression" as main user regress_ecpg_user1/connectpw;
 	exec sql disconnect main;
 
-	exec sql connect to unix:postgresql://localhost/ as main user regress_ecpg_user2 IDENTIFIED BY insecure;
+	exec sql connect to "unix:postgresql://localhost/no_such_db?gssencmode=disable" as main user ecpg2_regression IDENTIFIED BY insecure;
 	exec sql disconnect main;
 
 	/* connect twice */
diff --git a/src/interfaces/ecpg/test/expected/connect-test1-minGW32.stderr b/src/interfaces/ecpg/test/expected/connect-test1-minGW32.stderr
index c5b5248eb2..e5f16bd854 100644
--- a/src/interfaces/ecpg/test/expected/connect-test1-minGW32.stderr
+++ b/src/interfaces/ecpg/test/expected/connect-test1-minGW32.stderr
@@ -14,30 +14,18 @@
 [NO_PID]: sqlca: code: 0, state: 0
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 0
-[NO_PID]: ECPGconnect: opening database  on localhost port   for user regress_ecpg_user2
-[NO_PID]: sqlca: code: 0, state: 0
-[NO_PID]: ECPGconnect: connection to server failed: FATAL:  database "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: opening database  on localhost port   for user ecpg2_regression
 [NO_PID]: sqlca: code: 0, state: 0
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 

Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-08 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Jun 08, 2021 at 11:21:34AM -0400, Tom Lane wrote:
>> IIUC, what you are proposing to do is replace pgwin32_setenv with
>> src/port/setenv.c.  I don't think that's an improvement.  setenv.c
>> leaks memory on repeat calls, because it cannot know what
>> pgwin32_setenv knows about how putenv works on that platform.

> Is gaur the only animal that needs this file, by the way?

I think it is.  setenv has been in POSIX for awhile, so probably
only very old systems would need that.  (This is why I don't care
that much that setenv.c leaks memory.  But we can't start using it
on platforms where we *do* care about performance.)

>> (3) Don't try to use the environment variable for this
>> purpose.  I'd originally tried to change test5.pgc to just
>> specify gssmode=disable in-line, but that only works
>> nicely for one of the two failing cases.  The other one
>> is testing the case of a completely defaulted connection
>> target, so there's no place to add an option without
>> breaking the only unique aspect of that test case.

> FWIW, I'd be rather in favor of doing (3) because this remains simple
> just to take care of an edge case, even if that partially breaks the
> promise to rely on a default connection.

Yeah, it doesn't seem like we need to test that case all that
badly.  I'd be okay with dropping that test; or maybe we could
fix things so that the default case succeeds?

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-08 Thread Michael Paquier
On Tue, Jun 08, 2021 at 11:21:34AM -0400, Tom Lane wrote:
> IIUC, what you are proposing to do is replace pgwin32_setenv with
> src/port/setenv.c.  I don't think that's an improvement.  setenv.c
> leaks memory on repeat calls, because it cannot know what
> pgwin32_setenv knows about how putenv works on that platform.

Is gaur the only animal that needs this file, by the way?

> (1) allow just this one ECPG test to include port.h (or
> probably c.h).  However, there's a whole other can of worms
> there, which is that I wonder if we aren't doing it wrong
> on the Unix side by linking libpgport when we shouldn't.
> We've not been bit by that yet, but I wonder if it isn't
> just a matter of time.  The MSVC build, by not linking
> those libs in the first place, is really doing this the
> correct way.

I don't really want to include this stuff in the ECPG tests just to
bypass an environment configuration.

> (2) Let pg_regress_ecpg.c pass down the environment setting.
> 
> (3) Don't try to use the environment variable for this
> purpose.  I'd originally tried to change test5.pgc to just
> specify gssmode=disable in-line, but that only works
> nicely for one of the two failing cases.  The other one
> is testing the case of a completely defaulted connection
> target, so there's no place to add an option without
> breaking the only unique aspect of that test case.

> (2) is starting to seem attractive now that we've seen
> the downsides of (1) and (3).

FWIW, I'd be rather in favor of doing (3) because this remains simple
just to take care of an edge case, even if that partially breaks the
promise to rely on a default connection.

(4) would be to revisit the decision to make libpq report all the
errors stored in its stack with multiple attempts.  That would bring
back the buildfarm to green at least, and we still need to take a
decision about that for 14 anyway as it involves a compatibility
breakage.  But I agree that we also should do something for 12~ for
those tests.

> (BTW, I just noticed that regress.c is unsetenv'ing the
> SSL connection environment variables, but not the GSS ones.
> Seens like that needs work similar to 8279f68a1.)

Yes, I saw that I was able to break things is many fancy ways when
working on 8279f68a1, but the list of parameters to reset needs to
diverge a bit compared to the TAP tests.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-08 Thread Tom Lane
Michael Paquier  writes:
> Now, I also see that using pgwin32_setenv() instead of
> src/port/setenv.c causes cl to be confused once we update
> ecpg_regression.proj because it cannot find setenv().  Bringing the
> question, why is it necessary to have both setenv.c and
> pgwin32_setenv() on HEAD?  setenv.c should be enough once you have the
> fallback implementation of putenv() available.

IIUC, what you are proposing to do is replace pgwin32_setenv with
src/port/setenv.c.  I don't think that's an improvement.  setenv.c
leaks memory on repeat calls, because it cannot know what
pgwin32_setenv knows about how putenv works on that platform.

It'd be okay to do it like that for the ECPG tests, perhaps,
because we don't really care about small leaks in those.
But I don't want it to happen across-the-board.

Thinking more, the real problem is that use of libpgport
goes hand-in-hand with #including port.h; it's not going
to work real well if you do one without the other.
And I don't think we want to include port.h in the ECPG
test programs, because those are trying to model the
environment that typical user applications see.

Alternatives seem to be

(1) allow just this one ECPG test to include port.h (or
probably c.h).  However, there's a whole other can of worms
there, which is that I wonder if we aren't doing it wrong
on the Unix side by linking libpgport when we shouldn't.
We've not been bit by that yet, but I wonder if it isn't
just a matter of time.  The MSVC build, by not linking
those libs in the first place, is really doing this the
correct way.

(2) Let pg_regress_ecpg.c pass down the environment setting.

(3) Don't try to use the environment variable for this
purpose.  I'd originally tried to change test5.pgc to just
specify gssmode=disable in-line, but that only works
nicely for one of the two failing cases.  The other one
is testing the case of a completely defaulted connection
target, so there's no place to add an option without
breaking the only unique aspect of that test case.

(2) is starting to seem attractive now that we've seen
the downsides of (1) and (3).

(BTW, I just noticed that regress.c is unsetenv'ing the
SSL connection environment variables, but not the GSS ones.
Seens like that needs work similar to 8279f68a1.)

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-07 Thread Michael Paquier
On Mon, Jun 07, 2021 at 10:38:03AM -0400, Tom Lane wrote:
> Hmm.  We do include "-lpgcommon -lpgport" when building the ecpg test
> programs on Unix, so I'd assumed that the MSVC scripts did the same.
> Is there a good reason not to make them do so?

I was looking at that this morning, and yes we need to add more
references here.  Actually, adding only libpgport.lib allows the
compilation and the tests to work, but I agree to add also
libpgcommon.lib so as we don't fall into the same compilation trap
again in the future.

Now, I also see that using pgwin32_setenv() instead of
src/port/setenv.c causes cl to be confused once we update
ecpg_regression.proj because it cannot find setenv().  Bringing the
question, why is it necessary to have both setenv.c and
pgwin32_setenv() on HEAD?  setenv.c should be enough once you have the
fallback implementation of putenv() available.

Attached is the patch I am finishing with, that also brings all this
stuff closer to what I did in 12 and 13 for hamerkop.  The failing
test is passing for me now with MSVC and GSSAPI builds.

Thoughts?
--
Michael
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 05c5a53442..e25f65b054 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -490,11 +490,9 @@ extern void _dosmaperr(unsigned long);
 
 /* in port/win32env.c */
 extern int	pgwin32_putenv(const char *);
-extern int	pgwin32_setenv(const char *name, const char *value, int overwrite);
 extern int	pgwin32_unsetenv(const char *name);
 
 #define putenv(x) pgwin32_putenv(x)
-#define setenv(x,y,z) pgwin32_setenv(x,y,z)
 #define unsetenv(x) pgwin32_unsetenv(x)
 
 /* in port/win32security.c */
diff --git a/src/port/win32env.c b/src/port/win32env.c
index a03556078c..c8d43af381 100644
--- a/src/port/win32env.c
+++ b/src/port/win32env.c
@@ -1,7 +1,7 @@
 /*-
  *
  * win32env.c
- *	  putenv(), setenv(), and unsetenv() for win32.
+ *	  putenv() and unsetenv() for win32.
  *
  * These functions update both the process environment and caches in
  * (potentially multiple) C run-time library (CRT) versions.
@@ -117,35 +117,6 @@ pgwin32_putenv(const char *envval)
 	return _putenv(envval);
 }
 
-int
-pgwin32_setenv(const char *name, const char *value, int overwrite)
-{
-	int			res;
-	char	   *envstr;
-
-	/* Error conditions, per POSIX */
-	if (name == NULL || name[0] == '\0' || strchr(name, '=') != NULL ||
-		value == NULL)
-	{
-		errno = EINVAL;
-		return -1;
-	}
-
-	/* No work if variable exists and we're not to replace it */
-	if (overwrite == 0 && getenv(name) != NULL)
-		return 0;
-
-	envstr = (char *) malloc(strlen(name) + strlen(value) + 2);
-	if (!envstr)/* not much we can do if no memory */
-		return -1;
-
-	sprintf(envstr, "%s=%s", name, value);
-
-	res = pgwin32_putenv(envstr);
-	free(envstr);
-	return res;
-}
-
 int
 pgwin32_unsetenv(const char *name)
 {
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index e712fa8778..f7263935ce 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -18,6 +18,9 @@ exec sql begin declare section;
 	char *user="regress_ecpg_user1";
 exec sql end declare section;
 
+	/* disable GSSENC to ensure stability of connection failure reports */
+	setenv("PGGSSENCMODE", "disable", 1);
+
 	ECPGdebug(1, stderr);
 
 	exec sql connect to ecpg2_regression as main;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index 6ae5b589de..a86a5e4331 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -38,37 +38,33 @@ main(void)
 #line 19 "test5.pgc"
 
 
+	/* disable GSSENC to ensure stability of connection failure reports */
+	setenv("PGGSSENCMODE", "disable", 1);
+
 	ECPGdebug(1, stderr);
 
 	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , NULL, NULL , "main", 0); }
-#line 23 "test5.pgc"
-
-	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user2 encrypted password 'insecure'", ECPGt_EOIT, ECPGt_EORT);}
-#line 24 "test5.pgc"
-
-	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user1 encrypted password 'connectpw'", ECPGt_EOIT, ECPGt_EORT);}
-#line 25 "test5.pgc"
-
-	{ ECPGtrans(__LINE__, NULL, "commit");}
 #line 26 "test5.pgc"
 
-	{ ECPGdisconnect(__LINE__, "CURRENT");}
+	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user2 encrypted password 'insecure'", ECPGt_EOIT, ECPGt_EORT);}
 #line 27 "test5.pgc"
+
+	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user1 encrypted password 'connectpw'", ECPGt_EOIT, ECPGt_EORT);}
+#line 28 "test5.pgc"
+
+	{ ECPGtrans(__LINE__, NULL, "commit");}
+#line 29 "test5.pgc"
+
+	{ ECPGdisconnect(__LINE__, "CURRENT");}
+#line 30 "test5.pgc"
   /* <-- "main" not specified */
 
 	

Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-07 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Jun 06, 2021 at 05:27:49PM -0400, Tom Lane wrote:
>> So I took
>> a look at disabling GSSENC in these test cases to try to silence
>> hamerkop's test failure that way.  Here's a proposed patch.
>> It relies on setenv() being available, but I think that's fine
>> because we link the ECPG test programs with libpgport.

> No, that's not it.  The compilation of the tests happens when
> triggering the tests as of ecpgcheck() in vcregress.pl so I think that
> this is going to fail.  This requires at least the addition of a
> reference to libpgport in ecpg_regression.proj, perhaps more.

Hmm.  We do include "-lpgcommon -lpgport" when building the ecpg test
programs on Unix, so I'd assumed that the MSVC scripts did the same.
Is there a good reason not to make them do so?

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-07 Thread Michael Paquier
On Sun, Jun 06, 2021 at 05:27:49PM -0400, Tom Lane wrote:
> It seems like nobody's terribly interested in figuring out why
> pg_GSS_have_cred_cache() is misbehaving on Windows.

I have been investigating that for a couple of hours in total, but
nothing to report yet.

> So I took
> a look at disabling GSSENC in these test cases to try to silence
> hamerkop's test failure that way.  Here's a proposed patch.
> It relies on setenv() being available, but I think that's fine
> because we link the ECPG test programs with libpgport.

No, that's not it.  The compilation of the tests happens when
triggering the tests as of ecpgcheck() in vcregress.pl so I think that
this is going to fail.  This requires at least the addition of a
reference to libpgport in ecpg_regression.proj, perhaps more.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-06 Thread Tom Lane
It seems like nobody's terribly interested in figuring out why
pg_GSS_have_cred_cache() is misbehaving on Windows.  So I took
a look at disabling GSSENC in these test cases to try to silence
hamerkop's test failure that way.  Here's a proposed patch.
It relies on setenv() being available, but I think that's fine
because we link the ECPG test programs with libpgport.

regards, tom lane

diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index e712fa8778..f7263935ce 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -18,6 +18,9 @@ exec sql begin declare section;
 	char *user="regress_ecpg_user1";
 exec sql end declare section;
 
+	/* disable GSSENC to ensure stability of connection failure reports */
+	setenv("PGGSSENCMODE", "disable", 1);
+
 	ECPGdebug(1, stderr);
 
 	exec sql connect to ecpg2_regression as main;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index 6ae5b589de..a86a5e4331 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -38,37 +38,33 @@ main(void)
 #line 19 "test5.pgc"
 
 
+	/* disable GSSENC to ensure stability of connection failure reports */
+	setenv("PGGSSENCMODE", "disable", 1);
+
 	ECPGdebug(1, stderr);
 
 	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , NULL, NULL , "main", 0); }
-#line 23 "test5.pgc"
+#line 26 "test5.pgc"
 
 	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user2 encrypted password 'insecure'", ECPGt_EOIT, ECPGt_EORT);}
-#line 24 "test5.pgc"
+#line 27 "test5.pgc"
 
 	{ ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "alter user regress_ecpg_user1 encrypted password 'connectpw'", ECPGt_EOIT, ECPGt_EORT);}
-#line 25 "test5.pgc"
+#line 28 "test5.pgc"
 
 	{ ECPGtrans(__LINE__, NULL, "commit");}
-#line 26 "test5.pgc"
+#line 29 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "CURRENT");}
-#line 27 "test5.pgc"
+#line 30 "test5.pgc"
   /* <-- "main" not specified */
 
 	strcpy(db, "ecpg2_regression");
 	strcpy(id, "main");
 	{ ECPGconnect(__LINE__, 0, db , NULL, NULL , id, 0); }
-#line 31 "test5.pgc"
-
-	{ ECPGdisconnect(__LINE__, id);}
-#line 32 "test5.pgc"
-
-
-	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , NULL, NULL , "main", 0); }
 #line 34 "test5.pgc"
 
-	{ ECPGdisconnect(__LINE__, "main");}
+	{ ECPGdisconnect(__LINE__, id);}
 #line 35 "test5.pgc"
 
 
@@ -86,21 +82,21 @@ main(void)
 #line 41 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "" , "regress_ecpg_user2" , "insecure" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , NULL, NULL , "main", 0); }
 #line 43 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 44 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "" , "regress_ecpg_user2" , "insecure" , "main", 0); }
 #line 46 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 47 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 49 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
@@ -114,48 +110,55 @@ main(void)
 #line 53 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression" , user , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 55 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 56 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression" , user , "connectpw" , "main", 0); }
 #line 58 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 59 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://200.46.204.71/ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 61 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 62 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/" , "regress_ecpg_user2" , "insecure" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://200.46.204.71/ecpg2_regression" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 64 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
 #line 65 "test5.pgc"
 
 
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/" , "regress_ecpg_user2" , "insecure" , "main", 

Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-06-01 Thread Michael Paquier
On Mon, May 31, 2021 at 09:07:38PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>>> Agreed, that seems bogus.
> 
>> There may be others, and I have not checked yet.  I'd rather do a
>> backpatch for this part, would you agree?
> 
> +1

Playing with all those variables and broken values here and there, I
have been able to break a bunch of tests.  Most of the failures were
in the authentication and SSL tests, but there were also fancier
cases.  For example, PGCLIENTENCODING would cause a failure with
pg_ctl, for any TAP test.

I got surprised that enforcing values for most of the PGSSL* ones did
not cause a failure when it came to the certs, CRLs keys and root
certs now.  Still, I think that we'd be safer to cancel these as
well.

Attached is the list I am finishing with.  I'd like to fix that, so
please let me know if there are any comments or objections.
--
Michael
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index d6c3eb8723..d4f9fc5f2b 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -104,13 +104,30 @@ BEGIN
 	delete $ENV{LC_ALL};
 	$ENV{LC_MESSAGES} = 'C';
 
+	delete $ENV{PGCHANNELBINDING};
+	delete $ENV{PGCLIENTENCODING};
 	delete $ENV{PGCONNECT_TIMEOUT};
 	delete $ENV{PGDATA};
 	delete $ENV{PGDATABASE};
+	delete $ENV{PGGSSENCMODE};
+	delete $ENV{PGGSSLIB};
 	delete $ENV{PGHOSTADDR};
+	delete $ENV{PGKRBSRVNAME};
+	delete $ENV{PGPASSFILE};
+	delete $ENV{PGPASSWORD};
+	delete $ENV{PGREQUIREPEER};
 	delete $ENV{PGREQUIRESSL};
 	delete $ENV{PGSERVICE};
+	delete $ENV{PGSERVICEFILE};
+	delete $ENV{PGSSLCERT};
+	delete $ENV{PGSSLCRL};
+	delete $ENV{PGSSLCRLDIR};
+	delete $ENV{PGSSLKEY};
+	delete $ENV{PGSSLMAXPROTOCOLVERSION};
+	delete $ENV{PGSSLMINPROTOCOLVERSION};
 	delete $ENV{PGSSLMODE};
+	delete $ENV{PGSSLROOTCERT};
+	delete $ENV{PGSSLSNI};
 	delete $ENV{PGUSER};
 	delete $ENV{PGPORT};
 	delete $ENV{PGHOST};


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-31 Thread Tom Lane
Michael Paquier  writes:
>>> Another thing that strikes me as incorrect is that we don't unset
>>> PGGSSENCMODE or PGGSSLIB in TestLib.pm.  Just noting it on the way..

>> Agreed, that seems bogus.

> There may be others, and I have not checked yet.  I'd rather do a
> backpatch for this part, would you agree?

+1

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-31 Thread Michael Paquier
On Mon, May 31, 2021 at 09:36:20AM -0400, Tom Lane wrote:
> Interesting --- I was considering running such a test locally, but
> didn't get round to it yet.

Just to be clear, that's my Windows dev box.

> It seems like the ideal solution would be to make pg_GSS_have_cred_cache
> smarter, so that we don't attempt a GSS connection cycle here.

I am not sure yet what would be adapted here.  That requires diving a
bit into the upstream code.

>> Another thing that strikes me as incorrect is that we don't unset
>> PGGSSENCMODE or PGGSSLIB in TestLib.pm.  Just noting it on the way..
>
> Agreed, that seems bogus.

There may be others, and I have not checked yet.  I'd rather do a
backpatch for this part, would you agree?
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-31 Thread Tom Lane
Michael Paquier  writes:
> On Mon, May 31, 2021 at 12:05:12AM -0400, Tom Lane wrote:
>> What is not clear is why GSS is acting that way.  We wouldn't
>> have tried a GSS connection unless pg_GSS_have_cred_cache
>> succeeded ... so how come that worked but then gss_init_sec_context
>> complained "Credential cache is empty"?

> I suspect that's just the way the upstream installation works with a
> credentials cache created from the beginning, as I see the same
> behavior and the same error on my own host for HEAD with a KRB5 server
> set up once the upstream installation runs.

Interesting --- I was considering running such a test locally, but
didn't get round to it yet.

> Leaving the specific
> topic of this thread aside for one moment, would there be an argument
> for just enforcing gssencmode=disable in this set of tests down to 12?

It seems like the ideal solution would be to make pg_GSS_have_cred_cache
smarter, so that we don't attempt a GSS connection cycle here.  But if
we can't, adding gssencmode=disable to these test cases is what I was
thinking about, too.

> Another thing that strikes me as incorrect is that we don't unset
> PGGSSENCMODE or PGGSSLIB in TestLib.pm.  Just noting it on the way..

Agreed, that seems bogus.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-31 Thread Michael Paquier
On Mon, May 31, 2021 at 12:05:12AM -0400, Tom Lane wrote:
> Yeah, I was looking at that earlier today.  Evidently libpq is
> trying a GSS-encrypted connection, and that doesn't work, so
> it falls back to a regular connection where we get the expected
> error.  Probably all the connections in this test are hitting the
> GSS failure, but only the ones where the second attempt fails
> show a visible issue.

Yep.  This wastes cycles.

> What is not clear is why GSS is acting that way.  We wouldn't
> have tried a GSS connection unless pg_GSS_have_cred_cache
> succeeded ... so how come that worked but then gss_init_sec_context
> complained "Credential cache is empty"?
> 
> My rough guess is that Windows has implemented the GSS APIs in
> such a way that what pg_GSS_have_cred_cache is testing isn't
> sufficient to tell whether a sane credential actually exists.
> 
> Or there's something particularly weird about how hamerkop
> is set up.

I suspect that's just the way the upstream installation works with a
credentials cache created from the beginning, as I see the same
behavior and the same error on my own host for HEAD with a KRB5 server
set up once the upstream installation runs.  Leaving the specific
topic of this thread aside for one moment, would there be an argument
for just enforcing gssencmode=disable in this set of tests down to 12?
It is worth noting that the problem does not show up in 12 and 13 once
the compilation works, because we just mask the error there, but the
code path is still taken.

Another thing that strikes me as incorrect is that we don't unset
PGGSSENCMODE or PGGSSLIB in TestLib.pm.  Just noting it on the way..
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-30 Thread Tom Lane
Michael Paquier  writes:
> In my recent quest to look at GSSAPI builds on Windows, I have bumped
> into another failure that's related to this thread.  hamerkop
> summarizes the situation here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop=2021-05-29%2010%3A15%3A42
> There are two failures like this one as errorMessage piles up on
> failures, as of connect/test5:
> -[NO_PID]: ECPGconnect: connection to server failed: FATAL:  database
>  "regress_ecpg_user2" does not exist
> +[NO_PID]: ECPGconnect: connection to server failed: could not
>  initiate GSSAPI security context: Unspecified GSS failure.  Minor
>  code may provide more information: Credential cache is empty
> +connection to server failed: FATAL:  database "regress_ecpg_user2"
>  does not exist 

Yeah, I was looking at that earlier today.  Evidently libpq is
trying a GSS-encrypted connection, and that doesn't work, so
it falls back to a regular connection where we get the expected
error.  Probably all the connections in this test are hitting the
GSS failure, but only the ones where the second attempt fails
show a visible issue.

What is not clear is why GSS is acting that way.  We wouldn't
have tried a GSS connection unless pg_GSS_have_cred_cache
succeeded ... so how come that worked but then gss_init_sec_context
complained "Credential cache is empty"?

My rough guess is that Windows has implemented the GSS APIs in
such a way that what pg_GSS_have_cred_cache is testing isn't
sufficient to tell whether a sane credential actually exists.

Or there's something particularly weird about how hamerkop
is set up.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-30 Thread Michael Paquier
On Mon, May 31, 2021 at 11:00:55AM +0900, Michael Paquier wrote:
> Yeah.  On the contrary, it could be confusing if one sees an error
> message but there is nothing to worry about, because things are
> working in the scope of what the user wanted at connection time.

In my recent quest to look at GSSAPI builds on Windows, I have bumped
into another failure that's related to this thread.  hamerkop
summarizes the situation here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop=2021-05-29%2010%3A15%3A42

There are two failures like this one as errorMessage piles up on
failures, as of connect/test5:
-[NO_PID]: ECPGconnect: connection to server failed: FATAL:  database
 "regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: connection to server failed: could not
 initiate GSSAPI security context: Unspecified GSS failure.  Minor
 code may provide more information: Credential cache is empty
+connection to server failed: FATAL:  database "regress_ecpg_user2"
 does not exist 
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-30 Thread Michael Paquier
On Sun, May 30, 2021 at 08:25:00PM -0500, Justin Pryzby wrote:
> ..But I think it's not useful to put details into errorMessage on success,
> unless you're going to document that.  It would never have occurred to me to
> look there, or that it would even be safe.

Yeah.  On the contrary, it could be confusing if one sees an error
message but there is nothing to worry about, because things are
working in the scope of what the user wanted at connection time.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-30 Thread Justin Pryzby
On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > 52a10224 broke sqlsmith, of all things.
> 
> > It was using errmsg as a test of success, instead of checking if the 
> > connection
> > result wasn't null:
> 
> >  conn = PQconnectdb(conninfo.c_str());
> >  char *errmsg = PQerrorMessage(conn);
> >  if (strlen(errmsg))
> >  throw dut::broken(errmsg, "08001");
> 
> > That's clearly the wrong thing to do, but maybe this should be described in 
> > the
> > release notes as a compatibility issue, in case other people had the same 
> > idea.
> > Clearing errorMessage during success is an option..
> 
> Hm.  I'd debated whether to clear conn->errorMessage at the end of
> a successful connection sequence, and decided not to on the grounds
> that it might be interesting info (eg it could tell you why you
> ended up connected to server Y and not server X).  But perhaps
> it's too much of a compatibility break for this small benefit.

I don't care if applications break because they check the errorMessage instead
of the return value...

..But I think it's not useful to put details into errorMessage on success,
unless you're going to document that.  It would never have occurred to me to
look there, or that it would even be safe.

-- 
Justin




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-17 Thread Daniel Gustafsson
> On 11 May 2021, at 09:29, Michael Paquier  wrote:

> FWIW, I think that the case of getting some information about any
> failed connections while a connection has been successfully made
> within the scope of the connection string parameters provided by the
> user is rather thin, and I really feel that this is going to cause
> more pain to users than this is worth it.  So my vote would be to
> clean up conn->errorMessage after a successful connection.

Agreed, given how conservative we typically are with backwards compatibility it
seems a too thin benefit to warrant potential breakage.

My vote would too be to restore the behavior by clearing conn->errorMessage.

--
Daniel Gustafsson   https://vmware.com/



Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-11 Thread Michael Paquier
On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:
> Hm.  I'd debated whether to clear conn->errorMessage at the end of
> a successful connection sequence, and decided not to on the grounds
> that it might be interesting info (eg it could tell you why you
> ended up connected to server Y and not server X).  But perhaps
> it's too much of a compatibility break for this small benefit.
> 
> I'm curious though why it took this long for anyone to complain.
> I'd supposed that people were running sqlsmith against HEAD on
> a pretty regular basis.

FWIW, I think that the case of getting some information about any
failed connections while a connection has been successfully made
within the scope of the connection string parameters provided by the
user is rather thin, and I really feel that this is going to cause
more pain to users than this is worth it.  So my vote would be to
clean up conn->errorMessage after a successful connection.

Now, I would not mind either if we finish by taking a decision here
after beta1, to see if there are actual complains on the matter based
on the feedback we get.
--
Michael


signature.asc
Description: PGP signature


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-06 Thread Andreas Seltenreich
> On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:
>> I'm curious though why it took this long for anyone to complain.
>> I'd supposed that people were running sqlsmith against HEAD on
>> a pretty regular basis.

Last time I ran it was November 27.  I'm neglecting it on my spare time
and there is hardly any opportunity to sneak it onto my agenda at work.
I'll do my best to try to get either of these fixed.

Justin Pryzby writes:
> I think it's also becase sqlsmith would need to run against the v14 *client*
> library.  I don't know about anyone else's workflow, but I tend not to "make
> install", but work with binaries out of ./tmp_install.

My playbooks don't grab the client libraries of the test target either.
I'll change them.

> There's a few changes needed on sqlsmith HEAD, but I guess nobody would have
> gotten that far if the connection was failing (or rather, detected as such).

Thanks for the patch.

regards,
andreas




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-06 Thread Justin Pryzby
On Thu, May 06, 2021 at 01:22:27PM -0400, Tom Lane wrote:
> I'm curious though why it took this long for anyone to complain.
> I'd supposed that people were running sqlsmith against HEAD on
> a pretty regular basis.

I think it's also becase sqlsmith would need to run against the v14 *client*
library.  I don't know about anyone else's workflow, but I tend not to "make
install", but work with binaries out of ./tmp_install.

There's a few changes needed on sqlsmith HEAD, but I guess nobody would have
gotten that far if the connection was failing (or rather, detected as such).

diff --git a/grammar.cc b/grammar.cc
index 62aa8e9..76491ff 100644
--- a/grammar.cc
+++ b/grammar.cc
@@ -327,7 +327,11 @@ query_spec::query_spec(prod *p, struct scope *s, bool 
lateral) :
 
   search = bool_expr::factory(this);
 
-  if (d6() > 2) {
+  if (d6() > 4) {
+ostringstream cons;
+cons << "order by 1 fetch first " << d100() + d100() << " rows with ties";
+limit_clause = cons.str();
+  } else if (d6() > 2) {
 ostringstream cons;
 cons << "limit " << d100() + d100();
 limit_clause = cons.str();
diff --git a/postgres.cc b/postgres.cc
index f2a3627..1c0c55f 100644
--- a/postgres.cc
+++ b/postgres.cc
@@ -30,6 +30,7 @@ bool pg_type::consistent(sqltype *rvalue)
   case 'c': /* composite type */
   case 'd': /* domain */
   case 'r': /* range */
+  case 'm': /* multirange */
   case 'e': /* enum */
 return this == t;
 




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-06 Thread Tom Lane
Justin Pryzby  writes:
> 52a10224 broke sqlsmith, of all things.

> It was using errmsg as a test of success, instead of checking if the 
> connection
> result wasn't null:

>  conn = PQconnectdb(conninfo.c_str());
>  char *errmsg = PQerrorMessage(conn);
>  if (strlen(errmsg))
>  throw dut::broken(errmsg, "08001");

> That's clearly the wrong thing to do, but maybe this should be described in 
> the
> release notes as a compatibility issue, in case other people had the same 
> idea.
> Clearing errorMessage during success is an option..

Hm.  I'd debated whether to clear conn->errorMessage at the end of
a successful connection sequence, and decided not to on the grounds
that it might be interesting info (eg it could tell you why you
ended up connected to server Y and not server X).  But perhaps
it's too much of a compatibility break for this small benefit.

I'm curious though why it took this long for anyone to complain.
I'd supposed that people were running sqlsmith against HEAD on
a pretty regular basis.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-06 Thread Justin Pryzby
On Sun, Jan 10, 2021 at 05:38:50PM -0500, Tom Lane wrote:
> I wrote:
> > The problems that I see in this area are first that there's no
> > real standardization in libpq as to whether to append error messages
> > together or just flush preceding messages; and second that no effort
> > is made in multi-connection-attempt cases to separate the errors from
> > different attempts, much less identify which error goes with which
> > host or IP address.  I think we really need to put some work into
> > that.
> 
> I spent some time on this, and here is a patch set that tries to
> improve matters.
> 
> 0001 changes the libpq coding rules to be that all error messages should
> be appended to conn->errorMessage, never overwritten (there are a couple
> of exceptions in fe-lobj.c) and we reset conn->errorMessage to empty only
> at the start of a connection request or new query.  This is something
> that's been bugging me for a long time and I'm glad to get it cleaned up.
> Formerly it seemed that a dartboard had been used to decide whether to use
> "printfPQExpBuffer" or "appendPQExpBuffer"; now it's always the latter.
> We can also get rid of several hacks that were used to get around the
> mess and force appending behavior.
> 
> 0002 then changes the reporting rules in fe-connect.c as I suggested,
> so that you might get errors like this:
> 
> $ psql -h localhost,/tmp -p 12345
> psql: error: could not connect to host "localhost" (::1), port 12345: 
> Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> could not connect to host "localhost" (127.0.0.1), port 12345: Connection 
> refused
> Is the server running on that host and accepting TCP/IP connections?
> could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
> Is the server running locally and accepting connections on that 
> socket?
> 
> and we have a pretty uniform rule that errors coming back from a
> connection attempt will be prefixed with the server address.

52a10224 broke sqlsmith, of all things.

It was using errmsg as a test of success, instead of checking if the connection
result wasn't null:

 conn = PQconnectdb(conninfo.c_str());
 char *errmsg = PQerrorMessage(conn);
 if (strlen(errmsg))
 throw dut::broken(errmsg, "08001");

That's clearly the wrong thing to do, but maybe this should be described in the
release notes as a compatibility issue, in case other people had the same idea.
Clearing errorMessage during success is an option..

-- 
Justin




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-01-11 Thread Tom Lane
Hubert Zhang  writes:
> As for patch 0004, find ':' after "could not connect to" may failed when 
> error message like:
> "could not connect to host "localhost" (::1), port 12345: Connection 
> refused", where p2 will point to "::1" instead of ": Connection refused". But 
> since it's only used for test case, we don't need to filter the error message 
> precisely.

Excellent point, and I think that could happen on a Windows installation.
We can make it look for ": " instead of just ':', and that'll reduce the
odds of trouble.

regards, tom lane




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-01-11 Thread Hubert Zhang
Hi Tom,

I agree to get detailed error message for each failed host as your patch 0001.

As for patch 0004, find ':' after "could not connect to" may failed when error 
message like:
"could not connect to host "localhost" (::1), port 12345: Connection refused", 
where p2 will point to "::1" instead of ": Connection refused". But since it's 
only used for test case, we don't need to filter the error message precisely.

```
ecpg_filter_stderr(const char *resultfile, const char *tmpfile)
{
..
char   *p1 = strstr(linebuf.data, "could not connect to ");
if (p1)
{
char   *p2 = strchr(p1, ':');
if (p2)
memmove(p1 + 17, p2, strlen(p2) + 1);
}
}
```

Thanks,
Hubert


From: Tom Lane 
Sent: Monday, January 11, 2021 10:56 AM
To: Hubert Zhang 
Cc: tsunakawa.ta...@fujitsu.com ; 
pgsql-hack...@postgresql.org 
Subject: Re: Multiple hosts in connection string failed to failover in non-hot 
standby mode

I wrote:
> I feel pretty good about 0001: it might be committable as-is.  0002 is
> probably subject to bikeshedding, plus it has a problem in the ECPG tests.
> Two of the error messages are now unstable because they expose
> chosen-at-random socket paths:
> ...
> I don't have any non-hacky ideas what to do about that.  The extra detail
> seems useful to end users, but we don't have any infrastructure that
> would allow filtering it out in the ECPG tests.

So far the only solution that comes to mind is to introduce some
infrastructure to do that filtering.  0001-0003 below are unchanged,
0004 patches up the ecpg test framework with a rather ad-hoc filtering
function.  I'd feel worse about this if there weren't already a very
ad-hoc filtering function there ;-)

This set passes check-world for me; we'll soon see what the cfbot
thinks.

regards, tom lane



Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-01-10 Thread Tom Lane
I wrote:
> The problems that I see in this area are first that there's no
> real standardization in libpq as to whether to append error messages
> together or just flush preceding messages; and second that no effort
> is made in multi-connection-attempt cases to separate the errors from
> different attempts, much less identify which error goes with which
> host or IP address.  I think we really need to put some work into
> that.

I spent some time on this, and here is a patch set that tries to
improve matters.

0001 changes the libpq coding rules to be that all error messages should
be appended to conn->errorMessage, never overwritten (there are a couple
of exceptions in fe-lobj.c) and we reset conn->errorMessage to empty only
at the start of a connection request or new query.  This is something
that's been bugging me for a long time and I'm glad to get it cleaned up.
Formerly it seemed that a dartboard had been used to decide whether to use
"printfPQExpBuffer" or "appendPQExpBuffer"; now it's always the latter.
We can also get rid of several hacks that were used to get around the
mess and force appending behavior.

0002 then changes the reporting rules in fe-connect.c as I suggested,
so that you might get errors like this:

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to host "localhost" (::1), port 12345: 
Connection refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to host "localhost" (127.0.0.1), port 12345: Connection 
refused
Is the server running on that host and accepting TCP/IP connections?
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory
Is the server running locally and accepting connections on that socket?

and we have a pretty uniform rule that errors coming back from a
connection attempt will be prefixed with the server address.

Then 0003 is the part of your patch that I'm happy with.  Given 0001+0002
we could certainly consider looping back and retrying for more cases, but
I still want to tread lightly on that.  I don't see a lot of value in
retrying errors that seem to be on the client side, such as failure to
set socket properties; and in general I'm hesitant to add untestable
code paths here.

I feel pretty good about 0001: it might be committable as-is.  0002 is
probably subject to bikeshedding, plus it has a problem in the ECPG tests.
Two of the error messages are now unstable because they expose
chosen-at-random socket paths:

diff -U3 
/home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr 
/home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr
--- /home/postgres/pgsql/src/interfaces/ecpg/test/expected/connect-test5.stderr 
2020-08-04 14:59:45.617380050 -0400
+++ /home/postgres/pgsql/src/interfaces/ecpg/test/results/connect-test5.stderr  
2021-01-10 16:12:22.539433702 -0500
@@ -36,7 +36,7 @@
 [NO_PID]: sqlca: code: 0, state: 0
 [NO_PID]: ECPGconnect: opening database  on  port   
for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 0
-[NO_PID]: ECPGconnect: could not open database: FATAL:  database 
"regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket 
"/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL:  database "regress_ecpg_user2" 
does not exist
 
 [NO_PID]: sqlca: code: 0, state: 0
 [NO_PID]: ecpg_finish: connection main closed
@@ -73,7 +73,7 @@
 [NO_PID]: sqlca: code: -220, state: 08003
 [NO_PID]: ECPGconnect: opening database  on  port   
for user regress_ecpg_user2
 [NO_PID]: sqlca: code: 0, state: 0
-[NO_PID]: ECPGconnect: could not open database: FATAL:  database 
"regress_ecpg_user2" does not exist
+[NO_PID]: ECPGconnect: could not open database: could not connect to socket 
"/tmp/pg_regress-EbHubF/.s.PGSQL.58080": FATAL:  database "regress_ecpg_user2" 
does not exist
 
 [NO_PID]: sqlca: code: 0, state: 0
 [NO_PID]: ecpg_finish: connection main closed

I don't have any non-hacky ideas what to do about that.  The extra detail
seems useful to end users, but we don't have any infrastructure that
would allow filtering it out in the ECPG tests.

regards, tom lane

diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c
index b76f0befd0..8b60378379 100644
--- a/src/interfaces/libpq/fe-auth-scram.c
+++ b/src/interfaces/libpq/fe-auth-scram.c
@@ -208,13 +208,13 @@ pg_fe_scram_exchange(void *opaq, char *input, int inputlen,
 	{
 		if (inputlen == 0)
 		{
-			printfPQExpBuffer(>errorMessage,
+			appendPQExpBuffer(>errorMessage,
 			  libpq_gettext("malformed SCRAM message (empty message)\n"));
 			goto error;
 		}
 		if (inputlen != strlen(input))
 		{
-			printfPQExpBuffer(>errorMessage,
+			appendPQExpBuffer(>errorMessage,
 			  libpq_gettext("malformed SCRAM message (length mismatch)\n"));
 			goto error;
 		}
@@ -258,14 +258,14 @@ pg_fe_scram_exchange(void *opaq, char *input, 

Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-01-07 Thread Tom Lane
Hubert Zhang  writes:
> [ 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch ]

I took a quick look at this.  TBH, I'd just drop the first three hunks,
as they've got nothing to do with any failure mode that there's evidence
for in this thread or the prior one, and I'm afraid they're more likely
to create trouble than fix it.

As for the last hunk, why is it after rather than before the SSL/GSS
checks?  I doubt that retrying with/without SSL is going to change
a CANNOT_CONNECT_NOW result, unless maybe by slowing things down to
the point where recovery has finished ;-)

The bigger picture though is

(1) what set of failures should we retry on?  I think CANNOT_CONNECT_NOW
is reasonable, but are there others?

(2) what does this do to the quality of the error messages in cases
where all the connection attempts fail?

I think that error message quality was not thought too much about
in the original development of the multi-host feature, so to some
extent I'm asking you to clean up someone else's mess.  Nonetheless,
I feel that we do need to clean it up before we do things that risk
making it even more confusing.

The problems that I see in this area are first that there's no
real standardization in libpq as to whether to append error messages
together or just flush preceding messages; and second that no effort
is made in multi-connection-attempt cases to separate the errors from
different attempts, much less identify which error goes with which
host or IP address.  I think we really need to put some work into
that.  In some cases you can infer what happened from breadcrumbs
we already put into the text, for example

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to server: Connection refused
Is the server running on host "localhost" (::1) and accepting
TCP/IP connections on port 12345?
could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 12345?
could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.12345"?

but this doesn't seem particularly helpfully laid out to me, and we don't
provide the breadcrumbs at all for a lot of other error cases.

I'm vaguely imagining that we could do something more like

could not connect to host "localhost" (::1), port 12345: Connection refused
could not connect to host "localhost" (127.0.0.1), port 12345: Connection 
refused
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory

Not quite sure if the "Is the server running" hint is worth preserving.
We'd have to reword it quite a bit, and it'd be very duplicative.

The implementation of this might involve sticking the initial string
(up to the colon, in this example) into conn->errorMessage speculatively
as we try each host.  If we then append an error to it and go around
again, we're good.  If we successfully connect, then the contents of
conn->errorMessage don't matter.

regards, tom lane




RE: Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-28 Thread tsunakawa.ta...@fujitsu.com
From: Hubert Zhang  
> Hao Wu and I wrote a patch to fix this problem. Client side libpq should try 
> another hosts in connection string when it is rejected by a non-hot standby, 
> or the first host encounter some n/w problems during the libpq handshake.

Thank you.  Please add it to the November Commitfest.


> Thanks. Is this email in text format now? I just use outlook in chrome. Let 
> me know if it still in html format.

I'm afraid not.  The Outlook's title bar says that it's in HTML format.  I'm 
using Outlook 2016 client app on Windows 10.  I may have failed to convert my 
previous email to text, but it should be text format this time.


Regards
Takayuki Tsunakawa





Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-28 Thread Hubert Zhang
Was the primary running and accepting connections when you encountered this 
error?  That is, if you specified host="host1 host2", host1 was the non-hot 
standby and host2 was a running primary?  Or only the non-hot standby was 
running?

If a primary was running, I'd say it's a bug...  Perhaps the following part in 
libpq gives up connection attempts wen the above FATAL error is returned from 
the server.  Maybe libpq should differentiate errors using SQLSTATE and 
continue connection attempts on other hosts.
Yes, the primary was running, but non-hot standby is in front of the primary in 
connection string.
Hao Wu and I wrote a patch to fix this problem. Client side libpq should try 
another hosts in connection string when it is rejected by a non-hot standby, or 
the first host encounter some n/w problems during the libpq handshake.

Please send emails in text format.  Your email was in HTML, and I changed this 
reply to text format.
Thanks. Is this email in text format now? I just use outlook in chrome. Let me 
know if it still in html format.

Hubert & Hao Wu


From: tsunakawa.ta...@fujitsu.com 
Sent: Tuesday, October 27, 2020 5:30 PM
To: Hubert Zhang 
Cc: pgsql-hack...@postgresql.org 
Subject: RE: Multiple hosts in connection string failed to failover in non-hot 
standby mode

Please send emails in text format.  Your email was in HTML, and I changed this 
reply to text format.


From: Hubert Zhang 
> Libpq has supported to specify multiple hosts in connection string and enable 
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do 
> the failover with the following messages.
>
> psql: error: could not connect to server: FATAL:  the database system is 
> starting up

Was the primary running and accepting connections when you encountered this 
error?  That is, if you specified host="host1 host2", host1 was the non-hot 
standby and host2 was a running primary?  Or only the non-hot standby was 
running?

If a primary was running, I'd say it's a bug...  Perhaps the following part in 
libpq gives up connection attempts wen the above FATAL error is returned from 
the server.  Maybe libpq should differentiate errors using SQLSTATE and 
continue connection attempts on other hosts.

[fe-connect.c]
/* Handle errors. */
if (beresp == 'E')
{
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
...
#endif

goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */


Regards
Takayuki Tsunakawa



0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch
Description:  0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch


Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-28 Thread Noah Misch
On Tue, Oct 27, 2020 at 07:14:14AM +, Hubert Zhang wrote:
> Libpq has supported to specify multiple hosts in connection string and enable 
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do 
> the failover with the following messages.
> 
> psql: error: could not connect to server: FATAL:  the database system is 
> starting up
> 
> Document says ' If a connection is established successfully, but 
> authentication fails, the remaining hosts in the list are not tried.'
> I'm wondering is it a feature by design or a bug? If it's a bug, I plan to 
> fix it.

I felt it was a bug, but the community as a whole may or may not agree:
https://postgr.es/m/flat/16508-1a63222835164566%40postgresql.org




RE: Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-27 Thread tsunakawa.ta...@fujitsu.com
Please send emails in text format.  Your email was in HTML, and I changed this 
reply to text format.


From: Hubert Zhang  
> Libpq has supported to specify multiple hosts in connection string and enable 
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do 
> the failover with the following messages.
> 
> psql: error: could not connect to server: FATAL:  the database system is 
> starting up

Was the primary running and accepting connections when you encountered this 
error?  That is, if you specified host="host1 host2", host1 was the non-hot 
standby and host2 was a running primary?  Or only the non-hot standby was 
running?

If a primary was running, I'd say it's a bug...  Perhaps the following part in 
libpq gives up connection attempts wen the above FATAL error is returned from 
the server.  Maybe libpq should differentiate errors using SQLSTATE and 
continue connection attempts on other hosts.

[fe-connect.c]
/* Handle errors. */
if (beresp == 'E')
{
if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
...
#endif

goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */


Regards
Takayuki Tsunakawa





Multiple hosts in connection string failed to failover in non-hot standby mode

2020-10-27 Thread Hubert Zhang
Hi hackers,

Libpq has supported to specify multiple hosts in connection string and enable 
auto failover when the previous PostgreSQL instance cannot be accessed.
But when I tried to enable this feature for a non-hot standby, it cannot do the 
failover with the following messages.

psql: error: could not connect to server: FATAL:  the database system is 
starting up

Document says ' If a connection is established successfully, but authentication 
fails, the remaining hosts in the list are not tried.'
I'm wondering is it a feature by design or a bug? If it's a bug, I plan to fix 
it.

Thanks,
Hubert Zhang