Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-19 Thread Michael Paquier
On Mon, Apr 18, 2016 at 4:48 PM, Michael Paquier
 wrote:
> Another, perhaps, better idea is to add some more extra logic to
> pg_conn as follows:
> boolis_emergency;
> PGresult *emergency_result;
> boolis_emergency_consumed;
> So as when an OOM shows up, is_emergency is set to true. Then a first
> call of PQgetResult returns the PGresult with the OOM in
> emergency_result, setting is_emergency_consumed to true and switching
> is_emergency back to false. Then a second call to PQgetResult enforces
> the consumption of any results remaining without waiting for them, at
> the end returning NULL to the caller, resetting is_emergency_consumed
> to false. That's different compared to the extra failure
> PGASYNC_COPY_FAILED in the fact that it does not break PQisBusy().

So, in terms of code, it gives more or less the attached. I have
coupled the emergency_result with an enum flag emergency_status that
has 3 states:
- NONE, which is the default
- ENABLE, which is when an OOM or other error has been found in libpq.
Making the next call of PQgetResult return the emergency_result
- CONSUMED, set after PQgetResult is consuming emergency_result, to
return to caller NULL so as it can get out of any PQgetResult loop
expecting a result at the end of. Once NULL is returned, the flag is
switched back to NONE.
This works in the case of getCopyStart because the OOM failures are
happening before any messages are sent to server.

The checks for the emergency result are done in PQgetResult, so as any
upper-level routine gets the message. Tom, others, what do you think?
-- 
Michael
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 2621767..c5d7efd 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1689,6 +1689,30 @@ PQgetResult(PGconn *conn)
 	if (!conn)
 		return NULL;
 
+	if (!conn->emergency_result)
+	{
+		conn->emergency_result = PQmakeEmptyPGresult(conn, PGRES_FATAL_ERROR);
+		if (!conn->emergency_result)
+			return NULL;
+		conn->emergency_status = PGEMERGENCY_NONE;
+	}
+
+	/*
+	 * Check for any error that could have happened with the client.
+	 * Do this check before parsing any new commands.
+	 */
+	switch (conn->emergency_status)
+	{
+		case PGEMERGENCY_NONE:
+			break;
+		case PGEMERGENCY_ENABLE:
+			conn->emergency_status = PGEMERGENCY_CONSUMED;
+			return conn->emergency_result;
+		case PGEMERGENCY_CONSUMED:
+			conn->emergency_status = PGEMERGENCY_NONE;
+			return NULL;
+	}
+
 	/* Parse any available data, if our state permits. */
 	parseInput(conn);
 
@@ -1698,6 +1722,22 @@ PQgetResult(PGconn *conn)
 		int			flushResult;
 
 		/*
+		 * Check for any error that could have happened with
+		 * the client.
+		 */
+		switch (conn->emergency_status)
+		{
+			case PGEMERGENCY_NONE:
+break;
+			case PGEMERGENCY_ENABLE:
+conn->emergency_status = PGEMERGENCY_CONSUMED;
+return conn->emergency_result;
+			case PGEMERGENCY_CONSUMED:
+conn->emergency_status = PGEMERGENCY_NONE;
+return NULL;
+		}
+
+		/*
 		 * If data remains unsent, send it.  Else we might be waiting for the
 		 * result of a command the backend hasn't even got yet.
 		 */
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 0b8c62f..6d99d12 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -1453,7 +1453,7 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 
 	result = PQmakeEmptyPGresult(conn, copytype);
 	if (!result)
-		goto failure;
+		goto oom_failure;
 
 	if (pqGetc(&conn->copy_is_binary, conn))
 		goto failure;
@@ -1469,7 +1469,7 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 		result->attDescs = (PGresAttDesc *)
 			pqResultAlloc(result, nfields * sizeof(PGresAttDesc), TRUE);
 		if (!result->attDescs)
-			goto failure;
+			goto oom_failure;
 		MemSet(result->attDescs, 0, nfields * sizeof(PGresAttDesc));
 	}
 
@@ -1492,6 +1492,10 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 	conn->result = result;
 	return 0;
 
+oom_failure:
+	conn->emergency_status = PGEMERGENCY_ENABLE;
+	printfPQExpBuffer(&conn->errorMessage,
+	  libpq_gettext("out of memory\n"));
 failure:
 	PQclear(result);
 	return EOF;
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 1183323..e9dd167 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -223,6 +223,19 @@ typedef enum
 	PGASYNC_COPY_BOTH			/* Copy In/Out data transfer in progress */
 } PGAsyncStatusType;
 
+/*
+ * PGemergencyState defines the consumption status of emergency_result
+ * for clients-side error handling, particularly out-of-memory errors
+ * that could happen.
+ */
+typedef enum
+{
+	PGEMERGENCY_NONE,			/* emergency result allocated, not needed */
+	PGEMERGENCY_ENABLE,			/* ready to be consumed by client */
+	PGEMERGENCY_CONSUMED		/* consumed by client, next call querying
+ * for a result will get NULL. */
+} PGemergencyState;
+
 /* PGQueryC

Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-18 Thread Michael Paquier
On Wed, Apr 6, 2016 at 3:09 PM, Michael Paquier
 wrote:
> On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane  wrote:
>> I wrote:
>>> So the core of my complaint is that we need to fix things so that, whether
>>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>>> better consider the behavior when we cannot), ...
>>
>> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
>> the OOM limit is this possibility of being unable to create a PGresult
>> with which to inform the client that we failed.
>>
>> I wonder if we could make things better by keeping around an emergency
>> backup PGresult struct.  Something along these lines:
>>
>> 1. Add a field "PGresult *emergency_result" to PGconn.
>>
>> 2. At the very beginning of any PGresult-returning libpq function, check
>> to see if we have an emergency_result, and if not make one, ensuring
>> there's room in it for a reasonable-size error message; or maybe even
>> preload it with "out of memory" if we assume that's the only condition
>> it'll ever be used for.  If malloc fails at this point, just return NULL
>> without doing anything or changing any libpq state.  (Since a NULL result
>> is documented as possibly caused by OOM, this isn't violating any API.)
>>
>> 3. Subsequent operations never touch the emergency_result unless we're
>> up against an OOM, but it can be used to return a failure indication
>> to the client so long as we leave libpq in a state where additional
>> calls to PQgetResult would return NULL.
>>
>> Basically this shifts the point where an unreportable OOM could happen
>> from somewhere in the depths of libpq to the very start of an operation,
>> where we're presumably in a clean state and OOM failure doesn't leave
>> us with a mess we can't clean up.

Sorry for the late reply here. I am looking back at this thing more seriously.

So, if I understand that correctly. As the emergency result is
pre-allocated, this leverages any future OOM errors because we could
always fallback to this error in case of problems, so this would
indeed help. And as this is independent of the rest of the status of
pg_conn, any subsequent calls of any PQ* routines returning a PGresult
would remain in the same state.

On top of this emergency code path, don't you think that we need as
well a flag that is switched to 'on' once an OOM error is faced? In
consequence, on a code path where an OOM happens, this flag is
activated. Then any subsequent calls of routines returning a PGresult
checks for this flag, resets it, and returns the emergency result. At
the end, it seems to me that the code paths where we check if the
emergency flag is activated or not are the beginning of PQgetResult
and PQexecFinish. In the case of PQexecFinish(), pending results would
be cleared the next time PQexecStart is called. For PQgetResult, this
gives the possibility to the application to report the OOM properly.
However, successive calls to PQgetResult() would still make the
application wait undefinitely for more data if it doesn't catch the
error.

Another thing that I think needs to be patched is libpqrcv_PQexec by
the way, so as we check if any errors are caught on the way when
calling multiple times PQgetResult or this code path could remain
stuck with an infinite wait. As a result, it seems to me that this
offers no real way to fix completely applications doing something like
that:
PQsendQuery(conn);
for (;;)
{
while (PQisBusy(conn))
{
//wait here for some event
}
result = PQgetResult(conn);
if (!result)
break;
}
The issue is that we'd wait for a NULL result to be received from
PQgetResult, however even with this patch asyncStatus remaining set to
PGASYNC_BUSY would make libpq waiting forever with pqWait for data
that will never show up. We could have a new status for asyncStatus to
be able to skip properly the loops where asyncStatus == PGASYNC_BUSY
and make PQisBusy smarter but this would actually break the semantics
of calling successively PQgetResult() if I am following correctly the
last posts of this thread.

Another, perhaps, better idea is to add some more extra logic to
pg_conn as follows:
boolis_emergency;
PGresult *emergency_result;
boolis_emergency_consumed;
So as when an OOM shows up, is_emergency is set to true. Then a first
call of PQgetResult returns the PGresult with the OOM in
emergency_result, setting is_emergency_consumed to true and switching
is_emergency back to false. Then a second call to PQgetResult enforces
the consumption of any results remaining without waiting for them, at
the end returning NULL to the caller, resetting is_emergency_consumed
to false. That's different compared to the extra failure
PGASYNC_COPY_FAILED in the fact that it does not break PQisBusy().

Thoughts?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-05 Thread Michael Paquier
On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane  wrote:
> I wrote:
>> So the core of my complaint is that we need to fix things so that, whether
>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>> better consider the behavior when we cannot), ...
>
> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
> the OOM limit is this possibility of being unable to create a PGresult
> with which to inform the client that we failed.
>
> I wonder if we could make things better by keeping around an emergency
> backup PGresult struct.  Something along these lines:
>
> 1. Add a field "PGresult *emergency_result" to PGconn.
>
> 2. At the very beginning of any PGresult-returning libpq function, check
> to see if we have an emergency_result, and if not make one, ensuring
> there's room in it for a reasonable-size error message; or maybe even
> preload it with "out of memory" if we assume that's the only condition
> it'll ever be used for.  If malloc fails at this point, just return NULL
> without doing anything or changing any libpq state.  (Since a NULL result
> is documented as possibly caused by OOM, this isn't violating any API.)
>
> 3. Subsequent operations never touch the emergency_result unless we're
> up against an OOM, but it can be used to return a failure indication
> to the client so long as we leave libpq in a state where additional
> calls to PQgetResult would return NULL.
>
> Basically this shifts the point where an unreportable OOM could happen
> from somewhere in the depths of libpq to the very start of an operation,
> where we're presumably in a clean state and OOM failure doesn't leave
> us with a mess we can't clean up.

I have moved this patch to next CF for the time being. As that's a
legit bug and not a feature, that should be fine to pursue work on
this item even if this CF ends.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-01 Thread Tom Lane
I wrote:
> So the core of my complaint is that we need to fix things so that, whether
> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
> better consider the behavior when we cannot), ...

BTW, the real Achilles' heel of any attempt to ensure sane behavior at
the OOM limit is this possibility of being unable to create a PGresult
with which to inform the client that we failed.

I wonder if we could make things better by keeping around an emergency
backup PGresult struct.  Something along these lines:

1. Add a field "PGresult *emergency_result" to PGconn.

2. At the very beginning of any PGresult-returning libpq function, check
to see if we have an emergency_result, and if not make one, ensuring
there's room in it for a reasonable-size error message; or maybe even
preload it with "out of memory" if we assume that's the only condition
it'll ever be used for.  If malloc fails at this point, just return NULL
without doing anything or changing any libpq state.  (Since a NULL result
is documented as possibly caused by OOM, this isn't violating any API.)

3. Subsequent operations never touch the emergency_result unless we're
up against an OOM, but it can be used to return a failure indication
to the client so long as we leave libpq in a state where additional
calls to PQgetResult would return NULL.

Basically this shifts the point where an unreportable OOM could happen
from somewhere in the depths of libpq to the very start of an operation,
where we're presumably in a clean state and OOM failure doesn't leave
us with a mess we can't clean up.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-01 Thread Tom Lane
Amit Kapila  writes:
> Very valid point.  So, if we see with patch, I think libpq will be
> in PGASYNC_COPY_XXX state after such a failure and next time when it tries
> to again execute statement, it will end copy mode and then allow to proceed
> from there.   This design is required for other purposes like silently
> discarding left over result.  Now, I think the other option(if possible)
> seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error
> as we do at some other places in case of error as below and return OOM
> failure as we do in patch.

If we transition to PGASYNC_IDLE then the next PQexecStart will not
realize that it needs to do something to get out of the COPY state;
but it does, since the backend thinks that we're doing COPY.  That was
the reasoning behind my proposal to invent new PGASYNC states.
Obviously there's more than one way to represent this new state, eg
you could have a separate error flag instead --- but it is a new state.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-01 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane  wrote:
>> Anyway, the short of my review is that we need more clarity of thought
>> about what state libpq is in after a failure like this, and what that
>> state looks like to the application, and how that should affect how
>> libpq reacts to application calls.

> Hm. I would have thought that returning to the caller a result
> generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
> string marked "out of memory" would be the best thing to do instead of
> NULL...

Sure, we're trying to do that, but the question is what happens afterward.
An app that is following the documented usage pattern for PQgetResult
will call it again, expecting to get a NULL, but as the patch stands it
might get a PGRES_COPY_XXX instead.  What drove me to decide the patch
wasn't committable was realizing that Robert was right upthread: the
change you propose in libpqwalreceiver.c is not a bug fix.  That code
is doing exactly what it's supposed to do per the libpq.sgml docs, namely
iterate till it gets a NULL.[1]  If we have to change it, that means
(a) we've changed the API spec for PQgetResult and (b) every other
existing caller of PQgetResult would need to change similarly, or else
risk corner-case bugs about as bad as the one we're trying to fix.

So the core of my complaint is that we need to fix things so that, whether
or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
better consider the behavior when we cannot), we need to leave libpq
in a state where subsequent calls to PQgetResult will return NULL.

> If getCopyStart() complains because of a lack of data, we loop
> back into pqParseInput3 to try again. If an OOM happens, we still loop
> into pqParseInput3 but all the pending data received from client needs
> to be discarded so as we don't rely on the next calls of PQexecStart
> to do the cleanup, once all the data is received we get out of the
> loop and generate the result with PGRES_FATAL_ERROR. Does that match
> what you have in mind?

If we could, that would be OK, but (a) you're only considering the
COPY OUT case not the COPY IN case, and (b) a persistent OOM condition
could very well prevent us from ever completing the copy.  For example,
some COPY DATA messages might be too big for our current buffer, but
we won't be able to enlarge the buffer.  I'm inclined to think that
reporting the OOM to the client is the highest-priority goal, and
that attempting to clean up during the next PQexec/PQsendQuery is a
reasonable design.  If we fail to do that, the client won't really
understand that it's a cleanup failure and not a failure to issue the
new query, but it's not clear why it needs to know the difference.

regards, tom lane

[1] Well, almost.  Really, after dealing with a PGRES_COPY_XXX result
it ought to go back to pumping PQgetResult.  It doesn't, but that's
an expected client behavior, and cleaning up after that is exactly what
PQexecStart is intended for.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-04-01 Thread Amit Kapila
On Fri, Apr 1, 2016 at 10:34 AM, Michael Paquier 
wrote:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane  wrote:
> > I thought about this patch a bit more...
> >
> > When I first looked at the patch, I didn't believe that it worked at
> > all: it failed to return a PGRES_COPY_XXX result to the application,
> > and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> > Wouldn't things be hopelessly confused?  I tried it out and saw that
> > indeed it seemed to work in psql, and after tracing through that found
> > that psql has no idea what's going on, but when psql issues its next
> > command PQexecStart manages to get us out of the copy state (barring
> > more OOM failures, anyway).  That seems a bit accidental, though,
> > and for sure it wasn't documented in the patch.
>
> From the patch, that's mentioned:
> +* Stop if we are in copy mode and error has occurred, the pending
results
> +* will be discarded during next execution in PQexecStart.
>

Yeah, and same is mentioned in PQexecStart function as well which indicates
that above assumption is right.

>
> > Anyway, the short of my review is that we need more clarity of thought
> > about what state libpq is in after a failure like this, and what that
> > state looks like to the application, and how that should affect how
> > libpq reacts to application calls.
>

Very valid point.  So, if we see with patch, I think libpq will be
in PGASYNC_COPY_XXX state after such a failure and next time when it tries
to again execute statement, it will end copy mode and then allow to proceed
from there.   This design is required for other purposes like silently
discarding left over result.  Now, I think the other option(if possible)
seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error
as we do at some other places in case of error as below and return OOM
failure as we do in patch.

PQgetResult()

{

..

if (flushResult ||

pqWait(TRUE, FALSE, conn) ||

pqReadData(conn) < 0)

{

/*

* conn->errorMessage has been set by pqWait or pqReadData. We

* want to append it to any already-received error message.

*/

pqSaveErrorResult(conn);

conn->asyncStatus = PGASYNC_IDLE;

return pqPrepareAsyncResult(conn);

}
..
}

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-31 Thread Michael Paquier
On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane  wrote:
> I thought about this patch a bit more...
>
> When I first looked at the patch, I didn't believe that it worked at
> all: it failed to return a PGRES_COPY_XXX result to the application,
> and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> Wouldn't things be hopelessly confused?  I tried it out and saw that
> indeed it seemed to work in psql, and after tracing through that found
> that psql has no idea what's going on, but when psql issues its next
> command PQexecStart manages to get us out of the copy state (barring
> more OOM failures, anyway).  That seems a bit accidental, though,
> and for sure it wasn't documented in the patch.

>From the patch, that's mentioned:
+* Stop if we are in copy mode and error has occurred, the pending results
+* will be discarded during next execution in PQexecStart.

> In any case, having
> libpq in that state would have side-effects on how it responds to
> application actions, which is the core of my complaint about
> PQgetResult behavior.  Those side effects only make sense if the app
> knows (or should have known) that the connection is in COPY state.

OK. So we'd want to avoid relying on subsequent PQ* calls and return
into a clean state once the OOM shows up.

> I think it might be possible to get saner behavior if we invent a
> set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
> semantics that we got the relevant CopyStart message from the backend
> but were unable to report it to the application.  PQexecStart would
> treat these the same as the corresponding normal PGASYNC_COPY_XXX
> states and do what's needful to get out of the copy mode.  But in
> PQgetResult, we would *not* treat those states as a reason to return a
> PGRES_COPY_XXX result to the application.  Probably we'd just return
> NULL, with the implication being "we're done so far as you're concerned,
> please give a new command".  I might be underthinking it though.

I raised something like that a couple of months back, based on the
fact that PGASYNC_* are flags internal to libpq on frontend:
http://www.postgresql.org/message-id/cab7npqtaedwprcqak-czzxwwapdeyr1gug0vwvg7rhzhmaj...@mail.gmail.com
In this case what was debated is PGASYNC_FATAL... Something that is
quite different with your proposal is that we return NULL or something
else, and it did not discard any pending data from the client. For
applications running PQgetResult in a loop that would work.

> Anyway, the short of my review is that we need more clarity of thought
> about what state libpq is in after a failure like this, and what that
> state looks like to the application, and how that should affect how
> libpq reacts to application calls.

Hm. I would have thought that returning to the caller a result
generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
string marked "out of memory" would be the best thing to do instead of
NULL... If getCopyStart() complains because of a lack of data, we loop
back into pqParseInput3 to try again. If an OOM happens, we still loop
into pqParseInput3 but all the pending data received from client needs
to be discarded so as we don't rely on the next calls of PQexecStart
to do the cleanup, once all the data is received we get out of the
loop and generate the result with PGRES_FATAL_ERROR. Does that match
what you have in mind?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-31 Thread Tom Lane
I thought about this patch a bit more...

When I first looked at the patch, I didn't believe that it worked at
all: it failed to return a PGRES_COPY_XXX result to the application,
and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
Wouldn't things be hopelessly confused?  I tried it out and saw that
indeed it seemed to work in psql, and after tracing through that found
that psql has no idea what's going on, but when psql issues its next
command PQexecStart manages to get us out of the copy state (barring
more OOM failures, anyway).  That seems a bit accidental, though,
and for sure it wasn't documented in the patch.  In any case, having
libpq in that state would have side-effects on how it responds to
application actions, which is the core of my complaint about
PQgetResult behavior.  Those side effects only make sense if the app
knows (or should have known) that the connection is in COPY state.

I think it might be possible to get saner behavior if we invent a
set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
semantics that we got the relevant CopyStart message from the backend
but were unable to report it to the application.  PQexecStart would
treat these the same as the corresponding normal PGASYNC_COPY_XXX
states and do what's needful to get out of the copy mode.  But in
PQgetResult, we would *not* treat those states as a reason to return a
PGRES_COPY_XXX result to the application.  Probably we'd just return
NULL, with the implication being "we're done so far as you're concerned,
please give a new command".  I might be underthinking it though.

Anyway, the short of my review is that we need more clarity of thought
about what state libpq is in after a failure like this, and what that
state looks like to the application, and how that should affect how
libpq reacts to application calls.

I'll set this back to Waiting on Author ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-31 Thread Tom Lane
Aleksander Alekseev  writes:
> +1, same here. Lets see whats committer's opinion.

I fooled around with this patch quite a bit but could not bring myself
to pull the trigger, because I think it fundamentally breaks applications
that follow the "repeat PQgetResult until NULL" rule.  The only reason
that psql manages not to fail is that it doesn't do that, it just calls
PQexec; and you've hacked PQexecFinish so that it falls out without
pumping PQgetResult till dry.  But that's not a good solution, it's just
a hack that makes the behavior unprincipled and unlike direct use of
PQgetResult.  The key problem is that, assuming that the changes in
getCopyStart() succeed in returning a PGRES_FATAL_ERROR PGresult, the
application may just follow the rule of doing nothing with it unless it's
the last non-null result from PQgetResult.  And it won't be, because
you've switched libpq's asyncStatus into one or another COPY status, which
will cause PQgetResult to continually create and return PGRES_COPY_XXX
results, which is what it's supposed to do in that situation (cf the last
step in getCopyResult).

Now, this will accidentally fail to fail if PQgetResult's attempt to
manufacture a PGRES_COPY_XXX result fails and returns null, which is
certainly possible if we're up against an OOM situation.  But what if
it doesn't fail --- which is also possible, because a PGRES_COPY_XXX
with no attached data will not need as much space as a PGRES_FATAL_ERROR
with attached error message.  The app probably throws away the
PGRES_FATAL_ERROR and tries to use the PGRES_COPY_XXX result, which
is almost okay, except that it will lack the copy format information
which will be fatal if the application is relying on that.

So AFAICT, this patch kinda sorta works for psql but it is not going
to make things better for other apps.

The other problem is that I don't have a lot of faith in the theory
that getCopyStart is going to be able to make an error PGresult when
it couldn't make a COPY PGresult.  The existing message-receipt routines
that this is modeled on are dealing with PGresults that are expected to
grow large, so throwing away the accumulated PGresult is highly likely
to free enough memory to let us allocate an error PGresult.  Not so
here.

I have to run, but the bottom line is I don't feel comfortable with
this.  It's trying to fix what's a very corner-case problem (since
COPY PGresults aren't large) but it's introducing new corner case
behaviors of its own.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-30 Thread Aleksander Alekseev
> I think this patch needs to be looked upon by committer now.  I have
> done review and added some code in this patch as well long back, just
> see the e-mail [1], patch is just same as it was in October 2015.  I
> think myself and Michael are in agreement that this patch solves the
> reported problem. There is one similar problem [2] reported by Heikki
> which I think can be fixed separately.
> [...]
> Let me know if you think anything more from myside can help in moving
> patch.

+1, same here. Lets see whats committer's opinion.


-- 
Best regards,
Aleksander Alekseev
http://eax.me/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-29 Thread Amit Kapila
On Tue, Mar 29, 2016 at 8:31 PM, David Steele  wrote:

> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>
> Hope this brings some light in.
>>
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.
>
>
I think this patch needs to be looked upon by committer now.  I have done
review and added some code in this patch as well long back, just see the
e-mail [1], patch is just same as it was in October 2015.  I think myself
and Michael are in agreement that this patch solves the reported problem.
There is one similar problem [2] reported by Heikki which I think can be
fixed separately.

I think the main reason of moving this thread to hackers from bugs is to
gain some more traction which I see that it achieves its purpose to some
extent, but I find that more or less we are at same situation as we were
back in October.

Let me know if you think anything more from myside can help in moving patch.


[1] -
http://www.postgresql.org/message-id/cab7npqtnrw8lr1hd7zlnojjc1bp1evw_emadgorv+s-sbcr...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/566ef84f.1030...@iki.fi


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-29 Thread David Steele
On 3/29/16 8:21 PM, Michael Paquier wrote:
> On Wed, Mar 30, 2016 at 12:01 AM, David Steele  wrote:
>> Hi Amit,
>>
>> On 3/22/16 3:37 AM, Michael Paquier wrote:
>>
>>> Hope this brings some light in.
>>
>>
>> Do you know when you'll have time to respond to Michael's last email? I've
>> marked this patch "waiting on author" in the meantime.
> 
> Shouldn't it be "needs review" instead? I am marked as the author of this 
> patch.

Whoops!  I got the author and reviewer backwards.  Set back to "needs
review".

-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-29 Thread Michael Paquier
On Wed, Mar 30, 2016 at 12:01 AM, David Steele  wrote:
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>
>> Hope this brings some light in.
>
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.

Shouldn't it be "needs review" instead? I am marked as the author of this patch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-29 Thread David Steele

Hi Amit,

On 3/22/16 3:37 AM, Michael Paquier wrote:


Hope this brings some light in.


Do you know when you'll have time to respond to Michael's last email? 
I've marked this patch "waiting on author" in the meantime.


Thanks,
--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-22 Thread Michael Paquier
On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila  wrote:
> On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane  wrote:
>>
>> Amit Kapila  writes:
>> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas 
>> > wrote:
>> >> It is very difficult to believe that this is a good idea:
>> >>
>> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>> >>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>> >>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>> >>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>> >>  PQstatus(streamConn) == CONNECTION_BAD)
>> >> break;
>> >>
>> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>>
>> > I think it is not about skipping the FATAL error, rather to stop trying
>> > to
>> > get further results on FATAL error.
>>
>> If the code already includes "lost the connection" as a case to break on,
>> I'm not quite sure why "got a query error" is not.
>>
>
> This error check is exactly same as PQexecFinish() and there some
> explanation is given in comments which hints towards the reason for
> continuing on error, basically both the functions PQexecFinish() and
> libpqrcv_PQexec() returns the last result if there are many and
> PQexecFinish() concatenates the errors as well in some cases.  Do we see any
> use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a couple of hours looking at that, and I still fail to
see how it would be an advantage to stack errors. We reduce the error
message visibility for frontend clients, and this does not prevent
clients to actually see error messages generated by a backend, take a
WAL sender. And actually depending on the message type received we may
end up by simply ignoring those error messages and have libpq discard
them. So it seems like an oversight of 03a571a4.

One could always say that this change breaks the case where multiple
error messages are sent in a row from a client with COPY_* (IN, OUT
and BOTH), because we'd get back the last error message, while with
this change we let client know the first one, though that would mean
that client is missing something when using COPY_BOTH.

Also...
@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)
result->resultStatus == PGRES_COPY_BOTH ||
conn->status == CONNECTION_BAD)
break;
+   else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+ conn->asyncStatus == PGASYNC_COPY_OUT  ||
+ conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+result->resultStatus == PGRES_FATAL_ERROR)
+   break;
The reason behind this check in PQexecFinish is that we need to make
the difference between an error where there is not enough data, which
means that we want to retry again the message fetching through
parseInput3(), and the case where an actual OOM happened, where we
want to exit immediately and let the caller know that there has been a
frontend error. Now the other reason why we went on with
PGRES_FATAL_ERROR here is that it was discussed that it was an
overkill to assign a special status to PGASYNC to detect a
frontend-side failure, because we already treat other frontend-side
errors with PGRES_FATAL_ERROR and assign an error message to them (see
for example when an allocation fails for 'C', introduced in another
patch of the OOM failure series), and we still need to know that we
are in PGASYNC_COPY_* mode in this code path as well because it means
that the start message has been processed, but we had a failure in
deparsing it so we bypassed it, and need to fail immediately. So using
PGREC_FATAL_ERROR simplifies the code paths creating an error stack.

Hope this brings some light in.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-21 Thread Amit Kapila
On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas 
wrote:
> >> It is very difficult to believe that this is a good idea:
> >>
> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
> >>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
> >>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
> >>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
> >>  PQstatus(streamConn) == CONNECTION_BAD)
> >> break;
> >>
> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>
> > I think it is not about skipping the FATAL error, rather to stop trying
to
> > get further results on FATAL error.
>
> If the code already includes "lost the connection" as a case to break on,
> I'm not quite sure why "got a query error" is not.
>

This error check is exactly same as PQexecFinish() and there some
explanation is given in comments which hints towards the reason for
continuing on error, basically both the functions PQexecFinish()
and libpqrcv_PQexec() returns the last result if there are many
and PQexecFinish() concatenates the errors as well in some cases.  Do we
see any use in continuing to get result after getting PGRES_FATAL_ERROR
error?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-21 Thread Tom Lane
Amit Kapila  writes:
> On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas  wrote:
>> It is very difficult to believe that this is a good idea:
>> 
>> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>>  PQstatus(streamConn) == CONNECTION_BAD)
>> break;
>> 
>> I mean, why would it be a good idea to blindly skip over fatal errors?

> I think it is not about skipping the FATAL error, rather to stop trying to
> get further results on FATAL error.

If the code already includes "lost the connection" as a case to break on,
I'm not quite sure why "got a query error" is not.  Maybe that PQstatus
check is broken too, but it doesn't seem like this patch makes it more so.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-21 Thread Amit Kapila
On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas  wrote:
>
> On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
>  wrote:
> > Thoughts? I have registered that in the CF app, and a patch is attached.
>
> It is very difficult to believe that this is a good idea:
>
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>  PQstatus(streamConn) == CONNECTION_BAD)
>  break;
>  }
>
> I mean, why would it be a good idea to blindly skip over fatal errors?
>

I think it is not about skipping the FATAL error, rather to stop trying to
get further results on FATAL error.  This is to ensure that OOM error is
reported rather that ignored.  There has been discussion about this
previously as well [1].


[1] -
http://www.postgresql.org/message-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w...@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-21 Thread Robert Haas
On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier
 wrote:
> Thoughts? I have registered that in the CF app, and a patch is attached.

It is very difficult to believe that this is a good idea:

--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
 if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
 PQresultStatus(lastResult) == PGRES_COPY_OUT ||
 PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
+PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
 PQstatus(streamConn) == CONNECTION_BAD)
 break;
 }

I mean, why would it be a good idea to blindly skip over fatal errors?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-09 Thread Michael Paquier
On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera
 wrote:
> Aleksander Alekseev wrote:
>> pg_receivexlog: could not send replication command "START_REPLICATION":
>> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
>> again pg_receivexlog: starting log streaming at 0/100 (timeline 1)
>>
>> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
>> msgLength=3) at fe-protocol3.c:1398 1398  const char
>> *errmsg = NULL;
>> ```
>>
>> Granted this behaviour is a bit better then the current one. But
>> basically it's the same infinite loop only with pauses and warnings. I
>> wonder if this is a behaviour we really want. For instance wouldn't it
>> be better just to terminate an application in out-of-memory case? "Let
>> it crash" as Erlang programmers say.
>
> Hmm.  It would be useful to retry in the case that there is a chance
> that the program releases memory and can continue later.  But if it will
> only stay there doing nothing other than retrying, then that obviously
> will not happen.  One situation where this might help is if the overall
> *system* is short on memory and we expect that situation to resolve
> itself after a while -- after all, if the system is so loaded that it
> can't allocate a few more bytes for the COPY message, then odds are that
> other things are also crashing and eventually enough memory will be
> released that pg_receivexlog can continue.

Yep, that's my assumption regarding that, at some point the system may
succeed, and I don't think that we should break the current behaviors
of pg_receivexlog and pg_recvlogical regarding that in the
back-branches. Now, note that without the patch we actually have the
same problem. Say if OOMs happen continuously in getCopyStart, with
COPY_BOTH libpq would attempt to read the next message continuously
and would keep failing. Except that in this case the caller has no
idea what is happening as things keep running in libpq itself.

> On the other hand, if the system is so loaded, perhaps it's better to
> "let it crash" and have it restart later -- presumably once the admin
> notices the problem and restarts it manually after cleaning up the mess.
>
> If all programs are well behaved and nothing crashes when OOM but they
> all retry instead, then everything will continue to retry infinitely and
> make no progress.  That cannot be good.

That's something we could take care of in those client utilities I
think with a new option like --maximum-retries or similar, but anyway
I think that's a different discussion. The patch I am proposing here
allows a client application to be made aware of OOM errors that
happen. If we don't do something about that first, something like
--maximum-retries would be useless for COPY_BOTH as the client will
never be made aware of the OOM that happened in libpq and would keep
looping inside libpq itself until some memory is freed.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-09 Thread Alvaro Herrera
Aleksander Alekseev wrote:

> pg_receivexlog: could not send replication command "START_REPLICATION":
> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
> again pg_receivexlog: starting log streaming at 0/100 (timeline 1)
> 
> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
> msgLength=3) at fe-protocol3.c:1398 1398  const char
> *errmsg = NULL;
> ```
> 
> Granted this behaviour is a bit better then the current one. But
> basically it's the same infinite loop only with pauses and warnings. I
> wonder if this is a behaviour we really want. For instance wouldn't it
> be better just to terminate an application in out-of-memory case? "Let
> it crash" as Erlang programmers say.

Hmm.  It would be useful to retry in the case that there is a chance
that the program releases memory and can continue later.  But if it will
only stay there doing nothing other than retrying, then that obviously
will not happen.  One situation where this might help is if the overall
*system* is short on memory and we expect that situation to resolve
itself after a while -- after all, if the system is so loaded that it
can't allocate a few more bytes for the COPY message, then odds are that
other things are also crashing and eventually enough memory will be
released that pg_receivexlog can continue.

On the other hand, if the system is so loaded, perhaps it's better to
"let it crash" and have it restart later -- presumably once the admin
notices the problem and restarts it manually after cleaning up the mess.

If all programs are well behaved and nothing crashes when OOM but they
all retry instead, then everything will continue to retry infinitely and
make no progress.  That cannot be good.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-09 Thread Aleksander Alekseev
Hello, Michael

Thanks a lot for steps to reproduce you provided.

I tested your path on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD
10.2 RELEASE (Clang 3.4.1). In both cases patch applies cleanly, there
are no warnings during compilation and all regression tests pass. A few
files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c
b/src/interfaces/libpq/fe-exec.c index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)
conn->status == CONNECTION_BAD)
break;
else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
- conn->asyncStatus == PGASYNC_COPY_OUT  ||
+ conn->asyncStatus == PGASYNC_COPY_OUT ||
  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
 result->resultStatus == PGRES_FATAL_ERROR)
break;
diff --git a/src/interfaces/libpq/fe-protocol3.c
b/src/interfaces/libpq/fe-protocol3.c index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static intgetParamDescriptions(PGconn *conn, int
msgLength); static int getAnotherTuple(PGconn *conn, int msgLength);
 static int getParameterStatus(PGconn *conn);
 static int getNotify(PGconn *conn);
-static int getCopyStart(PGconn *conn,
-ExecStatusType copytype,
-int msgLength);
+static int getCopyStart(PGconn *conn,
+ExecStatusType copytype,
+int msgLength);
 static int getReadyForQuery(PGconn *conn);
 static void reportErrorPosition(PQExpBuffer msg, const char *query,
int loc, int encoding);
```

Indeed your patch solves issues you described. Still here is something
that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/

(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose
-D ./temp_data/ [Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1". pg_receivexlog: starting log
streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL; (gdb) n
1400result = PQmakeEmptyPGresult(conn, copytype);
(gdb) 
1401if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL; (gdb) n
1400result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION":
out of memory pg_receivexlog: disconnected; waiting 5 seconds to try
again pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH,
msgLength=3) at fe-protocol3.c:1398 1398const char
*errmsg = NULL;
```

Granted this behaviour is a bit better then the current one. But
basically it's the same infinite loop only with pauses and warnings. I
wonder if this is a behaviour we really want. For instance wouldn't it
be better just to terminate an application in out-of-memory case? "Let
it crash" as Erlang programmers say.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Michael Paquier
On Fri, Mar 4, 2016 at 12:59 AM, Aleksander Alekseev
 wrote:
>> The easiest way to perform tests with this patch is to take a debugger
>> and enforce the malloc'd pointers to NULL in the code paths.
>
> I see. Still I don't think it's an excuse to not provide clear steps to
> reproduce an issue. As I see it anyone should be able to easily check
> your patch locally without having deep understanding of how libpq is
> implemented or reading thread which contains 48 e-mails.

OK, here they are if that helps. Following those steps and having some
knowledge of lldb or gdb is enough. The key point is that getCopyStart
is the routine to breakpoint and make fail.
A) psql and COPY.
1) gdb --args psql -c 'COPY (SELECT 1) TO stdout'
2) Then take a breakpoint at getCopyStart()
3) run
4) Enforce the return result of PQmakeEmptyPGresult to NULL:
p result = 0
5) Enjoy the infinite loop with HEAD, and the error with the patch

B) pg_receivexlog:
1) mkdir data
gdb --args pg_receivexlog --verbose -D data/
2) Take a breakpoint at getCopyStart
3) run
4) enforce result = 0 after the call to PQmakeEmptyPGresult
5) And enjoy what has been reported

C) pg_recvlogical, similar to B.
1) Create a replication slot on a server that accepts them
select pg_create_logical_replication_slot('foo', 'test_decoding');
2) Same breakpoint as B) with this start command or similar (depends
on where the logical slot has been created):
pg_recvlogical --start --slot foo -f - -d postgres

D) triggering standby problem:
1) Patch upstream, as follows by adding a sleep in
libpqrcv_startstreaming of libpqwalreceiver.c. This is a simple method
to have the time to take a breakpoint on the WAL receiver process that
has been started, with streaming that has not begun yet.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr
startpoint, char *slotname)
charcmd[256];
PGresult   *res;

+   /* ZZZzzz... */
+   pg_usleep(1L);
+
/* Start streaming from the point requested by startup process */
if (slotname != NULL)
snprintf(cmd, sizeof(cmd),
Trying to design a test taking into account the context of a WAL
receiver does not seem worth it to me with a simple method like this
one...
2) Start the standby and attach debugger on the WAL receiver process,
send SIGSTOP to it, whatever...
3) breakpoint at getCopyStart
4) Then move on with the same process as previously, and enjoy the errors.
-- 
Michael
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr startpoint, char *slotname)
 	char		cmd[256];
 	PGresult   *res;
 
+	/* ZZZzzz... */
+	pg_usleep(1L);
+
 	/* Start streaming from the point requested by startup process */
 	if (slotname != NULL)
 		snprintf(cmd, sizeof(cmd),

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Aleksander Alekseev
Hello, Michael

> The easiest way to perform tests with this patch is to take a debugger
> and enforce the malloc'd pointers to NULL in the code paths.

I see. Still I don't think it's an excuse to not provide clear steps to
reproduce an issue. As I see it anyone should be able to easily check
your patch locally without having deep understanding of how libpq is
implemented or reading thread which contains 48 e-mails.

For instance you cold provide a step-by-step instruction like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you cold automate these steps using gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2

... where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional. But it will simplify both
understanding and reviewing of this path. Thus chances that this patch
will be accepted will be increased and time required for this will be
reduced significantly.

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Michael Paquier
On Thu, Mar 3, 2016 at 10:18 PM, Aleksander Alekseev
 wrote:
> I didn't checked your patch in detail yet but here is a thought I would
> like to share.
>
> In my experience usually it takes number of rewrites before patch will
> be accepted. To make sure that after every rewrite your patch still
> solves an issue you described you should probably provide a few test
> programs too. If it's possible to make these programs part of regression
> tests suite it would be just great.

Yes that's usually the case, this patch is not at its first version.

> Without such test programs I personally find it difficult to verify
> that your patch fixes something. If such examples are provided your
> patch would be much more likely to accepted. "See, this program crashes
> everything. Now we apply a patch and everything works! Cool, heh?"

The easiest way to perform tests with this patch is to take a debugger
and enforce the malloc'd pointers to NULL in the code paths. That's
wild but efficient, and that's actually for we did for the other two
libpq patches treating those OOM failures. In the case of the standby
code path what I did was to put a sleep in the WAL receiver code and
then trigger the OOM manually, leading to the errors listed upthread.
Regarding automated tests, it is largely feasible to have tests on
Linux at least by using for example LD_PRELOAD or glibc-specific
malloc hooks, though those would be platform-restricted. Here is for
example one for COPY IN/OUT, that this patch also passed (there has
been already some discussion for this patch).
http://www.postgresql.org/message-id/55fc501a.5000...@iki.fi
Applying it for recovery is doable as well by manipulating the
recovery protocol with libpq though it is aimed really at covering a
narrow case.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

2016-03-03 Thread Aleksander Alekseev
Hello, Michael

I didn't checked your patch in detail yet but here is a thought I would
like to share.

In my experience usually it takes number of rewrites before patch will
be accepted. To make sure that after every rewrite your patch still
solves an issue you described you should probably provide a few test
programs too. If it's possible to make these programs part of regression
tests suite it would be just great.

Without such test programs I personally find it difficult to verify
that your patch fixes something. If such examples are provided your
patch would be much more likely to accepted. "See, this program crashes
everything. Now we apply a patch and everything works! Cool, heh?"

Best regards,
Aleksander


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers