Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Mon, Apr 18, 2016 at 4:48 PM, Michael Paquier wrote:
> Another, perhaps, better idea is to add some more extra logic to
> pg_conn as follows:
> bool is_emergency;
> PGresult *emergency_result;
> bool is_emergency_consumed;
> So when an OOM shows up, is_emergency is set to true. Then a first
> call of PQgetResult returns the PGresult with the OOM in
> emergency_result, setting is_emergency_consumed to true and switching
> is_emergency back to false. Then a second call to PQgetResult enforces
> the consumption of any results remaining without waiting for them, at
> the end returning NULL to the caller and resetting is_emergency_consumed
> to false. That differs from the extra failure state
> PGASYNC_COPY_FAILED in that it does not break PQisBusy().

So, in terms of code, it gives more or less the attached. I have coupled the emergency_result with an enum flag emergency_status that has three states:
- NONE, which is the default.
- ENABLE, set when an OOM or other error has been detected in libpq, making the next call of PQgetResult return the emergency_result.
- CONSUMED, set after PQgetResult has consumed emergency_result, so that the next call returns NULL and lets the caller get out of any PQgetResult loop expecting NULL at the end. Once NULL is returned, the flag is switched back to NONE.

This works in the case of getCopyStart because the OOM failures happen before any messages are sent to the server. The checks for the emergency result are done in PQgetResult, so any upper-level routine gets the message.

Tom, others, what do you think?
-- Michael

diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 2621767..c5d7efd 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1689,6 +1689,30 @@ PQgetResult(PGconn *conn)
 	if (!conn)
 		return NULL;

+	if (!conn->emergency_result)
+	{
+		conn->emergency_result = PQmakeEmptyPGresult(conn, PGRES_FATAL_ERROR);
+		if (!conn->emergency_result)
+			return NULL;
+		conn->emergency_status = PGEMERGENCY_NONE;
+	}
+
+	/*
+	 * Check for any error that could have happened with the client.
+	 * Do this check before parsing any new commands.
+	 */
+	switch (conn->emergency_status)
+	{
+		case PGEMERGENCY_NONE:
+			break;
+		case PGEMERGENCY_ENABLE:
+			conn->emergency_status = PGEMERGENCY_CONSUMED;
+			return conn->emergency_result;
+		case PGEMERGENCY_CONSUMED:
+			conn->emergency_status = PGEMERGENCY_NONE;
+			return NULL;
+	}
+
 	/* Parse any available data, if our state permits. */
 	parseInput(conn);
@@ -1698,6 +1722,22 @@ PQgetResult(PGconn *conn)
 		int			flushResult;

 		/*
+		 * Check for any error that could have happened with
+		 * the client.
+		 */
+		switch (conn->emergency_status)
+		{
+			case PGEMERGENCY_NONE:
+				break;
+			case PGEMERGENCY_ENABLE:
+				conn->emergency_status = PGEMERGENCY_CONSUMED;
+				return conn->emergency_result;
+			case PGEMERGENCY_CONSUMED:
+				conn->emergency_status = PGEMERGENCY_NONE;
+				return NULL;
+		}
+
+		/*
 		 * If data remains unsent, send it. Else we might be waiting for the
 		 * result of a command the backend hasn't even got yet.
 		 */
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 0b8c62f..6d99d12 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -1453,7 +1453,7 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 	result = PQmakeEmptyPGresult(conn, copytype);
 	if (!result)
-		goto failure;
+		goto oom_failure;

 	if (pqGetc(&conn->copy_is_binary, conn))
 		goto failure;
@@ -1469,7 +1469,7 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 		result->attDescs = (PGresAttDesc *)
 			pqResultAlloc(result, nfields * sizeof(PGresAttDesc), TRUE);
 		if (!result->attDescs)
-			goto failure;
+			goto oom_failure;
 		MemSet(result->attDescs, 0, nfields * sizeof(PGresAttDesc));
 	}
@@ -1492,6 +1492,10 @@ getCopyStart(PGconn *conn, ExecStatusType copytype)
 	conn->result = result;
 	return 0;

+oom_failure:
+	conn->emergency_status = PGEMERGENCY_ENABLE;
+	printfPQExpBuffer(&conn->errorMessage,
+					  libpq_gettext("out of memory\n"));
 failure:
 	PQclear(result);
 	return EOF;
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index 1183323..e9dd167 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -223,6 +223,19 @@ typedef enum
 	PGASYNC_COPY_BOTH			/* Copy In/Out data transfer in progress */
 } PGAsyncStatusType;

+/*
+ * PGemergencyState defines the consumption status of emergency_result
+ * for client-side error handling, particularly out-of-memory errors
+ * that could happen.
+ */
+typedef enum
+{
+	PGEMERGENCY_NONE,			/* emergency result allocated, not needed */
+	PGEMERGENCY_ENABLE,			/* ready to be consumed by client */
+	PGEMERGENCY_CONSUMED		/* consumed by client, next call querying
+								 * for a result will get NULL. */
+} PGemergencyState;
+
 /* PGQueryC
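For readers following along, the two-call behavior this patch gives PQgetResult can be illustrated with a small standalone simulation. Everything below (FakeConn, fake_PQgetResult) is a stand-in invented for this sketch, not libpq's real internals; only the enum mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the patch's enum; the rest is illustration only. */
typedef enum
{
	PGEMERGENCY_NONE,
	PGEMERGENCY_ENABLE,
	PGEMERGENCY_CONSUMED
} PGemergencyState;

/* Minimal stand-in for the pg_conn fields the patch touches. */
typedef struct
{
	PGemergencyState emergency_status;
	const char *emergency_result;	/* stands in for PGresult * */
} FakeConn;

/*
 * Stand-in for the check at the top of PQgetResult: after an OOM the
 * first call hands back the emergency result, the second returns NULL,
 * so callers looping "until PQgetResult returns NULL" can terminate.
 */
static const char *
fake_PQgetResult(FakeConn *conn)
{
	switch (conn->emergency_status)
	{
		case PGEMERGENCY_ENABLE:
			conn->emergency_status = PGEMERGENCY_CONSUMED;
			return conn->emergency_result;
		case PGEMERGENCY_CONSUMED:
			conn->emergency_status = PGEMERGENCY_NONE;
			return NULL;
		case PGEMERGENCY_NONE:
			break;
	}
	return "regular result";	/* the real code would go parse input */
}
```

The point of the CONSUMED intermediate state is visible here: the error is reported exactly once, and the very next call terminates the caller's loop.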
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Wed, Apr 6, 2016 at 3:09 PM, Michael Paquier wrote:
> On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane wrote:
>> I wrote:
>>> So the core of my complaint is that we need to fix things so that, whether
>>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>>> better consider the behavior when we cannot), ...
>>
>> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
>> the OOM limit is this possibility of being unable to create a PGresult
>> with which to inform the client that we failed.
>>
>> I wonder if we could make things better by keeping around an emergency
>> backup PGresult struct. Something along these lines:
>>
>> 1. Add a field "PGresult *emergency_result" to PGconn.
>>
>> 2. At the very beginning of any PGresult-returning libpq function, check
>> to see if we have an emergency_result, and if not make one, ensuring
>> there's room in it for a reasonable-size error message; or maybe even
>> preload it with "out of memory" if we assume that's the only condition
>> it'll ever be used for. If malloc fails at this point, just return NULL
>> without doing anything or changing any libpq state. (Since a NULL result
>> is documented as possibly caused by OOM, this isn't violating any API.)
>>
>> 3. Subsequent operations never touch the emergency_result unless we're
>> up against an OOM, but it can be used to return a failure indication
>> to the client so long as we leave libpq in a state where additional
>> calls to PQgetResult would return NULL.
>>
>> Basically this shifts the point where an unreportable OOM could happen
>> from somewhere in the depths of libpq to the very start of an operation,
>> where we're presumably in a clean state and OOM failure doesn't leave
>> us with a mess we can't clean up.

Sorry for the late reply here. I am looking back at this thing more seriously. So, if I understand that correctly.
As the emergency result is pre-allocated, this covers any future OOM error, because we can always fall back to this result in case of problems, so this would indeed help. And as this is independent of the rest of the status of pg_conn, any subsequent calls of any PQ* routines returning a PGresult would remain in the same state.

On top of this emergency code path, don't you think that we also need a flag that is switched on once an OOM error is faced? In consequence, on a code path where an OOM happens, this flag is activated. Then any subsequent call of a routine returning a PGresult checks for this flag, resets it, and returns the emergency result. At the end, it seems to me that the places where we need to check whether the emergency flag is activated are the beginning of PQgetResult and PQexecFinish. In the case of PQexecFinish(), pending results would be cleared the next time PQexecStart is called. For PQgetResult, this gives the application the possibility to report the OOM properly. However, successive calls to PQgetResult() would still make the application wait indefinitely for more data if it doesn't catch the error.

Another thing that I think needs to be patched, by the way, is libpqrcv_PQexec, so that we check whether any errors are caught on the way when calling PQgetResult multiple times, or this code path could remain stuck in an infinite wait.

As a result, it seems to me that this offers no real way to completely fix applications doing something like this:

PQsendQuery(conn);
for (;;)
{
    while (PQisBusy(conn))
    {
        // wait here for some event
    }
    result = PQgetResult(conn);
    if (!result)
        break;
}

The issue is that we'd wait for a NULL result to be received from PQgetResult; however, even with this patch, asyncStatus remaining set to PGASYNC_BUSY would make libpq wait forever in pqWait for data that will never show up.
We could have a new status for asyncStatus to be able to properly skip the loops where asyncStatus == PGASYNC_BUSY and make PQisBusy smarter, but this would actually break the semantics of calling PQgetResult() successively, if I am following the last posts of this thread correctly.

Another, perhaps, better idea is to add some more extra logic to pg_conn as follows:

bool is_emergency;
PGresult *emergency_result;
bool is_emergency_consumed;

So when an OOM shows up, is_emergency is set to true. Then a first call of PQgetResult returns the PGresult with the OOM in emergency_result, setting is_emergency_consumed to true and switching is_emergency back to false. Then a second call to PQgetResult enforces the consumption of any results remaining without waiting for them, at the end returning NULL to the caller and resetting is_emergency_consumed to false. That differs from the extra failure state PGASYNC_COPY_FAILED in that it does not break PQisBusy().

Thoughts?
-- Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Sat, Apr 2, 2016 at 12:30 AM, Tom Lane wrote:
> I wrote:
>> So the core of my complaint is that we need to fix things so that, whether
>> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
>> better consider the behavior when we cannot), ...
>
> BTW, the real Achilles' heel of any attempt to ensure sane behavior at
> the OOM limit is this possibility of being unable to create a PGresult
> with which to inform the client that we failed.
>
> I wonder if we could make things better by keeping around an emergency
> backup PGresult struct. Something along these lines:
>
> 1. Add a field "PGresult *emergency_result" to PGconn.
>
> 2. At the very beginning of any PGresult-returning libpq function, check
> to see if we have an emergency_result, and if not make one, ensuring
> there's room in it for a reasonable-size error message; or maybe even
> preload it with "out of memory" if we assume that's the only condition
> it'll ever be used for. If malloc fails at this point, just return NULL
> without doing anything or changing any libpq state. (Since a NULL result
> is documented as possibly caused by OOM, this isn't violating any API.)
>
> 3. Subsequent operations never touch the emergency_result unless we're
> up against an OOM, but it can be used to return a failure indication
> to the client so long as we leave libpq in a state where additional
> calls to PQgetResult would return NULL.
>
> Basically this shifts the point where an unreportable OOM could happen
> from somewhere in the depths of libpq to the very start of an operation,
> where we're presumably in a clean state and OOM failure doesn't leave
> us with a mess we can't clean up.

I have moved this patch to the next CF for the time being. As that's a legit bug and not a feature, it should be fine to pursue work on this item even after this CF ends.
-- Michael
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
I wrote:
> So the core of my complaint is that we need to fix things so that, whether
> or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd
> better consider the behavior when we cannot), ...

BTW, the real Achilles' heel of any attempt to ensure sane behavior at the OOM limit is this possibility of being unable to create a PGresult with which to inform the client that we failed.

I wonder if we could make things better by keeping around an emergency backup PGresult struct. Something along these lines:

1. Add a field "PGresult *emergency_result" to PGconn.

2. At the very beginning of any PGresult-returning libpq function, check to see if we have an emergency_result, and if not make one, ensuring there's room in it for a reasonable-size error message; or maybe even preload it with "out of memory" if we assume that's the only condition it'll ever be used for. If malloc fails at this point, just return NULL without doing anything or changing any libpq state. (Since a NULL result is documented as possibly caused by OOM, this isn't violating any API.)

3. Subsequent operations never touch the emergency_result unless we're up against an OOM, but it can be used to return a failure indication to the client so long as we leave libpq in a state where additional calls to PQgetResult would return NULL.

Basically this shifts the point where an unreportable OOM could happen from somewhere in the depths of libpq to the very start of an operation, where we're presumably in a clean state and OOM failure doesn't leave us with a mess we can't clean up.

regards, tom lane
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Amit Kapila writes:
> Very valid point. So, if we see with patch, I think libpq will be
> in PGASYNC_COPY_XXX state after such a failure and next time when it tries
> to again execute statement, it will end copy mode and then allow to proceed
> from there. This design is required for other purposes like silently
> discarding left over result. Now, I think the other option(if possible)
> seems to be that we can put libpq in PGASYNC_IDLE, if we get such an error
> as we do at some other places in case of error as below and return OOM
> failure as we do in patch.

If we transition to PGASYNC_IDLE then the next PQexecStart will not realize that it needs to do something to get out of the COPY state; but it does, since the backend thinks that we're doing COPY. That was the reasoning behind my proposal to invent new PGASYNC states. Obviously there's more than one way to represent this new state, e.g. you could have a separate error flag instead --- but it is a new state.

regards, tom lane
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Michael Paquier writes:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane wrote:
>> Anyway, the short of my review is that we need more clarity of thought
>> about what state libpq is in after a failure like this, and what that
>> state looks like to the application, and how that should affect how
>> libpq reacts to application calls.

> Hm. I would have thought that returning to the caller a result
> generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error
> string marked "out of memory" would be the best thing to do instead of
> NULL...

Sure, we're trying to do that, but the question is what happens afterward. An app that is following the documented usage pattern for PQgetResult will call it again, expecting to get a NULL, but as the patch stands it might get a PGRES_COPY_XXX instead.

What drove me to decide the patch wasn't committable was realizing that Robert was right upthread: the change you propose in libpqwalreceiver.c is not a bug fix. That code is doing exactly what it's supposed to do per the libpq.sgml docs, namely iterate till it gets a NULL.[1] If we have to change it, that means (a) we've changed the API spec for PQgetResult and (b) every other existing caller of PQgetResult would need to change similarly, or else risk corner-case bugs about as bad as the one we're trying to fix.

So the core of my complaint is that we need to fix things so that, whether or not we are able to create the PGRES_FATAL_ERROR PGresult (and we'd better consider the behavior when we cannot), we need to leave libpq in a state where subsequent calls to PQgetResult will return NULL.

> If getCopyStart() complains because of a lack of data, we loop
> back into pqParseInput3 to try again. If an OOM happens, we still loop
> into pqParseInput3 but all the pending data received from client needs
> to be discarded so as we don't rely on the next calls of PQexecStart
> to do the cleanup, once all the data is received we get out of the
> loop and generate the result with PGRES_FATAL_ERROR. Does that match
> what you have in mind?

If we could, that would be OK, but (a) you're only considering the COPY OUT case not the COPY IN case, and (b) a persistent OOM condition could very well prevent us from ever completing the copy. For example, some COPY DATA messages might be too big for our current buffer, but we won't be able to enlarge the buffer.

I'm inclined to think that reporting the OOM to the client is the highest-priority goal, and that attempting to clean up during the next PQexec/PQsendQuery is a reasonable design. If we fail to do that, the client won't really understand that it's a cleanup failure and not a failure to issue the new query, but it's not clear why it needs to know the difference.

regards, tom lane

[1] Well, almost. Really, after dealing with a PGRES_COPY_XXX result it ought to go back to pumping PQgetResult. It doesn't, but that's an expected client behavior, and cleaning up after that is exactly what PQexecStart is intended for.
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Fri, Apr 1, 2016 at 10:34 AM, Michael Paquier wrote:
> On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane wrote:
>> I thought about this patch a bit more...
>>
>> When I first looked at the patch, I didn't believe that it worked at
>> all: it failed to return a PGRES_COPY_XXX result to the application,
>> and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
>> Wouldn't things be hopelessly confused? I tried it out and saw that
>> indeed it seemed to work in psql, and after tracing through that found
>> that psql has no idea what's going on, but when psql issues its next
>> command PQexecStart manages to get us out of the copy state (barring
>> more OOM failures, anyway). That seems a bit accidental, though,
>> and for sure it wasn't documented in the patch.
>
> From the patch, that's mentioned:
> +* Stop if we are in copy mode and error has occurred, the pending results
> +* will be discarded during next execution in PQexecStart.

Yeah, and the same is mentioned in the PQexecStart function as well, which indicates that the above assumption is right.

>> Anyway, the short of my review is that we need more clarity of thought
>> about what state libpq is in after a failure like this, and what that
>> state looks like to the application, and how that should affect how
>> libpq reacts to application calls.

Very valid point. So, if we see with the patch, I think libpq will be in a PGASYNC_COPY_XXX state after such a failure, and the next time it tries to execute a statement it will end copy mode and then allow to proceed from there. This design is required for other purposes, like silently discarding leftover results. Now, I think the other option (if possible) seems to be that we can put libpq in PGASYNC_IDLE if we get such an error, as we do at some other places in case of error as below, and return OOM failure as we do in the patch.

PQgetResult()
{
..
    if (flushResult ||
        pqWait(TRUE, FALSE, conn) ||
        pqReadData(conn) < 0)
    {
        /*
         * conn->errorMessage has been set by pqWait or pqReadData. We
         * want to append it to any already-received error message.
         */
        pqSaveErrorResult(conn);
        conn->asyncStatus = PGASYNC_IDLE;
        return pqPrepareAsyncResult(conn);
    }
..
}

With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Fri, Apr 1, 2016 at 12:48 PM, Tom Lane wrote:
> I thought about this patch a bit more...
>
> When I first looked at the patch, I didn't believe that it worked at
> all: it failed to return a PGRES_COPY_XXX result to the application,
> and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX?
> Wouldn't things be hopelessly confused? I tried it out and saw that
> indeed it seemed to work in psql, and after tracing through that found
> that psql has no idea what's going on, but when psql issues its next
> command PQexecStart manages to get us out of the copy state (barring
> more OOM failures, anyway). That seems a bit accidental, though,
> and for sure it wasn't documented in the patch.

From the patch, that's mentioned:
+* Stop if we are in copy mode and error has occurred, the pending results
+* will be discarded during next execution in PQexecStart.

> In any case, having
> libpq in that state would have side-effects on how it responds to
> application actions, which is the core of my complaint about
> PQgetResult behavior. Those side effects only make sense if the app
> knows (or should have known) that the connection is in COPY state.

OK. So we'd want to avoid relying on subsequent PQ* calls and return to a clean state once the OOM shows up.

> I think it might be possible to get saner behavior if we invent a
> set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the
> semantics that we got the relevant CopyStart message from the backend
> but were unable to report it to the application. PQexecStart would
> treat these the same as the corresponding normal PGASYNC_COPY_XXX
> states and do what's needful to get out of the copy mode. But in
> PQgetResult, we would *not* treat those states as a reason to return a
> PGRES_COPY_XXX result to the application. Probably we'd just return
> NULL, with the implication being "we're done so far as you're concerned,
> please give a new command". I might be underthinking it though.

I raised something like that a couple of months back, based on the fact that the PGASYNC_* flags are internal to libpq on the frontend side:
http://www.postgresql.org/message-id/cab7npqtaedwprcqak-czzxwwapdeyr1gug0vwvg7rhzhmaj...@mail.gmail.com
In that case what was debated is PGASYNC_FATAL... Something that is quite different with your proposal is that we return NULL or something else, and it does not discard any pending data from the client. For applications running PQgetResult in a loop, that would work.

> Anyway, the short of my review is that we need more clarity of thought
> about what state libpq is in after a failure like this, and what that
> state looks like to the application, and how that should affect how
> libpq reacts to application calls.

Hm. I would have thought that returning to the caller a result generated by PQmakeEmptyPGresult(PGRES_FATAL_ERROR) with an error string marked "out of memory" would be the best thing to do instead of NULL... If getCopyStart() complains because of a lack of data, we loop back into pqParseInput3 to try again. If an OOM happens, we still loop into pqParseInput3, but all the pending data received needs to be discarded so that we don't rely on the next calls of PQexecStart to do the cleanup; once all the data is received we get out of the loop and generate the result with PGRES_FATAL_ERROR. Does that match what you have in mind?

-- Michael
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
I thought about this patch a bit more...

When I first looked at the patch, I didn't believe that it worked at all: it failed to return a PGRES_COPY_XXX result to the application, and yet nonetheless switched libpq's asyncStatus to PGASYNC_COPY_XXX? Wouldn't things be hopelessly confused? I tried it out and saw that indeed it seemed to work in psql, and after tracing through that found that psql has no idea what's going on, but when psql issues its next command PQexecStart manages to get us out of the copy state (barring more OOM failures, anyway). That seems a bit accidental, though, and for sure it wasn't documented in the patch. In any case, having libpq in that state would have side-effects on how it responds to application actions, which is the core of my complaint about PQgetResult behavior. Those side effects only make sense if the app knows (or should have known) that the connection is in COPY state.

I think it might be possible to get saner behavior if we invent a set of new asyncStatus values PGASYNC_COPY_XXX_FAILED, having the semantics that we got the relevant CopyStart message from the backend but were unable to report it to the application. PQexecStart would treat these the same as the corresponding normal PGASYNC_COPY_XXX states and do what's needful to get out of the copy mode. But in PQgetResult, we would *not* treat those states as a reason to return a PGRES_COPY_XXX result to the application. Probably we'd just return NULL, with the implication being "we're done so far as you're concerned, please give a new command". I might be underthinking it though.

Anyway, the short of my review is that we need more clarity of thought about what state libpq is in after a failure like this, and what that state looks like to the application, and how that should affect how libpq reacts to application calls. I'll set this back to Waiting on Author ...
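The asymmetry proposed above — PQexecStart still performs copy-mode cleanup, while PQgetResult stops advertising COPY — can be sketched as a pair of predicates over hypothetical states. None of these enum values exist in libpq; they are invented for this illustration:

```c
#include <assert.h>

/* Hypothetical asyncStatus values for this sketch; not in libpq. */
typedef enum
{
	FAKE_ASYNC_IDLE,
	FAKE_ASYNC_COPY_OUT,
	FAKE_ASYNC_COPY_OUT_FAILED
} FakeAsyncStatus;

/*
 * PQexecStart would treat a FAILED copy state like a normal one and do
 * what's needed to get out of copy mode with the backend.
 */
static int
fake_exec_start_must_exit_copy(FakeAsyncStatus s)
{
	return s == FAKE_ASYNC_COPY_OUT || s == FAKE_ASYNC_COPY_OUT_FAILED;
}

/*
 * PQgetResult, however, would *not* report a PGRES_COPY_XXX result for
 * the FAILED state; it would just return NULL ("give a new command").
 */
static int
fake_get_result_reports_copy(FakeAsyncStatus s)
{
	return s == FAKE_ASYNC_COPY_OUT;
}
```

The design point is that the two entry points consult the same state but answer different questions: the backend-facing one must still unwind the COPY, while the application-facing one must not re-surface it.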
regards, tom lane
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Aleksander Alekseev writes:
> +1, same here. Lets see whats committer's opinion.

I fooled around with this patch quite a bit but could not bring myself to pull the trigger, because I think it fundamentally breaks applications that follow the "repeat PQgetResult until NULL" rule. The only reason that psql manages not to fail is that it doesn't do that, it just calls PQexec; and you've hacked PQexecFinish so that it falls out without pumping PQgetResult till dry. But that's not a good solution, it's just a hack that makes the behavior unprincipled and unlike direct use of PQgetResult.

The key problem is that, assuming that the changes in getCopyStart() succeed in returning a PGRES_FATAL_ERROR PGresult, the application may just follow the rule of doing nothing with it unless it's the last non-null result from PQgetResult. And it won't be, because you've switched libpq's asyncStatus into one or another COPY status, which will cause PQgetResult to continually create and return PGRES_COPY_XXX results, which is what it's supposed to do in that situation (cf the last step in getCopyResult).

Now, this will accidentally fail to fail if PQgetResult's attempt to manufacture a PGRES_COPY_XXX result fails and returns null, which is certainly possible if we're up against an OOM situation. But what if it doesn't fail --- which is also possible, because a PGRES_COPY_XXX with no attached data will not need as much space as a PGRES_FATAL_ERROR with attached error message. The app probably throws away the PGRES_FATAL_ERROR and tries to use the PGRES_COPY_XXX result, which is almost okay, except that it will lack the copy format information, which will be fatal if the application is relying on that. So AFAICT, this patch kinda sorta works for psql but it is not going to make things better for other apps.

The other problem is that I don't have a lot of faith in the theory that getCopyStart is going to be able to make an error PGresult when it couldn't make a COPY PGresult. The existing message-receipt routines that this is modeled on are dealing with PGresults that are expected to grow large, so throwing away the accumulated PGresult is highly likely to free enough memory to let us allocate an error PGresult. Not so here.

I have to run, but the bottom line is I don't feel comfortable with this. It's trying to fix what's a very corner-case problem (since COPY PGresults aren't large) but it's introducing new corner case behaviors of its own.

regards, tom lane
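The failure mode described above — asyncStatus stuck in a COPY state, so PQgetResult keeps manufacturing copy results and the documented "loop until NULL" never terminates — can be simulated standalone. All names below are invented stand-ins for this sketch, not libpq's real code:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in states for this sketch only. */
typedef enum { FAKE_IDLE, FAKE_COPY_OUT } FakeAsyncStatus;

typedef struct
{
	FakeAsyncStatus asyncStatus;
} FakeConn;

/*
 * While asyncStatus says COPY, each call manufactures another copy
 * result (as PQgetResult is supposed to in that state), so a caller
 * looping "until NULL" never sees NULL.
 */
static const char *
fake_PQgetResult(FakeConn *conn)
{
	if (conn->asyncStatus == FAKE_COPY_OUT)
		return "PGRES_COPY_OUT";	/* never NULL in this state */
	return NULL;
}

/*
 * The documented client loop, with an iteration cap added so the
 * simulation itself terminates; returns how many results were pumped.
 */
static int
pump_until_null(FakeConn *conn, int cap)
{
	int			n = 0;

	while (n < cap && fake_PQgetResult(conn) != NULL)
		n++;
	return n;
}
```

Hitting the cap stands in for the real loop never finishing, which is why leaving the connection in a COPY state after a reported error breaks well-behaved callers.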
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
> I think this patch needs to be looked upon by committer now. I have
> done review and added some code in this patch as well long back, just
> see the e-mail [1], patch is just same as it was in October 2015. I
> think myself and Michael are in agreement that this patch solves the
> reported problem. There is one similar problem [2] reported by Heikki
> which I think can be fixed separately.
> [...]
> Let me know if you think anything more from myside can help in moving
> patch.

+1, same here. Let's see what the committer's opinion is.

-- Best regards, Aleksander Alekseev
http://eax.me/
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 29, 2016 at 8:31 PM, David Steele wrote:
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>> Hope this brings some light in.
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.

I think this patch needs to be looked at by a committer now. I did a review and added some code to this patch as well long back, see the e-mail [1]; the patch is just the same as it was in October 2015. I think Michael and I are in agreement that this patch solves the reported problem. There is one similar problem [2] reported by Heikki which I think can be fixed separately.

I think the main reason for moving this thread to hackers from bugs was to gain some more traction, which I see has achieved its purpose to some extent, but I find that more or less we are in the same situation as we were back in October. Let me know if you think anything more from my side can help in moving the patch forward.

[1] - http://www.postgresql.org/message-id/cab7npqtnrw8lr1hd7zlnojjc1bp1evw_emadgorv+s-sbcr...@mail.gmail.com
[2] - http://www.postgresql.org/message-id/566ef84f.1030...@iki.fi

With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On 3/29/16 8:21 PM, Michael Paquier wrote:
> On Wed, Mar 30, 2016 at 12:01 AM, David Steele wrote:
>> Hi Amit,
>>
>> On 3/22/16 3:37 AM, Michael Paquier wrote:
>>> Hope this brings some light in.
>>
>> Do you know when you'll have time to respond to Michael's last email? I've
>> marked this patch "waiting on author" in the meantime.
>
> Shouldn't it be "needs review" instead? I am marked as the author of this
> patch.

Whoops! I got the author and reviewer backwards. Set back to "needs review".

-- -David
da...@pgmasters.net
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Wed, Mar 30, 2016 at 12:01 AM, David Steele wrote:
> Hi Amit,
>
> On 3/22/16 3:37 AM, Michael Paquier wrote:
>> Hope this brings some light in.
>
> Do you know when you'll have time to respond to Michael's last email? I've
> marked this patch "waiting on author" in the meantime.

Shouldn't it be "needs review" instead? I am marked as the author of this patch.

-- Michael
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Hi Amit, On 3/22/16 3:37 AM, Michael Paquier wrote: Hope this brings some light in. Do you know when you'll have time to respond to Michael's last email? I've marked this patch "waiting on author" in the meantime. Thanks, -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila wrote:
> On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane wrote:
>> Amit Kapila writes:
>>> On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote:
>>>> It is very difficult to believe that this is a good idea:
>>>>
>>>> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>>>> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>>>> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>>>>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>>>>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>>>>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>>>> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>>>>  PQstatus(streamConn) == CONNECTION_BAD)
>>>>  break;
>>>>
>>>> I mean, why would it be a good idea to blindly skip over fatal errors?
>>>
>>> I think it is not about skipping the FATAL error, rather to stop trying to
>>> get further results on FATAL error.
>>
>> If the code already includes "lost the connection" as a case to break on,
>> I'm not quite sure why "got a query error" is not.
>
> This error check is exactly same as PQexecFinish() and there some
> explanation is given in comments which hints towards the reason for
> continuing on error, basically both the functions PQexecFinish() and
> libpqrcv_PQexec() returns the last result if there are many and
> PQexecFinish() concatenates the errors as well in some cases. Do we see any
> use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a few hours looking at that, and I still fail to see how stacking errors would be an advantage. It reduces error message visibility for frontend clients, and it does not prevent clients from actually seeing error messages generated by a backend (take a WAL sender, for example). And depending on the message type received, we may end up simply ignoring those error messages and having libpq discard them. So it seems like an oversight of 03a571a4.
One could always say that this change breaks the case where multiple error messages are sent in a row to a client with COPY_* (IN, OUT and BOTH), because previously we would get back the last error message, while with this change we let the client know the first one, though that would mean the client is missing something when using COPY_BOTH. Also...

@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)
  result->resultStatus == PGRES_COPY_BOTH ||
  conn->status == CONNECTION_BAD)
  break;
+ else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+  conn->asyncStatus == PGASYNC_COPY_OUT ||
+  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+  result->resultStatus == PGRES_FATAL_ERROR)
+  break;

The reason behind this check in PQexecFinish is that we need to make the difference between an error where there is not enough data, which means that we want to retry the message fetching through parseInput3(), and the case where an actual OOM happened, where we want to exit immediately and let the caller know that there has been a frontend error. The other reason why we went on with PGRES_FATAL_ERROR here is that it was discussed that it would be overkill to assign a special PGASYNC status to detect a frontend-side failure, because we already treat other frontend-side errors with PGRES_FATAL_ERROR and assign an error message to them (see for example when an allocation fails for 'C', introduced in another patch of the OOM failure series). We still need to know that we are in PGASYNC_COPY_* mode in this code path as well, because it means that the start message has been processed but we had a failure in deparsing it, so we bypassed it and need to fail immediately. So using PGRES_FATAL_ERROR simplifies the code paths compared to creating an error stack.

Hope this brings some light in.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane wrote: > > Amit Kapila writes: > > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: > >> It is very difficult to believe that this is a good idea: > >> > >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c > >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) > >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || > >> PQresultStatus(lastResult) == PGRES_COPY_OUT || > >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || > >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || > >> PQstatus(streamConn) == CONNECTION_BAD) > >> break; > >> > >> I mean, why would it be a good idea to blindly skip over fatal errors? > > > I think it is not about skipping the FATAL error, rather to stop trying to > > get further results on FATAL error. > > If the code already includes "lost the connection" as a case to break on, > I'm not quite sure why "got a query error" is not. > This error check is exactly same as PQexecFinish() and there some explanation is given in comments which hints towards the reason for continuing on error, basically both the functions PQexecFinish() and libpqrcv_PQexec() returns the last result if there are many and PQexecFinish() concatenates the errors as well in some cases. Do we see any use in continuing to get result after getting PGRES_FATAL_ERROR error? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Amit Kapila writes: > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote: >> It is very difficult to believe that this is a good idea: >> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) >> if (PQresultStatus(lastResult) == PGRES_COPY_IN || >> PQresultStatus(lastResult) == PGRES_COPY_OUT || >> PQresultStatus(lastResult) == PGRES_COPY_BOTH || >> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || >> PQstatus(streamConn) == CONNECTION_BAD) >> break; >> >> I mean, why would it be a good idea to blindly skip over fatal errors? > I think it is not about skipping the FATAL error, rather to stop trying to > get further results on FATAL error. If the code already includes "lost the connection" as a case to break on, I'm not quite sure why "got a query error" is not. Maybe that PQstatus check is broken too, but it doesn't seem like this patch makes it more so. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas wrote:
> On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier wrote:
>> Thoughts? I have registered that in the CF app, and a patch is attached.
>
> It is very difficult to believe that this is a good idea:
>
> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>  if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>  PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>  PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
> +PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>  PQstatus(streamConn) == CONNECTION_BAD)
>  break;
> }
>
> I mean, why would it be a good idea to blindly skip over fatal errors?

I think it is not about skipping the FATAL error, rather to stop trying to get further results on a FATAL error. This is to ensure that the OOM error is reported rather than ignored. There has been discussion about this previously as well [1].

[1] - http://www.postgresql.org/message-id/CAB7nPqT6gKj6iS9VTPth_h6Sz5Jo-177s6QJN_jrW66wyCjJ=w...@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Tue, Mar 1, 2016 at 12:38 AM, Michael Paquier wrote: > Thoughts? I have registered that in the CF app, and a patch is attached. It is very difficult to believe that this is a good idea: --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query) if (PQresultStatus(lastResult) == PGRES_COPY_IN || PQresultStatus(lastResult) == PGRES_COPY_OUT || PQresultStatus(lastResult) == PGRES_COPY_BOTH || +PQresultStatus(lastResult) == PGRES_FATAL_ERROR || PQstatus(streamConn) == CONNECTION_BAD) break; } I mean, why would it be a good idea to blindly skip over fatal errors? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera wrote: > Aleksander Alekseev wrote: >> pg_receivexlog: could not send replication command "START_REPLICATION": >> out of memory pg_receivexlog: disconnected; waiting 5 seconds to try >> again pg_receivexlog: starting log streaming at 0/100 (timeline 1) >> >> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, >> msgLength=3) at fe-protocol3.c:1398 1398 const char >> *errmsg = NULL; >> ``` >> >> Granted this behaviour is a bit better then the current one. But >> basically it's the same infinite loop only with pauses and warnings. I >> wonder if this is a behaviour we really want. For instance wouldn't it >> be better just to terminate an application in out-of-memory case? "Let >> it crash" as Erlang programmers say. > > Hmm. It would be useful to retry in the case that there is a chance > that the program releases memory and can continue later. But if it will > only stay there doing nothing other than retrying, then that obviously > will not happen. One situation where this might help is if the overall > *system* is short on memory and we expect that situation to resolve > itself after a while -- after all, if the system is so loaded that it > can't allocate a few more bytes for the COPY message, then odds are that > other things are also crashing and eventually enough memory will be > released that pg_receivexlog can continue. Yep, that's my assumption regarding that, at some point the system may succeed, and I don't think that we should break the current behaviors of pg_receivexlog and pg_recvlogical regarding that in the back-branches. Now, note that without the patch we actually have the same problem. Say if OOMs happen continuously in getCopyStart, with COPY_BOTH libpq would attempt to read the next message continuously and would keep failing. Except that in this case the caller has no idea what is happening as things keep running in libpq itself. 
> On the other hand, if the system is so loaded, perhaps it's better to > "let it crash" and have it restart later -- presumably once the admin > notices the problem and restarts it manually after cleaning up the mess. > > If all programs are well behaved and nothing crashes when OOM but they > all retry instead, then everything will continue to retry infinitely and > make no progress. That cannot be good. That's something we could take care of in those client utilities I think with a new option like --maximum-retries or similar, but anyway I think that's a different discussion. The patch I am proposing here allows a client application to be made aware of OOM errors that happen. If we don't do something about that first, something like --maximum-retries would be useless for COPY_BOTH as the client will never be made aware of the OOM that happened in libpq and would keep looping inside libpq itself until some memory is freed. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Aleksander Alekseev wrote: > pg_receivexlog: could not send replication command "START_REPLICATION": > out of memory pg_receivexlog: disconnected; waiting 5 seconds to try > again pg_receivexlog: starting log streaming at 0/100 (timeline 1) > > Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, > msgLength=3) at fe-protocol3.c:1398 1398 const char > *errmsg = NULL; > ``` > > Granted this behaviour is a bit better then the current one. But > basically it's the same infinite loop only with pauses and warnings. I > wonder if this is a behaviour we really want. For instance wouldn't it > be better just to terminate an application in out-of-memory case? "Let > it crash" as Erlang programmers say. Hmm. It would be useful to retry in the case that there is a chance that the program releases memory and can continue later. But if it will only stay there doing nothing other than retrying, then that obviously will not happen. One situation where this might help is if the overall *system* is short on memory and we expect that situation to resolve itself after a while -- after all, if the system is so loaded that it can't allocate a few more bytes for the COPY message, then odds are that other things are also crashing and eventually enough memory will be released that pg_receivexlog can continue. On the other hand, if the system is so loaded, perhaps it's better to "let it crash" and have it restart later -- presumably once the admin notices the problem and restarts it manually after cleaning up the mess. If all programs are well behaved and nothing crashes when OOM but they all retry instead, then everything will continue to retry infinitely and make no progress. That cannot be good. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Hello, Michael

Thanks a lot for the reproduction steps you provided. I tested your patch on Ubuntu Linux 14.04 LTS (GCC 4.8.4) and FreeBSD 10.2 RELEASE (Clang 3.4.1). In both cases the patch applies cleanly, there are no warnings during compilation, and all regression tests pass. A few files are not properly pgindent'ed though:

```
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index c99f193..2769719 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -2031,7 +2031,7 @@ PQexecFinish(PGconn *conn)
  conn->status == CONNECTION_BAD)
  break;
  else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
-  conn->asyncStatus == PGASYNC_COPY_OUT ||
+  conn->asyncStatus == PGASYNC_COPY_OUT ||
  conn->asyncStatus == PGASYNC_COPY_BOTH) &&
  result->resultStatus == PGRES_FATAL_ERROR)
  break;
diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c
index 21a1d9b..280ca16 100644
--- a/src/interfaces/libpq/fe-protocol3.c
+++ b/src/interfaces/libpq/fe-protocol3.c
@@ -49,9 +49,9 @@ static int getParamDescriptions(PGconn *conn, int msgLength);
 static int getAnotherTuple(PGconn *conn, int msgLength);
 static int getParameterStatus(PGconn *conn);
 static int getNotify(PGconn *conn);
-static int getCopyStart(PGconn *conn,
- ExecStatusType copytype,
- int msgLength);
+static int getCopyStart(PGconn *conn,
+ ExecStatusType copytype,
+ int msgLength);
 static int getReadyForQuery(PGconn *conn);
 static void reportErrorPosition(PQExpBuffer msg, const char *query,
 int loc, int encoding);
```

Indeed your patch solves the issues you described. Still, here is something that concerns me:

```
$ gdb --args pg_receivexlog --verbose -D ./temp_data/
(gdb) b getCopyStart
Function "getCopyStart" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (getCopyStart) pending.
(gdb) r
Starting program: /usr/local/pgsql/bin/pg_receivexlog --verbose -D ./temp_data/
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610220, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398        const char *errmsg = NULL;
(gdb) n
1400        result = PQmakeEmptyPGresult(conn, copytype);
(gdb)
1401        if (!result)
(gdb) p result = 0
$1 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
pg_receivexlog: disconnected; waiting 5 seconds to try again
pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398        const char *errmsg = NULL;
(gdb) n
1400        result = PQmakeEmptyPGresult(conn, copytype);
(gdb) n
1401        if (!result)
(gdb) p result = 0
$2 = (PGresult *) 0x0
(gdb) c
Continuing.
pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
pg_receivexlog: disconnected; waiting 5 seconds to try again
pg_receivexlog: starting log streaming at 0/100 (timeline 1)

Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
1398        const char *errmsg = NULL;
```

Granted this behaviour is a bit better than the current one. But basically it's the same infinite loop, only with pauses and warnings. I wonder if this is the behaviour we really want. For instance, wouldn't it be better just to terminate the application in the out-of-memory case? "Let it crash", as Erlang programmers say.

Best regards,
Aleksander

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Fri, Mar 4, 2016 at 12:59 AM, Aleksander Alekseev wrote:
>> The easiest way to perform tests with this patch is to take a debugger
>> and enforce the malloc'd pointers to NULL in the code paths.
>
> I see. Still I don't think it's an excuse to not provide clear steps to
> reproduce an issue. As I see it anyone should be able to easily check
> your patch locally without having deep understanding of how libpq is
> implemented or reading thread which contains 48 e-mails.

OK, here they are if that helps. Following those steps and having some knowledge of lldb or gdb is enough. The key point is that getCopyStart is the routine to breakpoint and make fail.

A) psql and COPY.
1) gdb --args psql -c 'COPY (SELECT 1) TO stdout'
2) Then take a breakpoint at getCopyStart()
3) run
4) Enforce the return result of PQmakeEmptyPGresult to NULL: p result = 0
5) Enjoy the infinite loop with HEAD, and the error with the patch

B) pg_receivexlog:
1) mkdir data
   gdb --args pg_receivexlog --verbose -D data/
2) Take a breakpoint at getCopyStart
3) run
4) enforce result = 0 after the call to PQmakeEmptyPGresult
5) And enjoy what has been reported

C) pg_recvlogical, similar to B.
1) Create a replication slot on a server that accepts them:
   select pg_create_logical_replication_slot('foo', 'test_decoding');
2) Same breakpoint as B) with this start command or similar (depends on where the logical slot has been created):
   pg_recvlogical --start --slot foo -f - -d postgres

D) Triggering the standby problem:
1) Patch upstream as follows by adding a sleep in libpqrcv_startstreaming of libpqwalreceiver.c. This is a simple method to have the time to take a breakpoint on the WAL receiver process that has been started, with streaming that has not begun yet.
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index f670957..c7fccf1 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -188,6 +188,9 @@ libpqrcv_startstreaming(TimeLineID tli, XLogRecPtr startpoint, char *slotname)
 char cmd[256];
 PGresult *res;
+
+ /* ZZZzzz... */
+ pg_usleep(1L);
 /* Start streaming from the point requested by startup process */
 if (slotname != NULL)
 snprintf(cmd, sizeof(cmd),

Trying to design a test taking into account the context of a WAL receiver does not seem worth it to me with a simple method like this one...

2) Start the standby and attach a debugger to the WAL receiver process, send SIGSTOP to it, whatever...
3) breakpoint at getCopyStart
4) Then move on with the same process as previously, and enjoy the errors.

-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Hello, Michael

> The easiest way to perform tests with this patch is to take a debugger
> and enforce the malloc'd pointers to NULL in the code paths.

I see. Still, I don't think that's an excuse not to provide clear steps to reproduce the issue. As I see it, anyone should be able to easily check your patch locally without having a deep understanding of how libpq is implemented or reading a thread which contains 48 e-mails. For instance you could provide a step-by-step instruction like:

1. run gdb --args some-prog arg1 arg2 ...
2. b some_file.c:123
3. c
4. ...
5. we expect ... but actually see ...

Or you could automate these steps using a gdb batch file:

gdb --batch --command=gdb.script --args some-prog arg1 arg2 ...

where gdb.script is:

b some_file.c:123
c
... etc ...

Naturally all of this is optional. But it would simplify both understanding and reviewing of this patch, increasing the chances that the patch will be accepted and significantly reducing the time required.

Best regards,
Aleksander

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
On Thu, Mar 3, 2016 at 10:18 PM, Aleksander Alekseev wrote:
> I didn't check your patch in detail yet but here is a thought I would
> like to share.
>
> In my experience it usually takes a number of rewrites before a patch is
> accepted. To make sure that after every rewrite your patch still
> solves the issue you described you should probably provide a few test
> programs too. If it's possible to make these programs part of the
> regression test suite it would be just great.

Yes, that's usually the case; this patch is not at its first version.

> Without such test programs I personally find it difficult to verify
> that your patch fixes something. If such examples are provided your
> patch would be much more likely to be accepted. "See, this program crashes
> everything. Now we apply a patch and everything works! Cool, heh?"

The easiest way to perform tests with this patch is to take a debugger and enforce the malloc'd pointers to NULL in the code paths. That's wild but efficient, and that's actually what we did for the other two libpq patches treating those OOM failures. In the case of the standby code path, what I did was to put a sleep in the WAL receiver code and then trigger the OOM manually, leading to the errors listed upthread.

Regarding automated tests, it is largely feasible to have tests on Linux at least by using for example LD_PRELOAD or glibc-specific malloc hooks, though those would be platform-restricted. Here is for example one for COPY IN/OUT, which this patch also passed (there has already been some discussion for this patch):
http://www.postgresql.org/message-id/55fc501a.5000...@iki.fi
Applying it for recovery is doable as well by manipulating the recovery protocol with libpq, though it is really aimed at covering a narrow case.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()
Hello, Michael

I didn't check your patch in detail yet but here is a thought I would like to share.

In my experience it usually takes a number of rewrites before a patch is accepted. To make sure that after every rewrite your patch still solves the issue you described, you should probably provide a few test programs too. If it's possible to make these programs part of the regression test suite, it would be just great.

Without such test programs I personally find it difficult to verify that your patch fixes something. If such examples are provided, your patch will be much more likely to be accepted. "See, this program crashes everything. Now we apply a patch and everything works! Cool, heh?"

Best regards,
Aleksander

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers