With apologies for resuscitating an old thread, I came across this issue last week, a 15-minute hang while an amazon RDS instance was doing a failover (we are on SQLA 1.3.22). I need to do some more work on repro-ing the issue but:
This post on SO from last month https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism mentions that PGBouncer had a similar issue (linked). PGBouncer decided it was to do with low-level TCP settings. they fixed it by adjusting the TCP_USER_TIMEOUT setting. that can be done on a per-connection basis apparently. Would it be worth investigating and adding to SQLA? hp On Monday, 8 February 2016 at 00:47:40 UTC Josh wrote: > Forgot to send an update myself. We actually figured it out: I found that > it was due to SQLA not properly detecting a failed connection - we get "SSL > SYSCALL error" and the connection pool doesn't reopen + stays borked. Looks > like SQLA has a few hard-coded string checks that it looks for and SSL > SYCALL was a newer one. > > 0.9.5 and 0.9.8 fixed a bunch of related issues: > 0.9.5: Added a new "disconnect" message connection has been closed > unexpectedly" > 0.9.5: The psycopg2 .closed accessor is now consulted when determining > if an > exception is a "disconnect" error > 0.9.8: A revisit to this issue first patched in 0.9.5, apparently > psycopg2’s > .closed accessor is not as reliable as we assumed, so we have added > an explicit > check for the exception messages "SSL SYSCALL error: Bad file > descriptor" and > "SSL SYSCALL error: EOF detected" when detecting an is-disconnect > scenario > > So upgrade to 0.9.8+ solved our issues. When a failover happens: > > 1) primary stops accepting requests, triggers SSL SYSCALL errors > 2) connection pool invalidates the connection, opens a new one > 3) dns has 5s ttl, so by then it points to new primary (if not, error + > repeat) > 4) new connection is established, and we've recovered > > So there's a few errors, which seems like you can solve it using > pessimistic disconnect handling, but at this point it seems reasonable > enough for us. > > - Josh > > On Saturday, February 6, 2016 at 10:22:53 AM UTC-8, Jonas Oscarsson wrote: >> >> Hi Josh, >> >> Sorry for the late reply. >> >> Unfortunately we haven't had time/prioritized to look into it further. >> For the very few times when we've had to use the failover functionality, it >> has recovered quickly enough (within two minutes or so). >> >> Best, >> Jonas >> >> On Mon, Dec 21, 2015 at 9:19 PM Joshua Ma <jo...@benchling.com> wrote: >> >>> Hi Jonas, >>> >>> Just wondering if you were able to characterize the behavior and had any >>> ideas for gracefully handling RDS failover. From what I'm seeing, RDS will >>> not cause errors, but will simply hang, so SQLA doesn't hit its "optimistic >>> disconnect handling" ( >>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-optimistic) >>> >>> handler. >>> >>> Have you found a simple method of identifying failover but not a slow >>> query? Would be super useful for us and I'm guessing other folks using SQLA >>> + RDS. (We're also using Postgres.) >>> >>> - Josh >>> >>> On Saturday, October 18, 2014 at 5:42:32 AM UTC-7, Jonas Oscarsson wrote: >>> >>>> Thanks for the tip, we haven't tried that yet although I think we would >>>> have the same behavior as it seems to be the next connection after the >>>> database is down that hangs (no matter if it's a SELECT 1 or an actual >>>> query). >>>> >>> On Fri, Oct 17, 2014 at 5:38 PM, Jonathan Vanasco <jvan...@gmail.com> >>>> wrote: >>>> >>> Have you tried a Pessimistic Disconnect strategy? >>>>> >>>>> >>>>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#dealing-with-disconnects >>>>> >>>>> >>>>> -- >>>>> You received this message because you are subscribed to a topic in the >>>>> Google Groups "sqlalchemy" group. >>>>> To unsubscribe from this topic, visit >>>>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe. >>>>> >>>> To unsubscribe from this group and all its topics, send an email to >>>>> sqlalchemy+...@googlegroups.com. >>>>> To post to this group, send email to sqlal...@googlegroups.com. >>>>> >>>> >>>>> Visit this group at http://groups.google.com/group/sqlalchemy. >>>>> For more options, visit https://groups.google.com/d/optout. >>>>> >>>> -- >>> You received this message because you are subscribed to a topic in the >>> Google Groups "sqlalchemy" group. >>> To unsubscribe from this topic, visit >>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe. >>> To unsubscribe from this group and all its topics, send an email to >>> sqlalchemy+...@googlegroups.com. >>> To post to this group, send email to sqlal...@googlegroups.com. >>> Visit this group at https://groups.google.com/group/sqlalchemy. >>> For more options, visit https://groups.google.com/d/optout. >>> >> -- SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper http://www.sqlalchemy.org/ To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example. See http://stackoverflow.com/help/mcve for a full description. --- You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/sqlalchemy/75b4bed5-bbbc-476f-9635-e270780f010cn%40googlegroups.com.