Re: [sqlalchemy] Re: Multi-AZ RDS on AWS, failover and SQLAlchemy

Harry Wed, 20 Jan 2021 02:42:41 -0800

With apologies for resuscitating an old thread, I came across this issue 
last week, a 15-minute hang while an amazon RDS instance was doing a 
failover (we are on SQLA 1.3.22).  I need to do some more work on repro-ing 
the issue but:


This post on SO from last month 
https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism
 
mentions that PGBouncer had a similar issue (linked).
PGBouncer decided it was to do with low-level TCP settings.  they fixed it 
by adjusting the TCP_USER_TIMEOUT setting.  that can be done on a 
per-connection basis apparently. 

Would it be worth investigating and adding to SQLA?

hp

On Monday, 8 February 2016 at 00:47:40 UTC Josh wrote:

> Forgot to send an update myself. We actually figured it out: I found that 
> it was due to SQLA not properly detecting a failed connection - we get "SSL 
> SYSCALL error" and the connection pool doesn't reopen + stays borked. Looks 
> like SQLA has a few hard-coded string checks that it looks for and SSL 
> SYCALL was a newer one.
>
> 0.9.5 and 0.9.8 fixed a bunch of related issues:
>     0.9.5: Added a new "disconnect" message connection has been closed 
> unexpectedly"
>     0.9.5: The psycopg2 .closed accessor is now consulted when determining 
> if an
>       exception is a "disconnect" error
>     0.9.8: A revisit to this issue first patched in 0.9.5, apparently 
> psycopg2’s
>       .closed accessor is not as reliable as we assumed, so we have added 
> an explicit
>       check for the exception messages "SSL SYSCALL error: Bad file 
> descriptor" and
>       "SSL SYSCALL error: EOF detected" when detecting an is-disconnect 
> scenario
>
> So upgrade to 0.9.8+ solved our issues. When a failover happens:
>
> 1) primary stops accepting requests, triggers SSL SYSCALL errors
> 2) connection pool invalidates the connection, opens a new one
> 3) dns has 5s ttl, so by then it points to new primary (if not, error + 
> repeat)
> 4) new connection is established, and we've recovered
>
> So there's a few errors, which seems like you can solve it using 
> pessimistic disconnect handling, but at this point it seems reasonable 
> enough for us.
>
> - Josh
>
> On Saturday, February 6, 2016 at 10:22:53 AM UTC-8, Jonas Oscarsson wrote:
>>
>> Hi Josh,
>>
>> Sorry for the late reply.
>>
>> Unfortunately we haven't had time/prioritized to look into it further. 
>> For the very few times when we've had to use the failover functionality, it 
>> has recovered quickly enough (within two minutes or so).
>>
>> Best,
>> Jonas
>>
>> On Mon, Dec 21, 2015 at 9:19 PM Joshua Ma <jo...@benchling.com> wrote:
>>
>>> Hi Jonas,
>>>
>>> Just wondering if you were able to characterize the behavior and had any 
>>> ideas for gracefully handling RDS failover. From what I'm seeing, RDS will 
>>> not cause errors, but will simply hang, so SQLA doesn't hit its "optimistic 
>>> disconnect handling" (
>>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-optimistic)
>>>  
>>> handler.
>>>
>>> Have you found a simple method of identifying failover but not a slow 
>>> query? Would be super useful for us and I'm guessing other folks using SQLA 
>>> + RDS. (We're also using Postgres.)
>>>
>>> - Josh
>>>
>>> On Saturday, October 18, 2014 at 5:42:32 AM UTC-7, Jonas Oscarsson wrote:
>>>
>>>> Thanks for the tip, we haven't tried that yet although I think we would 
>>>> have the same behavior as it seems to be the next connection after the 
>>>> database is down that hangs (no matter if it's a SELECT 1 or an actual 
>>>> query).
>>>>
>>> On Fri, Oct 17, 2014 at 5:38 PM, Jonathan Vanasco <jvan...@gmail.com> 
>>>> wrote:
>>>>
>>> Have you tried a Pessimistic Disconnect strategy?
>>>>>
>>>>>
>>>>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#dealing-with-disconnects
>>>>>
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to a topic in the 
>>>>> Google Groups "sqlalchemy" group.
>>>>> To unsubscribe from this topic, visit 
>>>>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe.
>>>>>
>>>> To unsubscribe from this group and all its topics, send an email to 
>>>>> sqlalchemy+...@googlegroups.com.
>>>>> To post to this group, send email to sqlal...@googlegroups.com.
>>>>>
>>>>
>>>>> Visit this group at http://groups.google.com/group/sqlalchemy.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>> -- 
>>> You received this message because you are subscribed to a topic in the 
>>> Google Groups "sqlalchemy" group.
>>> To unsubscribe from this topic, visit 
>>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to 
>>> sqlalchemy+...@googlegroups.com.
>>> To post to this group, send email to sqlal...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/sqlalchemy.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sqlalchemy/75b4bed5-bbbc-476f-9635-e270780f010cn%40googlegroups.com.

Re: [sqlalchemy] Re: Multi-AZ RDS on AWS, failover and SQLAlchemy

Reply via email to