Re: [sqlalchemy] Re: Multi-AZ RDS on AWS, failover and SQLAlchemy

Harry Wed, 20 Jan 2021 03:05:09 -0800

setting socket options like this might be more of a psycopg2 thing?  some 
more investigations here: 
https://stackoverflow.com/questions/26741175/psycopg2-db-connection-hangs-on-lost-network-connection


On Wednesday, 20 January 2021 at 10:42:35 UTC Harry wrote:

> With apologies for resuscitating an old thread, I came across this issue 
> last week, a 15-minute hang while an amazon RDS instance was doing a 
> failover (we are on SQLA 1.3.22).  I need to do some more work on repro-ing 
> the issue but:
>
> This post on SO from last month 
> https://stackoverflow.com/questions/65505031/sqlalchemy-ssl-syscall-timeout-coping-mechanism
>  
> mentions that PGBouncer had a similar issue (linked).
> PGBouncer decided it was to do with low-level TCP settings.  they fixed it 
> by adjusting the TCP_USER_TIMEOUT setting.  that can be done on a 
> per-connection basis apparently. 
>
> Would it be worth investigating and adding to SQLA?
>
> hp
>
> On Monday, 8 February 2016 at 00:47:40 UTC Josh wrote:
>
>> Forgot to send an update myself. We actually figured it out: I found that 
>> it was due to SQLA not properly detecting a failed connection - we get "SSL 
>> SYSCALL error" and the connection pool doesn't reopen + stays borked. Looks 
>> like SQLA has a few hard-coded string checks that it looks for and SSL 
>> SYCALL was a newer one.
>>
>> 0.9.5 and 0.9.8 fixed a bunch of related issues:
>>     0.9.5: Added a new "disconnect" message connection has been closed 
>> unexpectedly"
>>     0.9.5: The psycopg2 .closed accessor is now consulted when 
>> determining if an
>>       exception is a "disconnect" error
>>     0.9.8: A revisit to this issue first patched in 0.9.5, apparently 
>> psycopg2’s
>>       .closed accessor is not as reliable as we assumed, so we have added 
>> an explicit
>>       check for the exception messages "SSL SYSCALL error: Bad file 
>> descriptor" and
>>       "SSL SYSCALL error: EOF detected" when detecting an is-disconnect 
>> scenario
>>
>> So upgrade to 0.9.8+ solved our issues. When a failover happens:
>>
>> 1) primary stops accepting requests, triggers SSL SYSCALL errors
>> 2) connection pool invalidates the connection, opens a new one
>> 3) dns has 5s ttl, so by then it points to new primary (if not, error + 
>> repeat)
>> 4) new connection is established, and we've recovered
>>
>> So there's a few errors, which seems like you can solve it using 
>> pessimistic disconnect handling, but at this point it seems reasonable 
>> enough for us.
>>
>> - Josh
>>
>> On Saturday, February 6, 2016 at 10:22:53 AM UTC-8, Jonas Oscarsson wrote:
>>>
>>> Hi Josh,
>>>
>>> Sorry for the late reply.
>>>
>>> Unfortunately we haven't had time/prioritized to look into it further. 
>>> For the very few times when we've had to use the failover functionality, it 
>>> has recovered quickly enough (within two minutes or so).
>>>
>>> Best,
>>> Jonas
>>>
>>> On Mon, Dec 21, 2015 at 9:19 PM Joshua Ma <[email protected]> wrote:
>>>
>>>> Hi Jonas,
>>>>
>>>> Just wondering if you were able to characterize the behavior and had 
>>>> any ideas for gracefully handling RDS failover. From what I'm seeing, RDS 
>>>> will not cause errors, but will simply hang, so SQLA doesn't hit its 
>>>> "optimistic disconnect handling" (
>>>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-optimistic)
>>>>  
>>>> handler.
>>>>
>>>> Have you found a simple method of identifying failover but not a slow 
>>>> query? Would be super useful for us and I'm guessing other folks using 
>>>> SQLA 
>>>> + RDS. (We're also using Postgres.)
>>>>
>>>> - Josh
>>>>
>>>> On Saturday, October 18, 2014 at 5:42:32 AM UTC-7, Jonas Oscarsson 
>>>> wrote:
>>>>
>>>>> Thanks for the tip, we haven't tried that yet although I think we 
>>>>> would have the same behavior as it seems to be the next connection after 
>>>>> the database is down that hangs (no matter if it's a SELECT 1 or an 
>>>>> actual query).
>>>>>
>>>> On Fri, Oct 17, 2014 at 5:38 PM, Jonathan Vanasco <[email protected]> 
>>>>> wrote:
>>>>>
>>>> Have you tried a Pessimistic Disconnect strategy?
>>>>>>
>>>>>>
>>>>>> http://docs.sqlalchemy.org/en/latest/core/pooling.html#dealing-with-disconnects
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to a topic in 
>>>>>> the Google Groups "sqlalchemy" group.
>>>>>> To unsubscribe from this topic, visit 
>>>>>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe.
>>>>>>
>>>>> To unsubscribe from this group and all its topics, send an email to 
>>>>>> [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>>
>>>>>
>>>>>> Visit this group at http://groups.google.com/group/sqlalchemy.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>> -- 
>>>> You received this message because you are subscribed to a topic in the 
>>>> Google Groups "sqlalchemy" group.
>>>> To unsubscribe from this topic, visit 
>>>> https://groups.google.com/d/topic/sqlalchemy/NANPH1QWgMM/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to 
>>>> [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/sqlalchemy.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sqlalchemy/8d91a805-cdb1-49b1-9e09-bf7d5f9b4562n%40googlegroups.com.

Re: [sqlalchemy] Re: Multi-AZ RDS on AWS, failover and SQLAlchemy

Reply via email to