On 11/13/2012 2:06 PM, Mark Stosberg wrote:
>
>
> On 11/12/2012 04:15 PM, Andrew Sullivan wrote:
>> On Mon, Nov 12, 2012 at 04:05:37PM -0500, Christopher Browne wrote:
>>> Is it possible that your libc is doing caching?  If *that's* the
>>> case, and I'm suspicious of it, then you really would need to restart
>>> the slon to get "unstuck" from what libc has cached on you...
>>
>> More likely, the system resolver is pointing at a caching name server.
>
> Thanks for the prompt answers, everyone.
>
> I'll pursue this as an issue with our use of Ubuntu on Amazon/EC2.

I don't think that restarting the slon manually is the right solution to 
this problem.

Whatever really causes it (probably libc) doesn't matter. Changing DNS
entries to redirect services to a different IP address is a common step
in disaster recovery (failover) procedures. We should not require an
outside process or, worse, an admin to restart things in a case like that.

I don't know exactly what the right answer is. From behind libpq it is
rather difficult to tell what exactly went wrong when trying to connect
to a database. A generic answer would be to make only X attempts to
(re)connect, then restart the entire slon process. But that has
pitfalls. If someone has shut down a replica for maintenance, all slons
would restart constantly during that interval, which would keep
longer-running tasks like subscriptions from completing. So this would
have to go hand in hand with some new configuration option(s), like
"ignore connection failures to this node WRT restarting".
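
To make the trade-off concrete, here is a minimal sketch of the "X
attempts, then restart" idea combined with a per-node exemption. It is
Python for brevity only (slon itself is written in C), and every name in
it (ReconnectPolicy, max_attempts, ignore_nodes) is made up for
illustration; none of them is an existing Slony option or API.

```python
from dataclasses import dataclass, field


@dataclass
class ReconnectPolicy:
    """Hypothetical bookkeeping for connection failures per node."""

    max_attempts: int = 5  # the "X" above; an arbitrary default
    ignore_nodes: set = field(default_factory=set)  # nodes exempt from restart
    failures: dict = field(default_factory=dict)  # node id -> consecutive failures

    def record_failure(self, node_id):
        """Count a failed connect; return True if the process should restart."""
        self.failures[node_id] = self.failures.get(node_id, 0) + 1
        if node_id in self.ignore_nodes:
            # Node is marked "ignore connection failures WRT restarting",
            # e.g. a replica that is down for maintenance.
            return False
        return self.failures[node_id] >= self.max_attempts

    def record_success(self, node_id):
        """A successful connect resets the consecutive-failure count."""
        self.failures.pop(node_id, None)


policy = ReconnectPolicy(max_attempts=3, ignore_nodes={2})
restart_flags = [policy.record_failure(1) for _ in range(3)]
```

In this sketch the third consecutive failure to reach node 1 asks for a
restart, while node 2 can stay unreachable indefinitely without
triggering one. Whether the exemption should be static configuration or
something an admin toggles at runtime is exactly the kind of question
where user input would help.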

Some input from the user community would certainly help.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

