On 11/13/2012 2:06 PM, Mark Stosberg wrote:
>
>
> On 11/12/2012 04:15 PM, Andrew Sullivan wrote:
>> On Mon, Nov 12, 2012 at 04:05:37PM -0500, Christopher Browne wrote:
> >>> Is it possible that your libc is doing caching? If *that's* the
>>> case, and I'm suspicious of it, then you really would need to restart
>>> the slon to get "unstuck" from what libc has cached on you...
>>
>> More likely, the system resolver is pointing at a caching name server.
>
> Thanks for prompt answers, everyone.
>
> I'll pursue this as an issue with our use of Ubuntu on Amazon/EC2.

I don't think that restarting the slon manually is the right solution to
this problem.

Whatever really causes it (probably libc) doesn't matter. Changing DNS
entries to redirect services to a different IP address is a common part
of disaster recovery (failover) procedures. We should not require an
outside process, or worse, an admin, to restart things in a case like that.

I don't know exactly what the right answer is. From behind libpq it is
rather difficult to tell what went wrong when trying to connect
to a database. A generic answer would be to make only X attempts
to (re)connect, then restart the entire slon process. But that has
pitfalls. If someone has shut down a replica for maintenance,
all slons would restart constantly during that interval, which would
prevent longer-running tasks like subscriptions from completing. So this
would have to go hand in hand with some new configuration option(s),
like "ignore connection failures to this node WRT restarting".
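
To make the idea concrete, here is a rough sketch of the "X attempts, then
restart, unless the node is exempt" policy. It is written in Python for
brevity rather than slon's C, it is not existing Slony code, and every name
in it (NodePolicy, ReconnectTracker, ignore_failures_for_restart,
max_attempts) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodePolicy:
    # Hypothetical per-node option: "ignore connection failures to this
    # node WRT restarting" (e.g. a replica shut down for maintenance).
    ignore_failures_for_restart: bool = False


@dataclass
class ReconnectTracker:
    max_attempts: int  # the "X" from the paragraph above (assumed setting)
    policies: Dict[int, NodePolicy] = field(default_factory=dict)
    failures: Dict[int, int] = field(default_factory=dict)

    def record_failure(self, node_id: int) -> bool:
        """Count a failed connect; return True if the process should restart."""
        self.failures[node_id] = self.failures.get(node_id, 0) + 1
        policy = self.policies.get(node_id, NodePolicy())
        if policy.ignore_failures_for_restart:
            # Exempt node: keep retrying forever, never trigger a restart.
            return False
        return self.failures[node_id] >= self.max_attempts

    def record_success(self, node_id: int) -> None:
        """A successful connect resets the failure counter for that node."""
        self.failures.pop(node_id, None)


# Node 2 is down for maintenance and marked as exempt.
tracker = ReconnectTracker(
    max_attempts=3,
    policies={2: NodePolicy(ignore_failures_for_restart=True)},
)
```

With this shape, repeated failures against node 2 never ask for a restart,
while the third consecutive failure against any other node does. Whether the
exemption should be per node, per path, or time-limited is exactly the kind
of question I'd like input on.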

Some input from the user community would certainly help.

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
_______________________________________________
Slony1-general mailing list
http://lists.slony.info/mailman/listinfo/slony1-general