Re: [Slony1-general] Slow replication issue

Jan Wieck Thu, 29 May 2008 12:27:45 -0700

On 5/27/2008 10:26 AM, Vivek Khera wrote:

On May 23, 2008, at 2:05 PM, Jan Wieck wrote:
The slony connections are all regular libpq database connections. So you might test this by using psql or pg_dump running on one of those subscribers, connecting to the appropriate data provider. If that can utilize more bandwidth, then the problem lies within the replica itself and something else must be limiting it from reading from the network faster.
Every time I dig into why our replication is lagging severely (more than 10-15 minutes) I find that I'm spending a lot of time inside "FETCH 100" queries, and then a lot of time between them as well (i.e., applying the updates to the replica). It feels as if pg isn't running at full speed, but I don't have the numbers to prove it.

Which is surprising, because the entire (complex, I might add) architecture of helper threads doing the fetch, placing result rows into buffers and shoveling them towards the remote_worker thread was supposed to reduce those times, where the remote_worker is "waiting" for rows instead of full-bore applying changes.

One thing that comes to mind would be if the slon is actually running on the same box as any of the involved databases. In that case I can think of a scenario where the remote helper is buffering quite well ahead ... and since the local DB is heavily under fire the OS thinks it's a good idea to page those buffers out. To make better educated guesses we probably need a few more DEBUG points where the remote worker and helpers issue messages when the internal buffer exceeds some high water mark, or when (after exceeding the high mark) it falls back below some low water mark. That would help us very much to fine-tune the actual amount of buffers and the fetch size (both of which should be config options).
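To illustrate the kind of high/low water mark messages meant here, a minimal sketch in Python (not slon's actual C code). The class name, the thresholds and the message wording are all assumptions made up for the illustration; the only point is that a message is issued once on crossing the high mark and once on falling back below the low mark, not on every buffer change.

```python
class WatermarkMonitor:
    """Hypothetical helper: report buffer fill level with hysteresis."""

    def __init__(self, low, high):
        if low >= high:
            raise ValueError("low water mark must be below high water mark")
        self.low = low
        self.high = high
        self.above_high = False  # True once the high mark has been exceeded

    def observe(self, buffered_rows):
        """Return a debug message on a state change, otherwise None."""
        if not self.above_high and buffered_rows > self.high:
            self.above_high = True
            return f"DEBUG: buffer exceeded high water mark ({buffered_rows} > {self.high})"
        if self.above_high and buffered_rows < self.low:
            self.above_high = False
            return f"DEBUG: buffer fell below low water mark ({buffered_rows} < {self.low})"
        return None


# Assumed sample of buffer sizes seen by a helper over time.
monitor = WatermarkMonitor(low=100, high=1000)
messages = [m for m in map(monitor.observe, [50, 1200, 1500, 500, 90, 80]) if m]
for line in messages:
    print(line)
```

With two thresholds instead of one, a buffer hovering around a single limit does not flood the log, which matters if these messages are to be left enabled on a busy node.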

Another thing to look for is the dreaded "delay for first row". That is the time it takes the data provider from when the subscriber asks for the current SYNC's log rows until the data provider actually returns the first FETCH chunk. That is definitely time that the subscriber is doing nothing but twiddling thumbs. Unfortunately, it is going to add a lot more complexity in the worker/helper architecture to cure that problem. But I would like to hear what typical ratios of "delay for first row" / "time for entire sync" are out in the field. Because if that delay accounts for a large portion of the entire sync processing, we had better look into improving that part, however complex the solution to it might be. Again, it would help to have some better logging here that simply states the percentage of time spent waiting during the sync processing. That would be the sum of all delays caused when the worker is waiting for log rows.
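The two numbers asked for above are simple to derive once the waits are recorded. A rough sketch in Python, assuming the worker can record how long it sat idle before each FETCH chunk arrived; the function name, its arguments and the sample timings are hypothetical and do not correspond to anything slon logs today.

```python
def sync_wait_report(wait_seconds, sync_seconds):
    """Summarize worker idle time for one SYNC.

    wait_seconds: idle time before each FETCH chunk, in arrival order;
                  the first entry is taken as the "delay for first row".
    sync_seconds: wall-clock time for the entire SYNC.
    """
    if sync_seconds <= 0:
        raise ValueError("sync_seconds must be positive")
    first_row_delay = wait_seconds[0] if wait_seconds else 0.0
    total_wait = sum(wait_seconds)
    return {
        # "delay for first row" / "time for entire sync"
        "first_row_ratio": first_row_delay / sync_seconds,
        # share of the SYNC the worker spent waiting for log rows
        "wait_percent": 100.0 * total_wait / sync_seconds,
    }


# Made-up timings: 2.0 s until the first chunk, then two 0.5 s waits,
# in a SYNC that took 10 s overall.
report = sync_wait_report([2.0, 0.5, 0.5], sync_seconds=10.0)
print(report)
```

A high first_row_ratio would point at the data provider's query startup, while a high wait_percent with a low first_row_ratio would point at the helpers not keeping the buffers full.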



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
