On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Gurjeet Singh <gurj...@singh.im> writes:
> > If I'm reading the code right [1], this GUC does not actually *synchronize*
> > the scans, but instead just makes sure that a new scan starts from a block
> > that was reported by some other backend performing a scan on the same
> > relation.
>
> Well, that's the only *direct* effect, but ...
>
> > Since the backends scanning the relation may be processing the relation at
> > different speeds, even though each one took the hint when starting the
> > scan, they may end up being out of sync with each other.
>
> The point you're missing is that the synchronization is self-enforcing:
> whichever backend gets ahead of the others will be the one forced to
> request (and wait for) the next physical I/O. This will naturally slow
> down the lower-CPU-cost-per-page scans. The other ones tend to catch up
> during the I/O operation.

Got it. So far, so good. But consider a pathological case where a scan is
driven by a user-controlled cursor, whose speed depends on how fast the user
presses the "Next" button; such a scan will quickly fall out of sync with the
other scans. Moreover, if a new scan happens to pick up the block reported by
this slow scan, the new scan may have to read blocks from disk afresh. So,
again, there is no guarantee that all the scans on a relation will stay
synchronized with each other; hence my proposal to include the term
'probability' in the definition.

> The feature is not terribly useful unless I/O costs are high compared to
> the CPU cost-per-page. But when that is true, it's actually rather
> robust. Backends don't have to have exactly the same per-page
> processing cost, because pages stay in shared buffers for a while after
> the current scan leader reads them.

Agreed. Even if the buffer has been evicted from shared_buffers, there's a
high likelihood that a scan following close on the heels of the others will
fetch it from the filesystem cache.

> > Imagining that all scans on a table are always synchronized, may make some
> > wrongly believe that adding more backends scanning the same table will not
> > incur any extra I/O; that is, only one stream of blocks will be read from
> > disk no matter how many backends you add to the mix. I noticed this when I
> > was creating partition tables, and each of those was a CREATE TABLE AS
> > SELECT FROM original_table (to avoid WAL generation), and running more than
> > 3 such transactions caused the disk read throughput to behave unpredictably,
> > sometimes even dipping below 1 MB/s for a few seconds at a stretch.
>
> It's not really the scans that's causing that to be unpredictable, it's
> the write I/O from the output side, which is forcing highly
> nonsequential behavior (or at least I suspect so ... how many disk units
> were involved in this test?)

You may be right. I don't have access to the system anymore, and I don't
remember the disk layout, but it's quite possible that the write operations
were causing the read throughput to drop.

I did try to reproduce the behaviour on my laptop with up to 6 backends doing
pure reads on a table several times the size of system RAM, but I could not
get them to fall out of sync.

--
Gurjeet Singh
http://gurjeet.singh.im/
EnterpriseDB Inc.
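
For anyone who wants to see the self-enforcing behaviour described above in
miniature, here is a toy, event-driven model of it. This is purely
illustrative Python, not PostgreSQL code; the per-page CPU costs, the I/O
cost, the cache size, and the block counts are made-up numbers. The idea it
demonstrates: whichever backend is furthest ahead pays the physical-read cost
for the next block, while the others find that block already cached and catch
up during the I/O wait.

    # Toy model of self-enforcing scan synchronization.
    # Not PostgreSQL code; all costs and sizes are arbitrary.
    import heapq

    NUM_BLOCKS = 10000                    # blocks in the hypothetical relation
    IO_COST = 10.0                        # time to read one block from disk
    CPU_COST = {0: 1.0, 1: 1.3, 2: 2.0}   # per-page CPU cost of three backends
    CACHE_SIZE = 32

    cached = set()                        # blocks currently "in cache"
    cache_order = []                      # FIFO eviction order
    positions = {b: 0 for b in CPU_COST}

    # Event queue: (time at which backend b is ready to process block blk).
    events = [(0.0, b, 0) for b in CPU_COST]
    heapq.heapify(events)

    while events:
        t, b, blk = heapq.heappop(events)
        if blk >= NUM_BLOCKS:
            continue                      # this backend's scan is finished
        # The backend that gets ahead pays the physical read; the others
        # usually find the block already cached.  (Simplified: the block
        # counts as cached as soon as the leader starts reading it.)
        cost = CPU_COST[b]
        if blk not in cached:
            cost += IO_COST
            cached.add(blk)
            cache_order.append(blk)
            if len(cache_order) > CACHE_SIZE:
                cached.discard(cache_order.pop(0))   # evict the oldest block
        positions[b] = blk
        heapq.heappush(events, (t + cost, b, blk + 1))
        if b == 0 and blk % 2000 == 0:
            spread = max(positions.values()) - min(positions.values())
            print(f"t={t:9.1f}  block={blk:5d}  spread between scans={spread}")

Running this, the spread between the three scans stays at a handful of blocks
even though their per-page CPU costs differ, because whoever gets ahead is the
one stuck waiting for I/O. Dropping IO_COST toward zero makes the scans drift
apart, which matches the point that the feature only helps when I/O costs
dominate the per-page CPU cost.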