On 6/7/13 10:14 AM, Robert Haas wrote:
>> If the page hit limit goes away, the user with a single core server who is
>> used to having autovacuum only pillage shared_buffers at 78MB/s might
>> complain if it became unbounded.

> Except that it shouldn't become unbounded, because of the ring-buffer
> stuff.  Vacuum can pillage the OS cache, but the degree to which a
> scan of a single relation can pillage shared_buffers should be sharply
> limited.

I wasn't talking about disruption of the data that's in the buffer cache. The only time the scenario I was describing plays out is when the data is already in shared_buffers. The concern is damage done to the CPU's data cache by this activity. Right now you can't even reach 100MB/s of damage to your CPU caches in an autovacuum process. Ripping out the page hit cost will eliminate that cap. Autovacuum could introduce gigabytes per second of memory -> L1 cache transfers. That's what all my details about memory bandwidth were trying to put into context. I don't think it really matters much, because the new bottleneck will be the processing speed of a single core, and that's still a decent cap for most people now.
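
To make the current cap concrete, here's a back-of-the-envelope sketch (plain Python, not anything from the tree) of where that 78MB/s figure comes from, assuming the stock vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 20ms, vacuum_cost_page_hit = 1, and 8KB pages:

# Current ceiling on scanning pages that are already in shared_buffers,
# under the stock cost settings listed above.
BLOCK_SIZE = 8192
cost_units_per_second = 200 / 0.020            # budget refills at 10,000/s
pages_per_second = cost_units_per_second / 1   # each buffer hit costs 1
print(pages_per_second * BLOCK_SIZE / (1024 * 1024))   # ~78 MB/s

Drop the page hit charge and that ceiling disappears; the scan rate is then bounded only by how fast a single core can chew through buffers.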

> I think you're missing my point here, which is that we shouldn't
> have any such thing as a "cost limit".  We should limit reads and
> writes *completely separately*.  IMHO, there should be a limit on
> reading, and a limit on dirtying data, and those two limits should not
> be tied to any common underlying "cost limit".  If they are, they will
> not actually enforce precisely the set limit, but some other composite
> limit which will just be weird.

I see the distinction you're making now; I don't need a mock-up to follow you. The main challenge with moving that way is that read and write rates never end up being completely disconnected from one another. A read will only cost some fraction of what a write does, but the two shouldn't be completely independent.

Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes individually, that doesn't mean I'm happy with the server doing 9MB/s of reads + 5MB/s of writes = 14MB/s of I/O in an implementation where the two float independently. It's certainly possible to disconnect the two like that, and people will be able to work something out anyway. But I personally would prefer not to lose the ability to specify how expensive read and write operations should be considered relative to one another.
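
To spell that out with a toy sketch (made-up numbers matching the example above, nothing tied to real GUCs):

# Independent caps: each stream is limited on its own, so combined I/O
# can legitimately reach the sum of the two limits.
read_limit, write_limit = 10.0, 5.0       # MB/s, enforced separately
observed_read, observed_write = 9.0, 5.0
print(min(observed_read, read_limit) + min(observed_write, write_limit))  # 14.0

# Blended budget: charge a write some multiple of a read (2x here) and cap
# the combined spend so that all-reads tops out at 10 MB/s and all-writes
# at 5 MB/s. The same 9 + 5 mix now overspends the budget and gets
# throttled, keeping total I/O bounded.
write_cost, budget = 2.0, 10.0
spend = observed_read + write_cost * observed_write   # 19 cost units
print(spend > budget)                                 # True

The blended form is what lets you say "a write is N times as expensive as a read" and have the total stay inside one envelope.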

Related aside: shared_buffers is becoming a smaller fraction of total RAM with each release, because it's stuck at this rough 8GB limit right now. As the OS cache becomes a larger multiple of the shared_buffers size, the expense of the average read is dropping: reads are increasingly likely to be in the OS cache but not shared_buffers. Writes, meanwhile, are as expensive as ever.

The real-world tunings I'm doing now reflect that. On servers with >128GB of RAM, they've typically gone this far in that direction:

vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20

That's 4MB/s of writes, 40MB/s of reads, or some blended mix that considers writes 10X as expensive as reads. The blend is a feature.
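
For reference, the arithmetic behind those numbers (a sketch assuming the stock vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 20ms, 8KB pages):

BLOCK_SIZE = 8192
cost_units_per_second = 200 / 0.020     # 10,000 with the stock limit/delay
page_miss, page_dirty = 2, 20           # the settings above

to_mb = lambda pages_per_sec: pages_per_sec * BLOCK_SIZE / (1024 * 1024)
print(to_mb(cost_units_per_second / page_miss))    # ~40 MB/s if it's all reads
print(to_mb(cost_units_per_second / page_dirty))   # ~4 MB/s if it's all writes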

The logic here is starting to remind me of how the random_page_cost default has been justified. Real-world random reads are actually closer to 50X as expensive as sequential ones. But the average read from the executor's perspective is effectively discounted by OS cache hits, so 4.0 still works OK. On large memory servers, random reads keep getting cheaper via better OS cache hit odds, and that's increasingly becoming something important to tune for.
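
Rough numbers behind that, treating a cached random read as nearly free and an uncached one as ~50X a sequential read (both assumptions, just to show the shape of it):

raw_random_cost = 50.0   # uncached random read vs. sequential, assumed

def effective_cost(cache_hit_rate):
    # Cache hits are treated as free here; only misses pay the 50X penalty.
    return (1 - cache_hit_rate) * raw_random_cost

for hit_rate in (0.90, 0.92, 0.95, 0.99):
    print(f"{hit_rate:.0%} cached -> effective cost {effective_cost(hit_rate):.1f}")
    # roughly 5.0, 4.0, 2.5, 0.5

A default of 4.0 lines up with roughly 92% of random reads being served from cache; as memory grows and the hit rate climbs, the effective cost keeps falling, which is why big-memory servers increasingly want a lower setting.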

Some of this mess would go away if we could crack the shared_buffers scaling issues for 9.4. There's finally enough dedicated hardware around to see the issue and work on it, but I haven't gotten a clear picture of any reproducible test workload that gets slower with large buffer cache sizes. If anyone has a public test case that gets slower when shared_buffers goes from 8GB to 16GB, please let me know; I've got two systems set up that I could chase that down on now.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


