On 1/14/14, 4:21 AM, Mel Gorman wrote:
> There is an interesting side-line here. If all IO is initiated by one
> process in postgres then the memory locality will be sub-optimal.
> The consumer of the data may or may not be running on the same
> node as the process that read the data from disk. It is possible to
> migrate this from user space but the interface is clumsy and assumes the
> data is mapped.

That's really not the case in Postgres. There are essentially three main areas 
IO requests come from:

- Individual "backends". These are processes forked off of our startup process 
(postmaster) for the purpose of serving user connections. This is always "foreground" IO 
and should be avoided as much as possible (but is still a large percentage).
- autovacuum. This is a set of "clean-up" processes, meant to be low impact and 
background only. Similar to garbage collection in GC languages.
- bgwriter. This process is meant to greatly reduce the need for user backends 
to write data out.

Generally speaking, read requests are most likely to come from user backends. 
autovacuum can issue them too, but it has a throttling mechanism, so it 
generally shouldn't account for much of the workload.

Ideally most write traffic would come from bgwriter (and autovacuum, though again we 
don't care too much about it). In reality though, that's going to depend very highly on a 
user's actual workload. To start, backends normally must write all write-ahead-log 
traffic before they finalize (COMMIT) a transaction for the user. COMMIT is sort of 
similar in idea to fsync... "When this returns I guarantee I've permanently stored 
your data."
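In other words, the COMMIT contract boils down to the classic write-then-fsync pattern. A rough sketch (not actual Postgres code; the file name is made up, and real WAL segments are preallocated files):

```python
# Sketch of the durability contract behind COMMIT. The file name is
# hypothetical; this just shows the write-then-fsync pattern.
import os

def commit(fd: int, wal_record: bytes) -> None:
    # write() only hands the bytes to the kernel's page cache...
    os.write(fd, wal_record)
    # ...fsync() is what backs the COMMIT guarantee: it blocks until
    # the data is durable on stable storage.
    os.fsync(fd)

fd = os.open("wal_segment.tmp", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
commit(fd, b"INSERT ...; COMMIT;")
os.close(fd)
```

Until that fsync returns, the backend cannot report success to the user, which is why WAL flushes are squarely on the foreground latency path.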

The amount of WAL data generated for a transaction will vary enormously, even 
as a percentage of raw page data written. In some cases a very small (10s-100s 
of bytes) amount of WAL data will cover 1 or more base data pages (8k by 
default, up to 64k). But to protect against torn page writes, by default we 
write a complete copy of a data page to WAL the first time the page is dirtied 
after a checkpoint. So the opposite scenario is we actually write slightly MORE 
data to WAL than we do to the data pages.
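The full-page-write rule above can be sketched roughly like this (names are illustrative, not PostgreSQL internals):

```python
# Sketch of the full-page-write rule: the first time a page is
# dirtied after a checkpoint, log the whole page image; afterwards,
# log only the small delta. Simplified, illustrative names.
PAGE_SIZE = 8192  # default block size

def wal_bytes_for_update(page_lsn: int, checkpoint_redo_lsn: int,
                         delta_size: int) -> int:
    """Return the WAL volume generated by dirtying one page."""
    if page_lsn < checkpoint_redo_lsn:
        # First write since the last checkpoint: log a full page
        # image to guard against torn page writes.
        return PAGE_SIZE + delta_size
    # Page already logged since the checkpoint: only the delta.
    return delta_size

# A 100-byte change right after a checkpoint costs ~8K of WAL...
assert wal_bytes_for_update(90, 100, 100) == 8292
# ...but only ~100 bytes once the page has been logged again.
assert wal_bytes_for_update(110, 100, 100) == 100
```

That asymmetry is why checkpoint frequency has such a large effect on total WAL volume.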

What makes WAL even trickier is that bgwritter tries to write WAL data out 
before backends need to. In a system with a fairly low transaction rate that 
can work... but with a higher rate most WAL data will be written by a backend 
trying to issue a COMMIT. Note however that COMMIT needs to write ALL WAL data 
up to a given point, so one backend that only needs to write 100 bytes can 
easily end up flushing (and fsync'ing) megabytes of data written by some other 
backend.
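That "one backend flushes everyone's WAL" effect falls out of the log being a single ordered stream. A toy model (illustrative only, not real Postgres code):

```python
# Toy model of a single ordered WAL stream: flushing "up to my
# commit record" necessarily drags along everything queued before
# it, regardless of which backend wrote it. Names are illustrative.
class WalStream:
    def __init__(self):
        self.buffered = []      # (end_lsn, nbytes, owner)
        self.next_lsn = 0

    def insert(self, nbytes: int, owner: str) -> int:
        """Append a record; return the LSN at its end."""
        self.next_lsn += nbytes
        self.buffered.append((self.next_lsn, nbytes, owner))
        return self.next_lsn

    def flush(self, upto_lsn: int) -> int:
        """Flush everything up to upto_lsn; return bytes written."""
        written = sum(n for lsn, n, _ in self.buffered if lsn <= upto_lsn)
        self.buffered = [r for r in self.buffered if r[0] > upto_lsn]
        return written

wal = WalStream()
wal.insert(4 * 1024 * 1024, "backend A")  # bulk load, not yet committed
my_commit = wal.insert(100, "backend B")  # tiny transaction commits
# Backend B's COMMIT must flush A's megabytes too:
assert wal.flush(my_commit) == 4 * 1024 * 1024 + 100
```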

Further complicating things is temporary storage, either in the form of user 
defined temporary tables, or temporary storage needed by the database itself. 
It's hard to characterize these workloads other than to say that typically 
reading and writing to them will want to move a relatively large amount of data 
at once.

BTW, because Postgres doesn't have terribly sophisticated memory management, 
it's very common to create temporary file data that will never, ever, ever 
actually NEED to hit disk. Where I work, being able to tell the kernel to avoid 
flushing those files unless the kernel thinks it's got better things to do with 
that memory would be EXTREMELY valuable, because it's all temp data anyway: if 
the database or server crashes it's just going to get thrown away. It might be a 
good idea for Postgres to look at simply putting this data into plain 
memory now and relying on the OS to swap it as needed. That'd be more 
problematic for temp tables, but in that case mmap might work very well, 
because that data is currently never shared by other processes, though if we 
start doing parallel query execution that will change.
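For what it's worth, the "plain memory, let the OS swap it" idea could look roughly like an anonymous mapping (a sketch of the concept, not a proposal for actual Postgres code):

```python
# Sketch of the "plain memory, let the OS swap it" idea for temp
# data: an anonymous mapping is backed by swap rather than a file,
# so there is nothing to fsync and a crash simply discards it.
# Illustrative only, not Postgres code.
import mmap

TEMP_BYTES = 1 << 20  # 1MB of scratch space

# fd of -1 requests anonymous memory: no file, nothing to flush.
scratch = mmap.mmap(-1, TEMP_BYTES)
scratch[:11] = b"temp tuples"
assert scratch[:4] == b"temp"
scratch.close()
```

The kernel will only push those pages to swap under memory pressure, which is exactly the "flush only if you've got better uses for the memory" behavior described above.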
--
Jim C. Nasby, Data Architect                       j...@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers