On 2011-08-08 15:29, Robert Haas wrote:
On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine<dimi...@2ndquadrant.fr>  wrote:
Robert Haas<robertmh...@gmail.com>  writes:
It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read-write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid having the same sort of mutex contention that's killing the
kernel in this workload.
What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?
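
To make the lseek() concern concrete, here is a minimal sketch (not PostgreSQL code) of a size estimate taken by seeking to the end of a file. The 8kB block size matches the figure used in this thread; the name relation_blocks is hypothetical. Once a file is preallocated, this count reflects allocated space rather than blocks actually holding data.

```python
import os
import tempfile

BLCKSZ = 8192  # 8kB block size, as discussed in the thread


def relation_blocks(fd: int) -> int:
    """Estimate the block count of a file by seeking to its end."""
    size = os.lseek(fd, 0, os.SEEK_END)
    return size // BLCKSZ


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        # Three blocks of real data, then preallocate up to 128 blocks (1MB).
        f.write(b"\0" * (3 * BLCKSZ))
        f.flush()
        print(relation_blocks(f.fileno()))  # 3
        f.truncate(128 * BLCKSZ)
        print(relation_blocks(f.fileno()))  # 128, although only 3 were written
```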
I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.
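
The "more than 99%" figure can be checked with a little arithmetic: a 1MB chunk holds 128 blocks of 8kB, so one extension replaces 128. The sketch below only works through that calculation; the function name is made up for illustration.

```python
BLCKSZ = 8 * 1024      # current extension unit: one 8kB block
CHUNK = 1024 * 1024    # proposed extension unit: 1MB


def extension_reduction(chunk_bytes: int, block_bytes: int = BLCKSZ) -> float:
    """Fraction of extension calls saved by extending chunk_bytes at a time."""
    blocks_per_chunk = chunk_bytes // block_bytes
    return 1.0 - 1.0 / blocks_per_chunk


if __name__ == "__main__":
    print(CHUNK // BLCKSZ)                             # 128
    print(round(extension_reduction(CHUNK) * 100, 2))  # 99.22
```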
Pre-extending in bigger chunks has other benefits
as well, since it helps the filesystem (if it supports extents) to get
the data from the relation laid out in sequential order on disk.

On a well-filled relation, running filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as many as 128 extents.

I would suggest 5% of the current relation size or 25-100MB, whichever
is smaller. That would still keep the size down on small relations.
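
A rough sketch of that policy, under assumptions: the cap is taken as 25MB (the low end of the 25-100MB range), the result is rounded up to whole 8kB blocks, and at least one block is always added. The name extension_size and its parameters are hypothetical, not anything in PostgreSQL.

```python
BLCKSZ = 8192
MB = 1024 * 1024


def extension_size(current_bytes: int, cap_bytes: int = 25 * MB,
                   percent: int = 5) -> int:
    """Bytes to extend by: percent of the current size, capped at cap_bytes."""
    target = min(current_bytes * percent // 100, cap_bytes)
    # Round up to whole blocks, and never extend by less than one block.
    blocks = max(1, -(-target // BLCKSZ))
    return blocks * BLCKSZ


if __name__ == "__main__":
    for size in (0, 1 * MB, 100 * MB, 10 * 1024 * MB):
        print(size, extension_size(size))
```

Small relations grow by a handful of blocks, while large ones hit the cap, which is what keeps the wasted space bounded.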

--
Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers