Re: [PERFORM] Block at a time ...
On Mar 22, 2010, at 4:46 PM, Craig James wrote:

> On 3/22/10 11:47 AM, Scott Carey wrote:
>> On Mar 17, 2010, at 9:41 AM, Craig James wrote:
>>> On 3/17/10 2:52 AM, Greg Stark wrote:
>>>> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <li...@peufeu.com> wrote:
>>>>> I was thinking of something like that, except that the factor I'd
>>>>> use would be something like 50% or 100% of the current size, capped
>>>>> at (say) 1 GB.
>>>>
>>>> This turns out to be a bad idea. One of the first things Oracle DBAs
>>>> are told to do is change this default setting to allocate some
>>>> reasonably large fixed size rather than scaling upwards. This might
>>>> be mostly due to Oracle's extent-based space management, but I'm not
>>>> so sure. Recall that the filesystem is probably doing some rounding
>>>> itself: if you allocate 120kB it's probably allocating 128kB anyway,
>>>> and having two layers rounding up will result in odd behaviour.
>>>>
>>>> In any case I was planning on doing this a while back. Then I ran
>>>> some experiments and couldn't actually demonstrate any problem. ext2
>>>> seems to do a perfectly reasonable job of avoiding this problem. All
>>>> the files were mostly large contiguous blocks after running some
>>>> tests -- IIRC running pgbench.
>>>
>>> This is one of the more-or-less solved problems in Unix/Linux. Ext*
>>> file systems have a reserve, usually 10% of the disk space, that
>>> nobody except root can use. It's not for root: it's because with 10%
>>> of the disk free, you can almost always do a decent job of allocating
>>> contiguous blocks and get good performance. Unless Postgres has some
>>> weird problem that Linux has never seen before (and that wouldn't be
>>> unprecedented...), there's probably no need to fool with
>>> file-allocation strategies.
>>>
>>> Craig
>>
>> It's fairly easy to break. Just do a parallel import with, say, 16
>> concurrent tables being written to at once. Result? Fragmented tables.
>
> Is this from real-life experience? With fragmentation, there's a point
> of diminishing returns: a couple of head-seeks now and then hardly
> matter.
>
> My recollection is that even when there are lots of concurrent
> processes running that are all making files larger and larger, the
> Linux file system can still do a pretty good job of allocating
> mostly-contiguous space. It doesn't just dumbly allocate from some
> list, but rather tries to allocate in a way that results in pretty good
> contiguousness (if that's a word). On the other hand, this is just from
> reading discussion groups like this one over the last few decades -- I
> haven't tried it...

Well, how fragmented is too fragmented depends on the use case and the
hardware's capability. In real-world use, which for me means about 20
phases of large bulk inserts a day and not a lot of updates or index
maintenance, the system gets somewhat fragmented but it's not too bad.

I did a dump/restore in 8.4 with parallel restore and it was much slower
than usual; a single-threaded restore was much faster. The dev
environments are on ext3 and we see this pretty clearly, though poor OS
tuning can mask it (readahead parameter not set high enough). This is
CentOS 5.4/5.3; perhaps later kernels are better at scheduling file
writes to avoid this. We also use the deadline scheduler, which helps a
lot on concurrent reads but might be messing up concurrent writes.

On production with xfs this was also bad at first -- in fact worse,
because xfs's default 'allocsize' setting is 64k, so files were
regularly fragmented in small multiples of 64k. Changing the 'allocsize'
parameter to 80MB made the restore process produce files with fragment
sizes of 80MB. 80MB is big for most systems, but this array does over
1000MB/sec sequential read at peak and only 200MB/sec with moderate
fragmentation. The setting won't cause allocation to fail due to any
'reservations' of the delayed allocation; it just means the filesystem
won't choose to create a new file or extent within 80MB of another open
file unless it has to.
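The 'allocsize' change above boils down to a mount option plus a
readahead setting. A minimal sketch, not taken from the thread: the
device and mount point are hypothetical, and 64m is used here as a
commonly cited power-of-two value in place of the poster's 80MB (XFS
expects power-of-two sizes); this needs root and is better made
permanent in /etc/fstab:

```shell
# Hypothetical device and mount point; adjust for your system.

# XFS: raise speculative preallocation so slowly growing files are
# laid out in large contiguous chunks (the default allocsize is 64k).
mount -o remount,allocsize=64m /dev/sdb1 /var/lib/pgsql

# Raise block-device readahead so sequential scans can stream through
# mildly fragmented files (units are 512-byte sectors: 16384 = 8 MB).
blockdev --setra 16384 /dev/sdb
```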
This can cause performance problems if you have lots of small files,
which is why the default is 64k.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Block at a time ...
On Mar 17, 2010, at 9:41 AM, Craig James wrote:

> On 3/17/10 2:52 AM, Greg Stark wrote:
>> [...]
>> In any case I was planning on doing this a while back. Then I ran some
>> experiments and couldn't actually demonstrate any problem. ext2 seems
>> to do a perfectly reasonable job of avoiding this problem. All the
>> files were mostly large contiguous blocks after running some tests --
>> IIRC running pgbench.
>
> This is one of the more-or-less solved problems in Unix/Linux. Ext*
> file systems have a reserve, usually 10% of the disk space, that nobody
> except root can use. It's not for root: it's because with 10% of the
> disk free, you can almost always do a decent job of allocating
> contiguous blocks and get good performance. Unless Postgres has some
> weird problem that Linux has never seen before (and that wouldn't be
> unprecedented...), there's probably no need to fool with
> file-allocation strategies.
>
> Craig

It's fairly easy to break. Just do a parallel import with, say, 16
concurrent tables being written to at once. Result? Fragmented tables.
Re: [PERFORM] Block at a time ...
>> This is one of the more-or-less solved problems in Unix/Linux. Ext*
>> file systems have a reserve, usually 10% of the disk space, that
>> nobody except root can use. [...] There's probably no need to fool
>> with file-allocation strategies.
>
> It's fairly easy to break. Just do a parallel import with, say, 16
> concurrent tables being written to at once. Result? Fragmented tables.

Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a
medium-to-high rate (a few megabytes per second and up), when lots of
data can sit in the cache and be flushed/allocated as big contiguous
chunks. I'm pretty sure ext4/XFS would pass your parallel import test.

However, if you have files like tables (and indexes) or logs that grow
slowly over time (something like a few megabytes per hour or less), then
after a few days/weeks/months horrible fragmentation is an almost
guaranteed result on many filesystems (NTFS being perhaps the absolute
worst).
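The parallel-import failure mode is easy to try by hand. A minimal
sketch, not the test anyone in the thread actually ran: several writers
append small chunks to their own files concurrently, then filefrag
(from e2fsprogs, if installed) reports how many extents each file ended
up with; on ext4/XFS delayed allocation usually keeps the counts low,
while on older filesystems they can approach one extent per append:

```shell
# Interleaved appends from 4 concurrent "imports", 1 MB at a time.
dir=$(mktemp -d)
for i in 1 2 3 4; do
  (
    for j in $(seq 16); do
      dd if=/dev/zero of="$dir/table_$i" bs=1M count=1 \
         oflag=append conv=notrunc 2>/dev/null
    done
  ) &
done
wait

# Extent counts per 16 MB file (requires e2fsprogs; harmless if absent).
filefrag "$dir"/table_* 2>/dev/null || true
```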
Re: [PERFORM] Block at a time ...
> This is why pre-allocation is a good idea if you have the space

Tom, what about a really simple command in a forthcoming release of PG
that would just preformat a 1GB file at a time? This is what I've always
done, scripted, with Oracle (ALTER TABLESPACE foo ADD DATAFILE ...)
rather than relying on its autoextender when performance has been a
concern.

Cheers
Dave

On Mon, Mar 22, 2010 at 3:55 PM, Pierre C <li...@peufeu.com> wrote:

> [...]
>
> Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a
> medium-to-high rate (a few megabytes per second and up), when lots of
> data can sit in the cache and be flushed/allocated as big contiguous
> chunks. I'm pretty sure ext4/XFS would pass your parallel import test.
>
> However, if you have files like tables (and indexes) or logs that grow
> slowly over time (something like a few megabytes per hour or less),
> then after a few days/weeks/months horrible fragmentation is an almost
> guaranteed result on many filesystems (NTFS being perhaps the absolute
> worst).
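A rough stand-in for such a preformat command already exists at the
shell level. A minimal sketch (the file name is made up, and a real run
would use count=1024 for 1 GB; 16 MB keeps the demo quick): zero-fill a
data file the way a scripted Oracle ADD DATAFILE does:

```shell
# Preformat a data file by writing zeros through to disk, the manual
# equivalent of Oracle's ALTER TABLESPACE ... ADD DATAFILE.
# A real deployment would use count=1024 (1 GB); 16 MB keeps this fast.
prealloc=./prealloc_demo.dat
dd if=/dev/zero of="$prealloc" bs=1M count=16 conv=fsync 2>/dev/null
stat -c '%s bytes preallocated' "$prealloc"
```

Since the file is written with real zeros (not a sparse hole), the
blocks are allocated contiguously up front, which is the whole point.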
Re: [PERFORM] Block at a time ...
On Mon, Mar 22, 2010 at 6:47 PM, Scott Carey <sc...@richrelevance.com> wrote:

> It's fairly easy to break. Just do a parallel import with, say, 16
> concurrent tables being written to at once. Result? Fragmented tables.

FWIW, I did do some investigation about this at one point and could not
demonstrate any significant fragmentation. But that was on Linux --
different filesystem implementations would have different success rates.
And there could be other factors as well, such as how full the
filesystem is or how old it is.

--
greg
Re: [PERFORM] Block at a time ...
On 3/22/10 11:47 AM, Scott Carey wrote:

> On Mar 17, 2010, at 9:41 AM, Craig James wrote:
>> [...] Unless Postgres has some weird problem that Linux has never seen
>> before (and that wouldn't be unprecedented...), there's probably no
>> need to fool with file-allocation strategies.
>
> It's fairly easy to break. Just do a parallel import with, say, 16
> concurrent tables being written to at once. Result? Fragmented tables.

Is this from real-life experience? With fragmentation, there's a point
of diminishing returns: a couple of head-seeks now and then hardly
matter.

My recollection is that even when there are lots of concurrent processes
running that are all making files larger and larger, the Linux file
system can still do a pretty good job of allocating mostly-contiguous
space. It doesn't just dumbly allocate from some list, but rather tries
to allocate in a way that results in pretty good contiguousness (if
that's a word). On the other hand, this is just from reading discussion
groups like this one over the last few decades -- I haven't tried it...

Craig
Re: [PERFORM] Block at a time ...
> I was thinking of something like that, except that the factor I'd use
> would be something like 50% or 100% of the current size, capped at
> (say) 1 GB.

Using fallocate() ?
Re: [PERFORM] Block at a time ...
On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <li...@peufeu.com> wrote:

>>> I was thinking of something like that, except that the factor I'd
>>> use would be something like 50% or 100% of the current size, capped
>>> at (say) 1 GB.
>>
>> This turns out to be a bad idea. One of the first things Oracle DBAs
>> are told to do is change this default setting to allocate some
>> reasonably large fixed size rather than scaling upwards. This might be
>> mostly due to Oracle's extent-based space management, but I'm not so
>> sure. Recall that the filesystem is probably doing some rounding
>> itself: if you allocate 120kB it's probably allocating 128kB anyway,
>> and having two layers rounding up will result in odd behaviour.
>>
>> In any case I was planning on doing this a while back. Then I ran some
>> experiments and couldn't actually demonstrate any problem. ext2 seems
>> to do a perfectly reasonable job of avoiding this problem. All the
>> files were mostly large contiguous blocks after running some tests --
>> IIRC running pgbench.
>
> Using fallocate() ?

I think we need posix_fallocate().

--
greg
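The posix_fallocate(3) call being discussed is reachable from the shell
via the fallocate(1) utility from util-linux, which makes the behavior
easy to poke at: space is reserved without writing anything. A minimal
sketch (the file name is made up; it fails outright on filesystems
without preallocation support):

```shell
# Reserve 8 MiB for a file without writing zeros. The blocks are
# allocated on disk (note %b below) even though nothing was written.
f=./fallocate_demo.dat
fallocate -l $((8 * 1024 * 1024)) "$f" &&
  stat -c '%s bytes, %b blocks allocated' "$f"
```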
Re: [PERFORM] Block at a time ...
Greg - with Oracle, I always do fixed 2GB dbf's for portability, and
preallocate the whole file in advance. However, the situation is a bit
different in that Oracle will put blocks from multiple tables and
indexes in a DBF if you don't tell it differently.

Tom - I'm not sure what Oracle does, but it literally writes the whole
extent before using it. I think they are just doing the literal
equivalent of *dd if=/dev/zero* ... it takes several seconds to prep a
2GB file on decent storage.

Cheers
Dave

On Wed, Mar 17, 2010 at 9:27 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Greg Stark <gsst...@mit.edu> writes:
>> I think we need posix_fallocate().
>
> The problem with posix_fallocate (other than questionable portability)
> is that it doesn't appear to guarantee anything at all about what is in
> the space it allocates. Worst case, we might find valid-looking
> Postgres data there (e.g., because a block was recycled from some
> recently dropped table). If we have to write something anyway to zero
> the space, what's the point?
>
> regards, tom lane
Re: [PERFORM] Block at a time ...
On 3/17/10 2:52 AM, Greg Stark wrote:

> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <li...@peufeu.com> wrote:
>> I was thinking of something like that, except that the factor I'd use
>> would be something like 50% or 100% of the current size, capped at
>> (say) 1 GB.
>
> This turns out to be a bad idea. One of the first things Oracle DBAs
> are told to do is change this default setting to allocate some
> reasonably large fixed size rather than scaling upwards. This might be
> mostly due to Oracle's extent-based space management, but I'm not so
> sure. Recall that the filesystem is probably doing some rounding
> itself: if you allocate 120kB it's probably allocating 128kB anyway,
> and having two layers rounding up will result in odd behaviour.
>
> In any case I was planning on doing this a while back. Then I ran some
> experiments and couldn't actually demonstrate any problem. ext2 seems
> to do a perfectly reasonable job of avoiding this problem. All the
> files were mostly large contiguous blocks after running some tests --
> IIRC running pgbench.

This is one of the more-or-less solved problems in Unix/Linux. Ext* file
systems have a reserve, usually 10% of the disk space, that nobody
except root can use. It's not for root: it's because with 10% of the
disk free, you can almost always do a decent job of allocating
contiguous blocks and get good performance. Unless Postgres has some
weird problem that Linux has never seen before (and that wouldn't be
unprecedented...), there's probably no need to fool with file-allocation
strategies.

Craig
Re: [PERFORM] Block at a time ...
Greg is correct, as usual. Geometric growth of files is A Bad Thing in
an Oracle DBA's world, since you can unexpectedly (automatically?) run
out of file system space when the database determines it needs x% more
extents than last time.

The concept of contiguous extents, however, has some merit, particularly
when restoring databases. Prior to parallel restore, a table's files
were created and extended in roughly contiguous allocations, presuming
there was no other activity on your database disks. (You do dedicate
disks, don't you?) When using 8-way parallel restore against a six-disk
RAID 10 group, I found that table and index scan performance dropped by
about 10x. I/O performance was restored by either clustering the tables
one at a time, or by dropping and restoring them one at a time. The only
reason I can come up with for this behavior is file fragmentation and
increased seek times. If PostgreSQL had a mechanism to pre-allocate
files prior to restoring the database, that might mitigate the problem.

Then if we could only get parallel index operations ...

Bob Lunney

--- On Wed, 3/17/10, Greg Stark <gsst...@mit.edu> wrote:

> From: Greg Stark <gsst...@mit.edu>
> Subject: Re: [PERFORM] Block at a time ...
> To: Pierre C <li...@peufeu.com>
> Cc: Alvaro Herrera <alvhe...@commandprompt.com>,
>     Dave Crooke <dcro...@gmail.com>, pgsql-performance@postgresql.org
> Date: Wednesday, March 17, 2010, 5:52 AM
>
> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <li...@peufeu.com> wrote:
>> I was thinking of something like that, except that the factor I'd use
>> would be something like 50% or 100% of the current size, capped at
>> (say) 1 GB.
>
> This turns out to be a bad idea. [...] ext2 seems to do a perfectly
> reasonable job of avoiding this problem. All the files were mostly
> large contiguous blocks after running some tests -- IIRC running
> pgbench.
>
>> Using fallocate() ?
>
> I think we need posix_fallocate().
>
> --
> greg
[PERFORM] Block at a time ...
I agree with Tom, any reordering attempt is at best second-guessing the
filesystem and underlying storage. However, having the ability to
control the extent size would be a worthwhile improvement for systems
that walk and chew gum (write to lots of tables) concurrently. I'm
thinking of Oracle's AUTOEXTEND settings for tablespace datafiles.

I think the ideal way to do it for PG would be to make the equivalent
configurable in postgresql.conf system-wide, and allow specific
per-table settings in the SQL metadata, similar to auto-vacuum.

An awesomely simple alternative is to just specify the extension as,
e.g., 5% of the existing table size: it starts by adding one block at a
time for tiny tables, and once your table is over 20GB it ends up adding
a whole 1GB file and pre-allocating it. Very little wastage.

Cheers
Dave

On Tue, Mar 16, 2010 at 4:49 PM, Alvaro Herrera
<alvhe...@commandprompt.com> wrote:

> Tom Lane wrote:
>> Alvaro Herrera <alvhe...@commandprompt.com> writes:
>>> Maybe it would make more sense to try to reorder the fsync calls
>>> instead.
>>
>> Reorder to what, though? You still have the problem that we don't
>> know much about the physical layout on-disk.
>
> Well, to block numbers as a first step.
>
> However, this reminds me that sometimes we take the block-at-a-time
> extension policy too seriously. We had a customer that had a
> performance problem because they were inserting lots of data to TOAST
> tables, causing very frequent extensions. I kept wondering whether an
> allocation policy that allocated several new blocks at a time could be
> useful (but I didn't try it). This would also alleviate fragmentation,
> thus helping the physical layout be more similar to logical block
> numbers.
>
> --
> Alvaro Herrera                http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
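The arithmetic behind "very little wastage" is easy to check. A minimal
sketch, assuming PostgreSQL's 8 kB block size and the 1 GB cap suggested
above (the 30 GB target is arbitrary):

```shell
# Simulate "extend by 5% of current size, at least one 8 kB block,
# capped at 1 GB" and count the extensions needed to reach 30 GB.
summary=$(awk 'BEGIN {
  blk = 8192; cap = 1024^3; target = 30 * 1024^3
  size = blk
  while (size < target) {
    ext = int(size * 0.05)
    if (ext < blk) ext = blk
    if (ext > cap) ext = cap
    size += ext
    n++
  }
  # worst-case overshoot is one cap-sized extent, i.e. 1 GB,
  # and that only happens on a table already over 20 GB
  printf "%d extensions to reach %.1f GB", n, size / 1024^3
}')
echo "$summary"
```

Only a few hundred extensions take the table from a single 8 kB block to
30 GB, versus nearly four million block-at-a-time extensions, while the
waste stays bounded at 5% of the table (one extent at most).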