Re: [PERFORM] Block at a time ...

2010-03-26 Thread Scott Carey

On Mar 22, 2010, at 4:46 PM, Craig James wrote:

 On 3/22/10 11:47 AM, Scott Carey wrote:
 
 On Mar 17, 2010, at 9:41 AM, Craig James wrote:
 
 On 3/17/10 2:52 AM, Greg Stark wrote:
 On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com   wrote:
 I was thinking of something like that, except that the factor I'd use
 would be something like 50% or 100% of current size, capped at (say) 1 
 GB.
 
 This turns out to be a bad idea. One of the first things Oracle DBAs
 are told to do is change this default setting to allocate some
 reasonably large fixed size rather than scaling upwards.
 
 This might be mostly due to Oracle's extent-based space management but
 I'm not so sure. Recall that the filesystem is probably doing some
 rounding itself. If you allocate 120kB it's probably allocating 128kB
 itself anyways. Having two layers rounding up will result in odd
 behaviour.
 
 In any case I was planning on doing this a while back. Then I ran some
 experiments and couldn't actually demonstrate any problem. ext2 seems
 to do a perfectly reasonable job of avoiding this problem. All the
 files were mostly large contiguous blocks after running some tests --
 IIRC running pgbench.
 
 This is one of the more-or-less solved problems in Unix/Linux.  Ext* file 
 systems have a reserve usually of 10% of the disk space that nobody 
 except root can use.  It's not for root, it's because with 10% of the disk 
 free, you can almost always do a decent job of allocating contiguous blocks 
 and get good performance.  Unless Postgres has some weird problem that 
 Linux has never seen before (and that wouldn't be unprecedented...), 
 there's probably no need to fool with file-allocation strategies.
 
 Craig
 
 
 It's fairly easy to break.  Just do a parallel import with say, 16 concurrent 
 tables being written to at once.  Result?  Fragmented tables.
 
 Is this from real-life experience?  With fragmentation, there's a point of 
 diminishing return.  A couple head-seeks now and then hardly matter.  My 
 recollection is that even when there are lots of concurrent processes running 
 that are all making files larger and larger, the Linux file system still can 
 do a pretty good job of allocating mostly-contiguous space.  It doesn't just 
 dumbly allocate from some list, but rather tries to allocate in a way that 
 results in pretty good contiguousness (if that's a word).
 
 On the other hand, this is just from reading discussion groups like this one 
 over the last few decades, I haven't tried it...
 

Well, how fragmented is too fragmented depends on the use case and the hardware 
capability.  In real-world use, which for me means about 20 phases of large 
bulk inserts a day and not a lot of updates or index maintenance, the system 
gets somewhat fragmented, but it's not too bad.  I did a dump/restore in 8.4 with 
parallel restore and it was much slower than usual; a single-threaded 
restore was much faster.  The dev environments are on ext3 and we see 
this pretty clearly -- but poor OS tuning can mask it (readahead parameter not 
set high enough).  This is CentOS 5.4/5.3; perhaps later kernels are better at 
scheduling file writes to avoid this.  We also use the deadline scheduler, which 
helps a lot on concurrent reads but might be messing up concurrent writes.
On production with xfs this was also bad at first -- in fact worse, because 
xfs's default 'allocsize' setting is 64k, so files were regularly fragmented 
in small multiples of 64k.  Changing the 'allocsize' parameter to 80MB made 
the restore process produce files with fragment sizes of 80MB.  80MB is big for 
most systems, but this array does over 1000MB/sec sequential read at peak, and 
only 200MB/sec with moderate fragmentation.
A large 'allocsize' won't cause allocations to fail due to any 'reservations' 
made by the delayed allocation; it just means xfs won't choose to place a new 
file or extent within 80MB of another file that is open unless it has to.  This 
can cause performance problems if you have lots of small files, which is why 
the default is 64k.
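
For readers wondering what 'allocsize' actually changes: it is a mount-time 
hint to xfs's speculative preallocation, not something the application asks 
for.  A rough userspace analog on Linux is fallocate() with 
FALLOC_FL_KEEP_SIZE, which reserves contiguous blocks past end-of-file 
without changing the visible file length.  A minimal sketch, assuming Linux 
and glibc (this is not what Postgres or xfs does internally, and the 80MB 
figure is just the one from the example above):

    /* Sketch: reserve space past EOF so later appends land in one extent.
     * Linux-specific; assumes fallocate()/FALLOC_FL_KEEP_SIZE are available. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    #define RESERVE_BYTES (80LL * 1024 * 1024)   /* 80MB, per the example above */

    int reserve_ahead(int fd)
    {
        struct stat st;

        if (fstat(fd, &st) != 0)
            return -1;

        /* Reserve RESERVE_BYTES beyond current EOF without changing st_size,
         * so the blocks are laid out contiguously before they are written. */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, st.st_size, RESERVE_BYTES) != 0)
        {
            fprintf(stderr, "fallocate failed: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }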



 Craig





Re: [PERFORM] Block at a time ...

2010-03-22 Thread Scott Carey

On Mar 17, 2010, at 9:41 AM, Craig James wrote:

 On 3/17/10 2:52 AM, Greg Stark wrote:
 On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com  wrote:
 I was thinking of something like that, except that the factor I'd use
 would be something like 50% or 100% of current size, capped at (say) 1 GB.
 
 This turns out to be a bad idea. One of the first things Oracle DBAs
 are told to do is change this default setting to allocate some
 reasonably large fixed size rather than scaling upwards.
 
 This might be mostly due to Oracle's extent-based space management but
 I'm not so sure. Recall that the filesystem is probably doing some
 rounding itself. If you allocate 120kB it's probably allocating 128kB
 itself anyways. Having two layers rounding up will result in odd
 behaviour.
 
 In any case I was planning on doing this a while back. Then I ran some
 experiments and couldn't actually demonstrate any problem. ext2 seems
 to do a perfectly reasonable job of avoiding this problem. All the
 files were mostly large contiguous blocks after running some tests --
 IIRC running pgbench.
 
 This is one of the more-or-less solved problems in Unix/Linux.  Ext* file 
 systems have a reserve usually of 10% of the disk space that nobody except 
 root can use.  It's not for root, it's because with 10% of the disk free, you 
 can almost always do a decent job of allocating contiguous blocks and get 
 good performance.  Unless Postgres has some weird problem that Linux has 
 never seen before (and that wouldn't be unprecedented...), there's probably 
 no need to fool with file-allocation strategies.
 
 Craig
 

It's fairly easy to break.  Just do a parallel import with say, 16 concurrent 
tables being written to at once.  Result?  Fragmented tables.





Re: [PERFORM] Block at a time ...

2010-03-22 Thread Pierre C


This is one of the more-or-less solved problems in Unix/Linux.  Ext*  
file systems have a reserve usually of 10% of the disk space that  
nobody except root can use.  It's not for root, it's because with 10%  
of the disk free, you can almost always do a decent job of allocating  
contiguous blocks and get good performance.  Unless Postgres has some  
weird problem that Linux has never seen before (and that wouldn't be  
unprecedented...), there's probably no need to fool with  
file-allocation strategies.


Craig


It's fairly easy to break.  Just do a parallel import with say, 16  
concurrent tables being written to at once.  Result?  Fragmented tables.


Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a  
medium-high rate (a few megabytes per second and up) when lots of data can  
sit in the cache and be flushed/allocated as big contiguous chunks. I'm  
pretty sure ext4/XFS would pass your parallel import test.


However, if you have files like tables (and indexes) or logs that grow  
slowly over time (something like a few megabytes per hour or less), after  
a few days/weeks/months, horrible fragmentation is an almost guaranteed  
result on many filesystems (NTFS being perhaps the absolute worst).





Re: [PERFORM] Block at a time ...

2010-03-22 Thread Dave Crooke
This is why pre-allocation is a good idea if you have the space ...

Tom, what about a really simple command in a forthcoming release of PG that
would just preformat a 1GB file at a time? This is what I've always scripted
with Oracle (ALTER TABLESPACE foo ADD DATAFILE ...) rather than
relying on its autoextender when performance has been a concern.
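
To make the idea concrete, here is a minimal sketch of what such a 
preformatting step could look like at the OS level -- create the file and 
write zeroed blocks up to the target size, then fsync.  This is illustrative 
only; the file name, 8kB block size, and 1GB segment size are assumptions 
taken from the discussion, not an actual or proposed PG command:

    /* Sketch: preformat a 1GB segment by writing zeroed 8kB blocks.
     * "segment.1" and the sizes are illustrative assumptions. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ   8192
    #define SEG_SIZE (1024L * 1024 * 1024)   /* 1GB segment */

    int main(void)
    {
        char block[BLCKSZ];
        long written;
        int  fd;

        memset(block, 0, sizeof(block));

        fd = open("segment.1", O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Write the whole segment up front so the filesystem can hand out
         * large contiguous allocations instead of one block at a time. */
        for (written = 0; written < SEG_SIZE; written += BLCKSZ)
        {
            if (write(fd, block, BLCKSZ) != BLCKSZ) { perror("write"); return 1; }
        }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        close(fd);
        return 0;
    }

This is essentially the dd-from-/dev/zero approach mentioned elsewhere in the 
thread, so it costs a few seconds of sequential writes per segment on decent 
storage.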

Cheers
Dave

On Mon, Mar 22, 2010 at 3:55 PM, Pierre C li...@peufeu.com wrote:


  This is one of the more-or-less solved problems in Unix/Linux.  Ext* file
 systems have a reserve usually of 10% of the disk space that nobody except
 root can use.  It's not for root, it's because with 10% of the disk free,
 you can almost always do a decent job of allocating contiguous blocks and
 get good performance.  Unless Postgres has some weird problem that Linux has
 never seen before (and that wouldn't be unprecedented...), there's probably
 no need to fool with file-allocation strategies.

 Craig


 It's fairly easy to break.  Just do a parallel import with say, 16
 concurrent tables being written to at once.  Result?  Fragmented tables.


 Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a
 medium-high rate (a few megabytes per second and up) when lots of data can
 sit in the cache and be flushed/allocated as big contiguous chunks. I'm
 pretty sure ext4/XFS would pass your parallel import test.

 However if you have files like tables (and indexes) or logs that grow
 slowly over time (something like a few megabytes per hour or less), after a
 few days/weeks/months, horrible fragmentation is an almost guaranteed result
 on many filesystems (NTFS being perhaps the absolute worst).






Re: [PERFORM] Block at a time ...

2010-03-22 Thread Greg Stark
On Mon, Mar 22, 2010 at 6:47 PM, Scott Carey sc...@richrelevance.com wrote:
 It's fairly easy to break.  Just do a parallel import with say, 16 concurrent 
 tables being written to at once.  Result?  Fragmented tables.


FWIW, I did do some investigation of this at one point and could not
demonstrate any significant fragmentation. But that was on Linux --
different filesystem implementations would have different success
rates. And there could be other factors as well, such as how full the
filesystem is or how old it is.

-- 
greg



Re: [PERFORM] Block at a time ...

2010-03-22 Thread Craig James

On 3/22/10 11:47 AM, Scott Carey wrote:


On Mar 17, 2010, at 9:41 AM, Craig James wrote:


On 3/17/10 2:52 AM, Greg Stark wrote:

On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com   wrote:

I was thinking of something like that, except that the factor I'd use
would be something like 50% or 100% of current size, capped at (say) 1 GB.


This turns out to be a bad idea. One of the first things Oracle DBAs
are told to do is change this default setting to allocate some
reasonably large fixed size rather than scaling upwards.

This might be mostly due to Oracle's extent-based space management but
I'm not so sure. Recall that the filesystem is probably doing some
rounding itself. If you allocate 120kB it's probably allocating 128kB
itself anyways. Having two layers rounding up will result in odd
behaviour.

In any case I was planning on doing this a while back. Then I ran some
experiments and couldn't actually demonstrate any problem. ext2 seems
to do a perfectly reasonable job of avoiding this problem. All the
files were mostly large contiguous blocks after running some tests --
IIRC running pgbench.


This is one of the more-or-less solved problems in Unix/Linux.  Ext* file systems have a 
reserve usually of 10% of the disk space that nobody except root can use.  
It's not for root, it's because with 10% of the disk free, you can almost always do a 
decent job of allocating contiguous blocks and get good performance.  Unless Postgres has 
some weird problem that Linux has never seen before (and that wouldn't be 
unprecedented...), there's probably no need to fool with file-allocation strategies.

Craig



It's fairly easy to break.  Just do a parallel import with say, 16 concurrent 
tables being written to at once.  Result?  Fragmented tables.


Is this from real-life experience?  With fragmentation, there's a point of diminishing 
return.  A couple head-seeks now and then hardly matter.  My recollection is that even 
when there are lots of concurrent processes running that are all making files larger and 
larger, the Linux file system still can do a pretty good job of allocating 
mostly-contiguous space.  It doesn't just dumbly allocate from some list, but rather 
tries to allocate in a way that results in pretty good contiguousness (if 
that's a word).

On the other hand, this is just from reading discussion groups like this one 
over the last few decades, I haven't tried it...

Craig



Re: [PERFORM] Block at a time ...

2010-03-17 Thread Pierre C

I was thinking of something like that, except that the factor I'd use
would be something like 50% or 100% of current size, capped at (say) 1  
GB.


Using fallocate() ?




Re: [PERFORM] Block at a time ...

2010-03-17 Thread Greg Stark
On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com wrote:
 I was thinking of something like that, except that the factor I'd use
 would be something like 50% or 100% of current size, capped at (say) 1 GB.

This turns out to be a bad idea. One of the first things Oracle DBAs
are told to do is change this default setting to allocate some
reasonably large fixed size rather than scaling upwards.

This might be mostly due to Oracle's extent-based space management but
I'm not so sure. Recall that the filesystem is probably doing some
rounding itself. If you allocate 120kB it's probably allocating 128kB
itself anyways. Having two layers rounding up will result in odd
behaviour.

In any case I was planning on doing this a while back. Then I ran some
experiments and couldn't actually demonstrate any problem. ext2 seems
to do a perfectly reasonable job of avoiding this problem. All the
files were mostly large contiguous blocks after running some tests --
IIRC running pgbench.


 Using fallocate() ?

I think we need posix_fallocate().
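
For context, a minimal sketch of how posix_fallocate() might be used to 
pre-extend a segment to a fixed 1GB size -- purely illustrative, with an 
assumed already-open file descriptor; this is not code from Postgres:

    /* Sketch: grow a file to a fixed 1GB size with posix_fallocate().
     * Assumes fd is an already-open, writable file descriptor. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define SEG_SIZE (1024L * 1024 * 1024)   /* 1GB */

    int preallocate_segment(int fd)
    {
        /* posix_fallocate() returns an error number directly (not -1/errno). */
        int err = posix_fallocate(fd, 0, SEG_SIZE);

        if (err != 0)
        {
            fprintf(stderr, "posix_fallocate failed: %s\n", strerror(err));
            return -1;
        }
        return 0;
    }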

-- 
greg



Re: [PERFORM] Block at a time ...

2010-03-17 Thread Dave Crooke
Greg - with Oracle, I always use fixed 2GB DBFs for portability, and
preallocate the whole file in advance. However, the situation is a bit
different in that Oracle will put blocks from multiple tables and indexes in
a DBF if you don't tell it differently.

Tom - I'm not sure what Oracle does, but it literally writes the whole
extent before using it ... I think they are just doing the literal
equivalent of *dd if=/dev/zero* ... it takes several seconds to prep a 2GB
file on decent storage.

Cheers
Dave

On Wed, Mar 17, 2010 at 9:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Greg Stark gsst...@mit.edu writes:
  I think we need posix_fallocate().

 The problem with posix_fallocate (other than questionable portability)
 is that it doesn't appear to guarantee anything at all about what is in
 the space it allocates.  Worst case, we might find valid-looking
 Postgres data there (eg, because a block was recycled from some recently
 dropped table).  If we have to write something anyway to zero the space,
 what's the point?

regards, tom lane



Re: [PERFORM] Block at a time ...

2010-03-17 Thread Craig James

On 3/17/10 2:52 AM, Greg Stark wrote:

On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com  wrote:

I was thinking of something like that, except that the factor I'd use
would be something like 50% or 100% of current size, capped at (say) 1 GB.


This turns out to be a bad idea. One of the first things Oracle DBAs
are told to do is change this default setting to allocate some
reasonably large fixed size rather than scaling upwards.

This might be mostly due to Oracle's extent-based space management but
I'm not so sure. Recall that the filesystem is probably doing some
rounding itself. If you allocate 120kB it's probably allocating 128kB
itself anyways. Having two layers rounding up will result in odd
behaviour.

In any case I was planning on doing this a while back. Then I ran some
experiments and couldn't actually demonstrate any problem. ext2 seems
to do a perfectly reasonable job of avoiding this problem. All the
files were mostly large contiguous blocks after running some tests --
IIRC running pgbench.


This is one of the more-or-less solved problems in Unix/Linux.  Ext* file systems have a 
reserve usually of 10% of the disk space that nobody except root can use.  
It's not for root, it's because with 10% of the disk free, you can almost always do a 
decent job of allocating contiguous blocks and get good performance.  Unless Postgres has 
some weird problem that Linux has never seen before (and that wouldn't be 
unprecedented...), there's probably no need to fool with file-allocation strategies.

Craig



Re: [PERFORM] Block at a time ...

2010-03-17 Thread Bob Lunney
Greg is correct, as usual.  Geometric growth of files is A Bad Thing in an  
Oracle DBA's world, since you can unexpectedly (automatically?) run out of file 
system space when the database determines it needs x% more extents than last 
time.

The concept of contiguous extents, however, has some merit, particularly when 
restoring databases.  Prior to parallel restore, a table's files were created 
and extended in roughly contiguous allocations, presuming there was no other 
activity on your database disks.  (You do dedicate disks, don't you?)  When 
using 8-way parallel restore against a six-disk RAID 10 group I found that 
table and index scan performance dropped by about 10x.  I/O performance was 
restored by either clustering the tables one at a time, or by dropping and 
restoring them one at a time.  The only reason I can come up with for this 
behavior is file fragmentation and increased seek times.

If PostgreSQL had a mechanism to pre-allocate files prior to restoring the 
database, that might mitigate the problem.  

Then if we could only get parallel index operations ...

Bob Lunney

--- On Wed, 3/17/10, Greg Stark gsst...@mit.edu wrote:

 From: Greg Stark gsst...@mit.edu
 Subject: Re: [PERFORM] Block at a time ...
 To: Pierre C li...@peufeu.com
 Cc: Alvaro Herrera alvhe...@commandprompt.com, Dave Crooke 
 dcro...@gmail.com, pgsql-performance@postgresql.org
 Date: Wednesday, March 17, 2010, 5:52 AM
 On Wed, Mar 17, 2010 at 7:32 AM, Pierre C li...@peufeu.com wrote:
  I was thinking of something like that, except that the factor I'd use
  would be something like 50% or 100% of current size, capped at (say) 1 GB.
 
 This turns out to be a bad idea. One of the first things Oracle DBAs
 are told to do is change this default setting to allocate some
 reasonably large fixed size rather than scaling upwards.
 
 This might be mostly due to Oracle's extent-based space management but
 I'm not so sure. Recall that the filesystem is probably doing some
 rounding itself. If you allocate 120kB it's probably allocating 128kB
 itself anyways. Having two layers rounding up will result in odd
 behaviour.
 
 In any case I was planning on doing this a while back. Then I ran some
 experiments and couldn't actually demonstrate any problem. ext2 seems
 to do a perfectly reasonable job of avoiding this problem. All the
 files were mostly large contiguous blocks after running some tests --
 IIRC running pgbench.
 
 
  Using fallocate() ?
 
 I think we need posix_fallocate().
 
 -- 
 greg
 
 -- 
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance
 


  



[PERFORM] Block at a time ...

2010-03-16 Thread Dave Crooke
I agree with Tom: any reordering attempt is at best second-guessing the
filesystem and underlying storage.

However, having the ability to control the extent size would be a worthwhile
improvement for systems that walk and chew gum (write to lots of tables)
concurrently.

I'm thinking of Oracle's AUTOEXTEND settings for tablespace datafiles ... I
think the ideal way to do it for PG would be to make the equivalent
configurable in postgresql.conf system-wide, and allow specific per-table
settings in the SQL metadata, similar to auto-vacuum.

An awesomely simple alternative is to just specify the extension as e.g. 5%
of the existing table size ... it starts by adding one block at a time for
tiny tables, and once your table is over 20GB, it ends up adding a whole 1GB
file and pre-allocating it. Very little wastage.
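
As a rough illustration of that policy, here is a small sketch of the 
arithmetic: extend by 5% of the current size, rounded to whole 8kB blocks, 
never less than one block and never more than a 1GB segment.  The constants 
and function name are assumptions for illustration, not a proposed PG setting:

    /* Sketch of a proportional extension policy: 5% of current size,
     * clamped to [one block, 1GB] and rounded to whole blocks. */
    #include <stdint.h>

    #define BLCKSZ      8192
    #define MAX_EXTEND  (1024LL * 1024 * 1024)   /* cap: one 1GB segment */

    int64_t next_extension_bytes(int64_t current_size)
    {
        int64_t ext = current_size / 20;          /* 5% of current size */

        if (ext < BLCKSZ)
            ext = BLCKSZ;                         /* tiny table: one block at a time */
        if (ext > MAX_EXTEND)
            ext = MAX_EXTEND;                     /* huge table: a whole 1GB file */

        return (ext / BLCKSZ) * BLCKSZ;           /* round down to a whole block */
    }

A 160kB table would then grow by a single 8kB block, while a table past 20GB 
would grow by a full 1GB segment, matching the behaviour described above.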

Cheers
Dave

On Tue, Mar 16, 2010 at 4:49 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Tom Lane wrote:
  Alvaro Herrera alvhe...@commandprompt.com writes:
   Maybe it would make more sense to try to reorder the fsync calls
   instead.
 
  Reorder to what, though?  You still have the problem that we don't know
  much about the physical layout on-disk.

 Well, to block numbers as a first step.

 However, this reminds me that sometimes we take the block-at-a-time
 extension policy too seriously.  We had a customer that had a
 performance problem because they were inserting lots of data into TOAST 
 tables, causing very frequent extensions.  I kept wondering whether an
 allocation policy that allocated several new blocks at a time could be
 useful (but I didn't try it).  This would also alleviate fragmentation,
 thus helping the physical layout be more similar to logical block
 numbers.
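
 To make the contrast concrete, here is a minimal sketch of block-at-a-time
 extension versus extending by several blocks in a single write.  It is
 illustrative only -- the helper names are assumptions, and this is not how
 Postgres's storage manager is actually structured:

    /* Sketch: extend a file by N zeroed blocks in one write() instead of
     * one block per call, giving the filesystem fewer, larger allocations. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Block-at-a-time: one 8kB append per new block needed. */
    static int extend_one_block(int fd)
    {
        char block[BLCKSZ];

        memset(block, 0, sizeof(block));
        return (write(fd, block, BLCKSZ) == BLCKSZ) ? 0 : -1;
    }

    /* Several-at-a-time: append nblocks zeroed blocks in a single write(). */
    static int extend_many_blocks(int fd, int nblocks)
    {
        size_t  len = (size_t) nblocks * BLCKSZ;
        char   *buf = calloc(1, len);
        int     rc;

        if (buf == NULL)
            return -1;
        rc = (write(fd, buf, len) == (ssize_t) len) ? 0 : -1;
        free(buf);
        return rc;
    }

 Whether the filesystem turns the larger write into a contiguous allocation
 still depends on delayed allocation and how full the disk is, as discussed
 elsewhere in this thread.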

 --
 Alvaro Herrera
 http://www.CommandPrompt.com/
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support