Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-12-14 Thread Markus Schaber
Hi, Christopher,
[sorry for the delay of my answer, we were rather busy the last few weeks]

On Thu, 04 Nov 2004 21:29:04 -0500
Christopher Browne [EMAIL PROTECTED] wrote:

 In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] 
 (Markus Schaber) transmitted:
  We should create a list of those needs, and then communicate those
  to the kernel/fs developers. Then we (as well as other apps) can
  make use of those features where they are available, and use the old
  way everywhere else.
 
 Which kernel/fs developers did you have in mind?  The ones working on
 Linux?  Or FreeBSD?  Or DragonflyBSD?  Or Solaris?  Or AIX?

All of them, and others (e.g. Windows).

Once we have a list of those needs, the advocates can talk to the OS
developers. Some OS developers will follow, others not.

Then the postgres folks (and other application developers who benefit
from these capabilities) can point interested users to our benchmarks and
tell them that Foox performs 3 times as fast as BaarOs because it
provides better support for database needs.

 Please keep in mind that many of the PostgreSQL developers are BSD
 folk that aren't particularly interested in creating bleeding edge
 Linux capabilities.

Then this should be motivation to add those things to BSD, maybe as a
patch or loadable module so it does not bloat mainstream. I personally
would prefer it to appear in BSD first, because in case it really pays
off, it won't be long until it appears in Linux as well :-)

 Jumping into a customized filesystem that neither hardware nor
 software vendors would remotely consider supporting just doesn't look
 like a viable strategy to me.

I did not vote for a custom filesystem, as the OP did. I did vote for
isolating a set of useful capabilities PostgreSQL could exploit, and
then trying to convince the kernel developers to include these capabilities,
so they are likely to be included in the main distributions.

I don't know about the BSD market, but I know that Red Hat and SuSE often
ship their own patched versions of the kernel (so they officially
support the extensions), and most of this is likely to be included in
mainline later.

  Maybe Reiser4 is a step into the right way, and maybe even a
  postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
  etc. already have such capabilities. Maybe that's completely wrong.
 
 The capabilities tend to be redundant.  They tend to implement vaguely
 similar transactional capabilities to what databases have to
 implement.  The similarities are not close enough to eliminate either
 variety of commit as redundant.

But a speed gain may be possible by coordinating DB and FS transactions.

Thanks,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Markus Schaber
Hi, Leeuw,

On Thu, 21 Oct 2004 12:44:10 +0200
Leeuw van der, Tim [EMAIL PROTECTED] wrote:

 (I'm not sure if it's a good idea to create a PG-specific FS in your
 OS of choice, but it's certainly gonna be easier than getting FS code
 inside of PG)

I don't think PG really needs a specific FS. I rather think that PG
could profit from some functionality that's missing in traditional UN*X
file systems.

posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
well as syncing a bunch of data in different files with a single call
(so that the OS can determine the best write order). I can also imagine
some interaction with the FS journalling system (to avoid duplicate
efforts).

We should create a list of those needs, and then communicate those to
the kernel/fs developers. Then we (as well as other apps) can make use
of those features where they are available, and use the old way
everywhere else.

Maybe Reiser4 is a step in the right direction, and maybe even a postgres
plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already
have such capabilities. Maybe that's completely wrong.

cheers,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Chris Browne
[EMAIL PROTECTED] (Pierre-Frédéric Caillaud) writes:
 posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
 well as syncing a bunch of data in different files with a single call
 (so that the OS can determine the best write order). I can also imagine
 some interaction with the FS journalling system (to avoid duplicate
 efforts).

 There is also the fact that syncing after every transaction
 could be changed to syncing every N transactions (N fixed or
 depending on the data size written by the transactions) which would
 be more efficient than the current behaviour with a sleep. HOWEVER
 suppressing the sleep() would lead to postgres returning from the
 COMMIT while it is in fact not synced, which somehow rings a huge
 alarm bell somewhere.

 What about read order?
 This could be very useful for SELECT queries involving
 indexes, which in case of a non-clustered table lead to random seeks
 in the table.

Another thing that would be valuable would be to have some way to say:

  Read this data; don't bother throwing other data out of the cache
   to stuff this in.

Something like a read_uncached() call...

That would mean that a seq scan or a vacuum wouldn't force useful data
out of cache.
-- 
let name=cbbrowne and tld=cbbrowne.com in String.concat @ [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Simon Riggs
On Thu, 2004-11-04 at 15:47, Chris Browne wrote:

 Another thing that would be valuable would be to have some way to say:
 
   Read this data; don't bother throwing other data out of the cache
to stuff this in.
 
 Something like a read_uncached() call...
 
 That would mean that a seq scan or a vacuum wouldn't force useful data
 out of cache.

ARC does almost exactly those two things in 8.0.

Seq scans do get put in cache, but in a way that means they don't spoil
the main bulk of the cache.

-- 
Best Regards, Simon Riggs




Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Steinar H. Gunderson
On Thu, Nov 04, 2004 at 10:47:31AM -0500, Chris Browne wrote:
 Another thing that would be valuable would be to have some way to say:
 
   Read this data; don't bother throwing other data out of the cache
to stuff this in.
 
 Something like a read_uncached() call...

You mean, like, open(filename, O_DIRECT)? :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
 Something like a read_uncached() call...
 
 That would mean that a seq scan or a vacuum wouldn't force useful data
 out of cache.

 ARC does almost exactly those two things in 8.0.

But only for Postgres' own shared buffers.  The kernel cache still gets
trashed, because we have no way to suggest to the kernel that it not
hang onto the data read in.

regards, tom lane



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Simon Riggs
On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
  Something like a read_uncached() call...
  
  That would mean that a seq scan or a vacuum wouldn't force useful data
  out of cache.
 
  ARC does almost exactly those two things in 8.0.
 
 But only for Postgres' own shared buffers.  The kernel cache still gets
 trashed, because we have no way to suggest to the kernel that it not
 hang onto the data read in.

I guess a difference in viewpoints. I'm inclined to give most of the RAM
to PostgreSQL, since as you point out, the kernel is out of our control.
That way, we can do what we like with it - keep it or not, as we choose.

-- 
Best Regards, Simon Riggs




Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
 But only for Postgres' own shared buffers.  The kernel cache still gets
 trashed, because we have no way to suggest to the kernel that it not
 hang onto the data read in.

 I guess a difference in viewpoints. I'm inclined to give most of the RAM
 to PostgreSQL, since as you point out, the kernel is out of our control.
 That way, we can do what we like with it - keep it or not, as we choose.

That's always been a Bad Idea for three or four different reasons, of
which ARC will eliminate no more than one.

regards, tom lane



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Christopher Browne
In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Markus 
Schaber) transmitted:
 We should create a list of those needs, and then communicate those
 to the kernel/fs developers. Then we (as well as other apps) can
 make use of those features where they are available, and use the old
 way everywhere else.

Which kernel/fs developers did you have in mind?  The ones working on
Linux?  Or FreeBSD?  Or DragonflyBSD?  Or Solaris?  Or AIX?

Please keep in mind that many of the PostgreSQL developers are BSD
folk that aren't particularly interested in creating bleeding edge
Linux capabilities.

Furthermore, I'd think long and hard before jumping into such a
_spectacularly_ bleeding edge kind of project.  The reason why you
would want this would be if you needed to get some margin of
performance.  I can't see wanting that without also wanting some
_assurance_ of system reliability, at which point I also want things
like vendor support.

If you've ever contacted Red Hat Software, you'd know that they very
nearly refuse to provide support for any filesystem other than ext3.
Use anything else and they'll make noises about not being able to
assure you of anything at all.

If you need high performance, you'd also want to use interesting sorts
of hardware.  Disk arrays, RAID controllers, that sort of thing.
Vendors of such things don't particularly want to talk to you unless
you're using a supported Linux distribution and a supported
filesystem.

Jumping into a customized filesystem that neither hardware nor
software vendors would remotely consider supporting just doesn't look
like a viable strategy to me.

 Maybe Reiser4 is a step into the right way, and maybe even a
 postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
 etc. already have such capabilities. Maybe that's completely wrong.

The capabilities tend to be redundant.  They tend to implement vaguely
similar transactional capabilities to what databases have to
implement.  The similarities are not close enough to eliminate either
variety of commit as redundant.
-- 
cbbrowne,@,linuxfinances.info
http://linuxfinances.info/info/linux.html
Rules of the  Evil Overlord #128. I will not  employ robots as agents
of  destruction  if  there  is  any  possible way  that  they  can  be
re-programmed  or if their  battery packs  are externally  mounted and
easily removable. http://www.eviloverlord.com/



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Christopher Browne
After a long battle with technology, [EMAIL PROTECTED] (Simon Riggs), an earthling, 
wrote:
 On Thu, 2004-11-04 at 15:47, Chris Browne wrote:

 Another thing that would be valuable would be to have some way to say:
 
   Read this data; don't bother throwing other data out of the cache
to stuff this in.
 
 Something like a read_uncached() call...
 
 That would mean that a seq scan or a vacuum wouldn't force useful
 data out of cache.

 ARC does almost exactly those two things in 8.0.

 Seq scans do get put in cache, but in a way that means they don't
 spoil the main bulk of the cache.

We're not talking about the same cache.

ARC does these exact things for _shared memory_ cache, and is the
obvious inspiration.

But it does more or less nothing about the way OS file buffer cache is
managed, and the handling of _that_ would be the point of modifying OS
filesystem semantics.
-- 
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://www3.sympatico.ca/cbbrowne/oses.html
Have you ever considered beating yourself with a cluestick?



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Neil Conway
On Fri, 2004-11-05 at 06:20, Steinar H. Gunderson wrote:
 You mean, like, open(filename, O_DIRECT)? :-)

This disables readahead (at least on Linux), which is certainly not what we
want: for the very case where we don't want to keep the data in cache
for a while (sequential scans, VACUUM), we also want aggressive
readahead.

-Neil





Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Neil Conway
On Thu, 2004-11-04 at 23:29, Pierre-Frédéric Caillaud wrote:
 There is also the fact that syncing after every transaction could be
 changed to syncing every N transactions (N fixed or depending on the data
 size written by the transactions) which would be more efficient than the
 current behaviour with a sleep.

Uh, which sleep are you referring to?

Also, how would interacting with the filesystem's journal affect how
often we need to force-write the WAL to disk? (ISTM we need to sync
_something_ to disk when a transaction commits in order to maintain the
WAL invariant.)
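
As a rough illustration of the "sync every N transactions" idea together
with the invariant above, here is a toy sketch in Python. It is not
PostgreSQL code: the BatchedLog class, its fields and the batch size are
all hypothetical. It issues one fsync per batch, but only treats a commit
as durable (and so safe to acknowledge) once the fsync covering its record
has returned:

```python
import os


class BatchedLog:
    """Toy append-only log that fsyncs once per batch of commits."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = []     # commit ids written but not yet synced
        self.durable = []     # commit ids that may be acknowledged
        self.fsync_calls = 0

    def commit(self, commit_id, payload):
        # Append the record; the commit is NOT durable yet.
        os.write(self.fd, payload)
        self.pending.append(commit_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # One fsync makes every pending record durable at once.
        if not self.pending:
            return
        os.fsync(self.fd)
        self.fsync_calls += 1
        self.durable.extend(self.pending)
        self.pending.clear()

    def close(self):
        self.flush()
        os.close(self.fd)
```

With a batch size of 4, ten commits cost three fsync calls instead of ten.
The price is that a commit sitting in `pending` has to wait for its batch
before the client can be told it succeeded, which is exactly the alarm bell
mentioned in the quoted text.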

 There's fadvise to tell the OS to readahead on a seq scan (I think the OS
 detects it anyway)

Not perfectly, though; also, Linux will do a more aggressive readahead
if you tell it to do so via posix_fadvise().

 if there was a system call telling the OS in the
 next seconds I'm going to read these chunks of data from this file (gives
 a list of offsets and lengths), could you put them in your cache in the
 most efficient order without seeking too much, so that when I read() them
 in random order, they will be in the cache already?

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_WILLNEED 
Specifies that the application expects to access the specified
data in the near future.
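
A minimal sketch of the "list of offsets and lengths" request, assuming
Python's os.posix_fadvise wrapper (Unix only) stands in for the C call.
The prefetch helper and its return value are hypothetical; the advice is
only a hint, so errors are ignored and the kernel is free to fetch the
ranges in whatever order it likes:

```python
import os


def prefetch(fd, chunks):
    """Hint that each (offset, length) chunk will be read soon.

    Returns the number of hints the OS accepted; 0 on platforms
    without posix_fadvise.
    """
    if not hasattr(os, "posix_fadvise"):
        return 0
    issued = 0
    # Sorting by offset is optional; the kernel schedules the I/O itself.
    for offset, length in sorted(chunks):
        try:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        except OSError:
            continue  # advice only: a refused hint is not an error here
        issued += 1
    return issued
```

After calling prefetch(), the application reads the chunks with ordinary
read() or pread() calls in any order it wants; whether they are already in
cache by then depends on the kernel and on timing.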

-Neil





Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-11-04 Thread Neil Conway
On Fri, 2004-11-05 at 02:47, Chris Browne wrote:
 Another thing that would be valuable would be to have some way to say:
 
   Read this data; don't bother throwing other data out of the cache
to stuff this in.

This is similar, although not exactly the same thing:

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_NOREUSE 
Specifies that the application expects to access the specified
data once and then not reuse it thereafter.
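
A small sketch of a read-once scan using that hint, again through Python's
os.posix_fadvise wrapper. The scan_once helper and the 8192-byte block
size are assumptions for illustration. How much the hint actually changes
caching depends on the kernel, so treat it as a request rather than a
guarantee:

```python
import os


def scan_once(path, block_size=8192):
    """Read a file front to back, hinting that the data will not be reused.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset 0, length 0 means "the whole file"
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
            except OSError:
                pass  # advice only; carry on with a plain scan
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            total += len(block)
    finally:
        os.close(fd)
    return total
```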

-Neil





[PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Matt Clark
I suppose I'm just idly wondering really.  Clearly it's against PG
philosophy to build an FS or direct IO management into PG, but now it's so
relatively easy to plug filesystems into the main open-source OSes, it
struck me that there might be some useful changes to, say, XFS or ext3, that
could be made that would help PG out.

I'm thinking along the lines of an FS that's aware of PG's strategies and
requirements and therefore optimised to make those activities as efficient
as possible - possibly even being aware of PG's disk layout and treating
files differently on that basis.

Not being an FS guru I'm not really clear on whether this would help much
(enough to be worth it anyway) or not - any thoughts?  And if there were
useful gains to be had, would it need a whole new FS or could an existing
one be modified?

So there might be (as I said, I'm not an FS guru...):
* great append performance for the WAL?
* optimised scattered writes for checkpointing?
* Knowledge that FSYNC is being used for preserving ordering a lot of the
time, rather than requiring actual writes to disk (so long as the writes
eventually happen in order...)?


Matt



Matt Clark
Ymogen Ltd
P: 0845 130 4531
W: https://ymogen.net/
M: 0774 870 1584
 




Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Leeuw van der, Tim
Hiya,

Looking at that list, I got the feeling that you'd want to push that PG-awareness down 
into the block-io layer as well, then, so as to be able to optimise for (perhaps) 
conflicting goals depending on what the app does; for the IO system to be able to read 
the app's mind it needs to have some knowledge of what the app is / needs / wants and I 
get the impression that this awareness needs to go deeper than the FS only.

--Tim

(But you might have time to rewrite Linux/BSD as a PG-OS? just kidding!)



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Pierre-Frédéric Caillaud
Reiser4 ?


Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Steinar H. Gunderson
On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:
 I suppose I'm just idly wondering really.  Clearly it's against PG
 philosophy to build an FS or direct IO management into PG, but now it's so
 relatively easy to plug filesystems into the main open-source OSes, it
 struck me that there might be some useful changes to, say, XFS or ext3, that
 could be made that would help PG out.

This really sounds like a poor replacement for just making PostgreSQL use raw
devices to me. (I have no idea why that isn't done already, but presumably it
isn't all that easy to get right. :-) )

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Leeuw van der, Tim
Hi,

I guess the difference is in 'severe hacking inside PG' vs. 'some unknown amount of 
hacking that doesn't touch PG code'.

Hacking PG internally to handle raw devices will meet with strong resistance from 
large portions of the development team. I don't expect (m)any core devs of PG will be 
excited about rewriting the entire I/O architecture of PG and duplicating large 
amounts of OS type of code inside the application, just to try to attain an unknown 
performance benefit.

PG doesn't use one big file, as some databases do, but many small files. Now PG would 
need to be able to do file-management, if you put the PG database on a raw disk 
partition! That's icky stuff, and you'll find much resistance against putting such 
code inside PG.
So why not try to have the external FS know a bit about PG and its directory layout, 
and its IO requirements? Then that kind of code can at least be maintained outside 
the application, and will not be as much of a burden to the rest of the application.

(I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but 
it's certainly gonna be easier than getting FS code inside of PG)

cheers,

--Tim



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Aaron Werman
The intuitive thing would be to put pg into a file system. 

/Aaron

-- 

Regards,
/Aaron



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Neil Conway
Matt Clark wrote:
 I'm thinking along the lines of an FS that's aware of PG's strategies and
 requirements and therefore optimised to make those activities as efficient
 as possible - possibly even being aware of PG's disk layout and treating
 files differently on that basis.

As someone else noted, this doesn't belong in the filesystem (rather the 
kernel's block I/O layer/buffer cache). But I agree, an API by which we 
can tell the kernel what kind of I/O behavior to expect would be good. 
The kernel needs to provide good behavior for a wide range of 
applications, but the DBMS can take advantage of a lot of 
domain-specific information. In theory, being able to pass that 
domain-specific information on to the kernel would mean we could get 
better performance without needing to reimplement large chunks of 
functionality that really ought to be done by the kernel anyway (as 
implementing raw I/O would require, for example). On the other hand, it 
would probably mean adding a fair bit of OS-specific hackery, which 
we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is 
posix_fadvise(). While that is technically-speaking a POSIX standard, it 
is not widely implemented (I know Linux 2.6 implements it; based on some 
quick googling, it looks like AIX does too). Using posix_fadvise() has 
been discussed in the past, so you might want to search the archives. We 
could use FADV_SEQUENTIAL to request more aggressive readahead on a file 
that we know we're about to sequentially scan. We might be able to use 
FADV_NOREUSE on the WAL. We might be able to get away with specifying 
FADV_RANDOM for indexes all of the time, or at least most of the time. 
One question is how this would interact with concurrent access (AFAICS 
there is no way to fetch the current advice on an fd...)
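
To make the mapping above concrete, here is a hedged sketch in Python
(whose os module wraps posix_fadvise on Unix). The pattern names, the
ADVICE_BY_PATTERN table and both helpers are hypothetical, not anything
PostgreSQL does. Because there is no call to read back the advice set on
an fd, the sketch remembers the last requested advice on the application
side:

```python
import os

# Hypothetical mapping from access pattern to the advice suggested above.
ADVICE_BY_PATTERN = {
    "seqscan": "POSIX_FADV_SEQUENTIAL",
    "index": "POSIX_FADV_RANDOM",
    "wal": "POSIX_FADV_NOREUSE",
}

# fd -> name of the advice last requested; the kernel offers no getter.
_last_advice = {}


def advise(fd, pattern):
    """Request the advice for an access pattern and remember it."""
    name = ADVICE_BY_PATTERN[pattern]
    flag = getattr(os, name, None)
    if flag is not None and hasattr(os, "posix_fadvise"):
        try:
            os.posix_fadvise(fd, 0, 0, flag)  # length 0: to end of file
        except OSError:
            pass  # advice only; fall back to default kernel behaviour
    _last_advice[fd] = name
    return name


def current_advice(fd):
    """Return the advice this process last requested for fd, if any."""
    return _last_advice.get(fd)
```

The bookkeeping is per process, so it does nothing for the concurrent
access question: another process working on the same file cannot see it.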

Also, I would imagine Win32 provides some means to inform the kernel 
about your expected I/O pattern, but I haven't checked. Does anyone know 
of any other relevant APIs?

-Neil


Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Steinar H. Gunderson
On Thu, Oct 21, 2004 at 12:44:10PM +0200, Leeuw van der, Tim wrote:
 Hacking PG internally to handle raw devices will meet with strong
 resistance from large portions of the development team. I don't expect
 (m)any core devs of PG will be excited about rewriting the entire I/O
 architecture of PG and duplicating large amounts of OS type of code inside
 the application, just to try to attain an unknown performance benefit.

Well, at least I see people claiming 30% difference between different file
systems, but no, I'm not shouting "bah, you'd better do this or I'll warez
Oracle" :-) I have no idea how much you can improve over the best
filesystems out there, but having two layers of journalling (both WAL _and_
FS journalling) on top of each other doesn't make all that much sense to me.
:-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Tom Lane
Steinar H. Gunderson [EMAIL PROTECTED] writes:
 ... I have no idea how much you can improve over the best
 filesystems out there, but having two layers of journalling (both WAL _and_
 FS journalling) on top of each other doesn't make all that much sense to me.

Which is why setting the FS to journal metadata but not file contents is
often suggested as best practice for a PG-only filesystem.

regards, tom lane



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Jan Dittmer
Neil Conway wrote:
 Also, I would imagine Win32 provides some means to inform the kernel 
 about your expected I/O pattern, but I haven't checked. Does anyone know 
 of any other relevant APIs?

See CreateFile, Parameter dwFlagsAndAttributes

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createfile.asp

There are FILE_FLAG_NO_BUFFERING, FILE_FLAG_OPEN_NO_RECALL,
FILE_FLAG_RANDOM_ACCESS and even FILE_FLAG_POSIX_SEMANTICS.

Jan




Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Steinar H. Gunderson
On Thu, Oct 21, 2004 at 10:20:55AM -0400, Tom Lane wrote:
 ... I have no idea how much you can improve over the best
 filesystems out there, but having two layers of journalling (both WAL _and_
 FS journalling) on top of each other doesn't make all that much sense to me.
 Which is why setting the FS to journal metadata but not file contents is
 often suggested as best practice for a PG-only filesystem.

Mm, but you still journal the metadata. Oh well, noatime etc. :-)

By the way, I'm probably hitting a FAQ here, but would O_DIRECT help
PostgreSQL any, given large enough shared_buffers?
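
For readers unfamiliar with the flag: O_DIRECT asks the kernel to bypass its page cache, which in turn requires aligned buffers and transfer sizes. The following is only a Linux-oriented sketch of what using it involves; the helper names and the 4096-byte alignment are assumptions made for illustration, and nothing here says whether it would actually help PostgreSQL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* Linux: expose O_DIRECT in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* no O_DIRECT here: the open below stays buffered */
#endif

#define IO_ALIGN 4096        /* assumed alignment; the real value depends on the device */

/* Allocate a buffer suitable for direct I/O; returns NULL on failure. */
void *alloc_io_buffer(size_t size)
{
    void *buf = NULL;

    assert(size % IO_ALIGN == 0);       /* transfer sizes must be aligned too */
    if (posix_memalign(&buf, IO_ALIGN, size) != 0)
        return NULL;
    return buf;
}

/* True if the pointer meets the assumed direct-I/O alignment. */
int is_io_aligned(const void *p)
{
    return ((uintptr_t) p % IO_ALIGN) == 0;
}

/*
 * Try to open the file with O_DIRECT; if the filesystem rejects the flag
 * with EINVAL, fall back to an ordinary buffered open.
 * Returns a file descriptor, or -1 with errno set.
 */
int open_maybe_direct(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDONLY);
    return fd;
}
```

Reads on such a descriptor would then go into a buffer from alloc_io_buffer(), at offsets that are multiples of IO_ALIGN.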

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Sean Chittenden
As someone else noted, this doesn't belong in the filesystem (rather 
the kernel's block I/O layer/buffer cache). But I agree, an API by 
which we can tell the kernel what kind of I/O behavior to expect would 
be good.
[snip]
The closest API to what you're describing that I'm aware of is 
posix_fadvise(). While that is technically-speaking a POSIX standard, 
it is not widely implemented (I know Linux 2.6 implements it; based on 
some quick googling, it looks like AIX does too).

Don't forget the widely implemented and useful madvise(2)/posix_madvise(2)
calls, which can give the OS the following hints: MADV_NORMAL,
MADV_SEQUENTIAL, MADV_RANDOM, MADV_WILLNEED, MADV_DONTNEED, and
MADV_FREE.  :)  -sc
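
To make the hint mechanism concrete, here is a small C sketch using posix_madvise(). These calls only apply to memory-mapped regions, so the sketch assumes the file is accessed through mmap(); the function names and the scratch-file path are hypothetical and exist only for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for posix_madvise() and mkstemp() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first len bytes of fd read-only and tell the OS to expect
 * random access. Returns the mapping, or NULL if mmap() fails.
 */
void *map_random_access(int fd, size_t len)
{
    void *p;

    assert(fd >= 0 && len > 0);
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* The hint is advisory: if it fails, the mapping is still usable. */
    (void) posix_madvise(p, len, POSIX_MADV_RANDOM);
    return p;
}

/*
 * Self-check: write one page to a scratch file, map it with the hint,
 * and read it back. Returns 0 on success, -1 on any failure.
 */
int madvise_demo(void)
{
    char path[] = "/tmp/madvise_demo_XXXXXX";
    char page[4096];
    char *p;
    int fd, ok;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file disappears once fd is closed */
    memset(page, 'x', sizeof page);
    if (write(fd, page, sizeof page) != (ssize_t) sizeof page) {
        close(fd);
        return -1;
    }
    p = map_random_access(fd, sizeof page);
    ok = (p != NULL && p[0] == 'x' && p[sizeof page - 1] == 'x');
    if (p != NULL)
        munmap(p, sizeof page);
    close(fd);
    return ok ? 0 : -1;
}
```

Swapping POSIX_MADV_RANDOM for POSIX_MADV_SEQUENTIAL or POSIX_MADV_WILLNEED gives the other hints from the list above; MADV_FREE has no posix_madvise() equivalent and needs plain madvise() where the platform provides it.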

--
Sean Chittenden


Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

2004-10-21 Thread Jim C. Nasby
Note that most people are now moving away from raw devices for databases
in most applications. The relatively small performance gain isn't worth
the hassle.

On Thu, Oct 21, 2004 at 12:27:27PM +0200, Steinar H. Gunderson wrote:
 On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:
  I suppose I'm just idly wondering really.  Clearly it's against PG
  philosophy to build an FS or direct IO management into PG, but now it's so
  relatively easy to plug filesystems into the main open-source OSes, it
  struck me that there might be some useful changes to, say, XFS or ext3, that
  could be made that would help PG out.
 
 This really sounds like a poor replacement for just making PostgreSQL use raw
 devices to me. (I have no idea why that isn't done already, but presumably it
 isn't all that easy to get right. :-) )
 
 /* Steinar */
 -- 
 Homepage: http://www.sesse.net/
 

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
