Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Hi, Christopher,

[sorry for the delay of my answer, we were rather busy the last few weeks]

On Thu, 04 Nov 2004 21:29:04 -0500 Christopher Browne [EMAIL PROTECTED] wrote:

> In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Markus Schaber) transmitted:
> > We should create a list of those needs, and then communicate those to the kernel/fs developers. Then we (as well as other apps) can make use of those features where they are available, and use the old way everywhere else.
>
> Which kernel/fs developers did you have in mind? The ones working on Linux? Or FreeBSD? Or DragonflyBSD? Or Solaris? Or AIX?

All of them, and others (e.g. Windows). Once we have a list of those needs, the advocates can talk to the OS developers. Some OS developers will follow, others will not. Then the postgres folks (and other application developers that benefit from these capabilities) can point interested users to our benchmarks and tell them that Foox performs 3 times as fast as BaarOs because it provides better support for database needs.

> Please keep in mind that many of the PostgreSQL developers are BSD folk that aren't particularly interested in creating bleeding edge Linux capabilities.

Then this should be motivation to add those things to BSD, maybe as a patch or loadable module so it does not bloat mainstream. I personally would prefer it to appear in BSD first, because in case it really pays off, it won't be long until it appears in Linux as well :-)

> Jumping into a customized filesystem that neither hardware nor software vendors would remotely consider supporting just doesn't look like a viable strategy to me.

I did not vote for a custom filesystem, as the OP did. I did vote for isolating a set of useful capabilities PostgreSQL could exploit, and then trying to convince the kernel developers to include these capabilities, so they are likely to be included in the main distributions.

I don't know about the BSD market, but I know that Redhat and SuSE often ship their patched versions of the kernels (so then they officially support the extensions), and most of this is likely to be included in mainstream later.

> > Maybe Reiser4 is a step in the right direction, and maybe even a postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already have such capabilities. Maybe that's completely wrong.
>
> The capabilities tend to be redundant. They tend to implement vaguely similar transactional capabilities to what databases have to implement. The similarities are not close enough to eliminate either variety of commit as redundant.

But a speed gain may be possible by coordinating DB and FS transactions.

Thanks,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:[EMAIL PROTECTED] | www.logi-track.com

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Hi, Leeuw,

On Thu, 21 Oct 2004 12:44:10 +0200 Leeuw van der, Tim [EMAIL PROTECTED] wrote:

> (I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but it's certainly gonna be easier than getting FS code inside of PG)

I don't think PG really needs a specific FS. I rather think that PG could profit from some functionality that's missing in traditional UN*X file systems.

posix_fadvise(2) may be a candidate. Read/write barriers are another one, as well as syncing a bunch of data in different files with a single call (so that the OS can determine the best write order). I can also imagine some interaction with the FS journalling system (to avoid duplicate efforts).

We should create a list of those needs, and then communicate those to the kernel/fs developers. Then we (as well as other apps) can make use of those features where they are available, and use the old way everywhere else.

Maybe Reiser4 is a step in the right direction, and maybe even a postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already have such capabilities. Maybe that's completely wrong.

cheers,
Markus
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
[EMAIL PROTECTED] (Pierre-Frédéric Caillaud) writes:

> > posix_fadvise(2) may be a candidate. Read/write barriers are another one, as well as syncing a bunch of data in different files with a single call (so that the OS can determine the best write order). I can also imagine some interaction with the FS journalling system (to avoid duplicate efforts).
>
> There is also the fact that syncing after every transaction could be changed to syncing every N transactions (N fixed or depending on the data size written by the transactions) which would be more efficient than the current behaviour with a sleep. HOWEVER suppressing the sleep() would lead to postgres returning from the COMMIT while it is in fact not synced, which somehow rings a huge alarm bell somewhere.
>
> What about read order? This could be very useful for SELECT queries involving indexes, which in case of a non-clustered table lead to random seeks in the table.

Another thing that would be valuable would be to have some way to say: "Read this data; don't bother throwing other data out of the cache to stuff this in." Something like a read_uncached() call...

That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.

--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.
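The "sync every N transactions" idea quoted above, and the alarm bell that goes with it, can be sketched roughly as follows. This is only an illustration in Python, not PostgreSQL's WAL code: `BatchedLog`, `batch_size` and `sync_calls` are hypothetical names invented for the example. The point it tries to show is that a commit which has been written but not yet fsynced must not be reported as durable.

```python
import os


class BatchedLog:
    """Hypothetical append-only log that calls fsync once per batch_size commits."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = 0     # commits written but not yet fsynced
        self.sync_calls = 0  # number of fsync calls actually issued

    def commit(self, record):
        """Append one record; return True only if it is durable on return."""
        os.write(self.fd, record)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()
            return True
        # Written to the OS, but NOT yet on disk: the caller must not
        # acknowledge this commit to the client until flush() has run.
        return False

    def flush(self):
        """Force all pending commits to disk with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.sync_calls += 1
            self.pending = 0

    def close(self):
        self.flush()
        os.close(self.fd)
```

With `batch_size` set to 4, ten commits cost two fsync calls plus a final one at `close()`, instead of ten. The price is exactly the one raised in the quoted text: the commits for which `commit()` returned False are lost if the machine crashes before the next flush.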
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, 2004-11-04 at 15:47, Chris Browne wrote:

> Another thing that would be valuable would be to have some way to say: "Read this data; don't bother throwing other data out of the cache to stuff this in." Something like a read_uncached() call...
>
> That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.

ARC does almost exactly those two things in 8.0. Seq scans do get put in cache, but in a way that means they don't spoil the main bulk of the cache.

--
Best Regards, Simon Riggs
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, Nov 04, 2004 at 10:47:31AM -0500, Chris Browne wrote:

> Another thing that would be valuable would be to have some way to say: "Read this data; don't bother throwing other data out of the cache to stuff this in." Something like a read_uncached() call...

You mean, like, open(filename, O_DIRECT)? :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Simon Riggs [EMAIL PROTECTED] writes:

> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
> > Something like a read_uncached() call...
> >
> > That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.
>
> ARC does almost exactly those two things in 8.0.

But only for Postgres' own shared buffers. The kernel cache still gets trashed, because we have no way to suggest to the kernel that it not hang onto the data read in.

regards, tom lane
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, 2004-11-04 at 19:34, Tom Lane wrote:

> Simon Riggs [EMAIL PROTECTED] writes:
> > On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
> > > Something like a read_uncached() call...
> > >
> > > That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.
> >
> > ARC does almost exactly those two things in 8.0.
>
> But only for Postgres' own shared buffers. The kernel cache still gets trashed, because we have no way to suggest to the kernel that it not hang onto the data read in.

I guess a difference in viewpoints. I'm inclined to give most of the RAM to PostgreSQL, since as you point out, the kernel is out of our control. That way, we can do what we like with it - keep it or not, as we choose.

--
Best Regards, Simon Riggs
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Simon Riggs [EMAIL PROTECTED] writes:

> On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
> > But only for Postgres' own shared buffers. The kernel cache still gets trashed, because we have no way to suggest to the kernel that it not hang onto the data read in.
>
> I guess a difference in viewpoints. I'm inclined to give most of the RAM to PostgreSQL, since as you point out, the kernel is out of our control. That way, we can do what we like with it - keep it or not, as we choose.

That's always been a Bad Idea for three or four different reasons, of which ARC will eliminate no more than one.

regards, tom lane
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Markus Schaber) transmitted:

> We should create a list of those needs, and then communicate those to the kernel/fs developers. Then we (as well as other apps) can make use of those features where they are available, and use the old way everywhere else.

Which kernel/fs developers did you have in mind? The ones working on Linux? Or FreeBSD? Or DragonflyBSD? Or Solaris? Or AIX?

Please keep in mind that many of the PostgreSQL developers are BSD folk that aren't particularly interested in creating bleeding edge Linux capabilities.

Furthermore, I'd think long and hard before jumping into such a _spectacularly_ bleeding edge kind of project. The reason why you would want this would be if you needed to get some margin of performance. I can't see wanting that without also wanting some _assurance_ of system reliability, at which point I also want things like vendor support.

If you've ever contacted Red Hat Software, you'd know that they very nearly refuse to provide support for any filesystem other than ext3. Use anything else and they'll make noises about not being able to assure you of anything at all.

If you need high performance, you'd also want to use interesting sorts of hardware. Disk arrays, RAID controllers, that sort of thing. Vendors of such things don't particularly want to talk to you unless you're using a supported Linux distribution and a supported filesystem.

Jumping into a customized filesystem that neither hardware nor software vendors would remotely consider supporting just doesn't look like a viable strategy to me.

> Maybe Reiser4 is a step in the right direction, and maybe even a postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already have such capabilities. Maybe that's completely wrong.

The capabilities tend to be redundant. They tend to implement vaguely similar transactional capabilities to what databases have to implement. The similarities are not close enough to eliminate either variety of commit as redundant.

--
cbbrowne,@,linuxfinances.info
http://linuxfinances.info/info/linux.html
Rules of the Evil Overlord #128. I will not employ robots as agents of destruction if there is any possible way that they can be re-programmed or if their battery packs are externally mounted and easily removable. http://www.eviloverlord.com/
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
After a long battle with technology, [EMAIL PROTECTED] (Simon Riggs), an earthling, wrote:

> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
> > Another thing that would be valuable would be to have some way to say: "Read this data; don't bother throwing other data out of the cache to stuff this in." Something like a read_uncached() call...
> >
> > That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.
>
> ARC does almost exactly those two things in 8.0. Seq scans do get put in cache, but in a way that means they don't spoil the main bulk of the cache.

We're not talking about the same cache.

ARC does these exact things for _shared memory_ cache, and is the obvious inspiration.

But it does more or less nothing about the way OS file buffer cache is managed, and the handling of _that_ would be the point of modifying OS filesystem semantics.

--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://www3.sympatico.ca/cbbrowne/oses.html
Have you ever considered beating yourself with a cluestick?
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Fri, 2004-11-05 at 06:20, Steinar H. Gunderson wrote:

> You mean, like, open(filename, O_DIRECT)? :-)

This disables readahead (at least on Linux), which is certainly not what we want: for the very case where we don't want to keep the data in cache for a while (sequential scans, VACUUM), we also want aggressive readahead.

-Neil
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, 2004-11-04 at 23:29, Pierre-Frédéric Caillaud wrote:

> There is also the fact that syncing after every transaction could be changed to syncing every N transactions (N fixed or depending on the data size written by the transactions) which would be more efficient than the current behaviour with a sleep.

Uh, which sleep are you referring to?

Also, how would interacting with the filesystem's journal affect how often we need to force-write the WAL to disk? (ISTM we need to sync _something_ to disk when a transaction commits in order to maintain the WAL invariant.)

> There's fadvise to tell the OS to readahead on a seq scan (I think the OS detects it anyway)

Not perfectly, though; also, Linux will do a more aggressive readahead if you tell it to do so via posix_fadvise().

> if there was a system call telling the OS "in the next seconds I'm going to read these chunks of data from this file (gives a list of offsets and lengths), could you put them in your cache in the most efficient order without seeking too much, so that when I read() them in random order, they will be in the cache already?".

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_WILLNEED
    Specifies that the application expects to access the specified data in the near future.

-Neil
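The "list of offsets and lengths" request above maps fairly directly onto one POSIX_FADV_WILLNEED call per range. Below is a minimal sketch using Python's `os.posix_fadvise`, which wraps the POSIX call on platforms that have it. `prefetch` is a hypothetical helper name, and sorting the ranges is an assumption of this example, not something the standard requires. The hint is purely advisory, so the kernel may ignore it.

```python
import os


def prefetch(fd, ranges):
    """Hint that each (offset, length) range of fd will be read soon.

    Returns the number of hints issued, or 0 on platforms where
    posix_fadvise is not available.
    """
    if not hasattr(os, "posix_fadvise"):
        return 0
    issued = 0
    # Issue the hints in ascending offset order; the kernel is still
    # free to schedule the actual reads however it likes.
    for offset, length in sorted(ranges):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        issued += 1
    return issued
```

A caller would issue `prefetch(fd, ranges)` first and then `read()` the chunks in whatever order the query needs them. Whether the pages are actually resident by then depends on the kernel and on timing.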
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Fri, 2004-11-05 at 02:47, Chris Browne wrote:

> Another thing that would be valuable would be to have some way to say: "Read this data; don't bother throwing other data out of the cache to stuff this in."

This is similar, although not exactly the same thing:

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_NOREUSE
    Specifies that the application expects to access the specified data once and then not reuse it thereafter.

-Neil
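As a rough illustration of how a one-pass scan might use the hint above, here is a sketch built on Python's `os.posix_fadvise`. `scan_once` is a hypothetical name. The extra POSIX_FADV_DONTNEED call after each chunk is an addition of this example, not something proposed in the thread: it is a different hint ("I am done with these pages") and is included only because how much effect NOREUSE has varies between kernels.

```python
import os


def scan_once(path, chunk_size=1 << 20):
    """Yield a file's contents front to back, hinting that pages are not reused."""
    advise = getattr(os, "posix_fadvise", None)  # None where unsupported
    fd = os.open(path, os.O_RDONLY)
    try:
        if advise:
            # offset 0 with length 0 means "from the start to end of file".
            advise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        offset = 0
        while True:
            chunk = os.pread(fd, chunk_size, offset)
            if not chunk:
                break
            yield chunk
            if advise:
                # Assumption of this sketch: tell the kernel the pages just
                # consumed may be dropped from its cache.
                advise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
```

Neither hint is a guarantee, and neither gives the "read without evicting anything" semantics asked for above; they only give the kernel a chance to do better than its default policy.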
[PERFORM] Anything to be gained from a 'Postgres Filesystem'?
I suppose I'm just idly wondering really. Clearly it's against PG philosophy to build an FS or direct IO management into PG, but now it's so relatively easy to plug filesystems into the main open-source OSes, it struck me that there might be some useful changes to, say, XFS or ext3, that could be made that would help PG out.

I'm thinking along the lines of an FS that's aware of PG's strategies and requirements and therefore optimised to make those activities as efficient as possible - possibly even being aware of PG's disk layout and treating files differently on that basis.

Not being an FS guru I'm not really clear on whether this would help much (enough to be worth it anyway) or not - any thoughts? And if there were useful gains to be had, would it need a whole new FS or could an existing one be modified?

So there might be (as I said, I'm not an FS guru...):

* great append performance for the WAL?
* optimised scattered writes for checkpointing?
* Knowledge that FSYNC is being used for preserving ordering a lot of the time, rather than requiring actual writes to disk (so long as the writes eventually happen in order...)?

Matt

Matt Clark
Ymogen Ltd
P: 0845 130 4531
W: https://ymogen.net/
M: 0774 870 1584
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Hiya,

Looking at that list, I got the feeling that you'd want to push that PG-awareness down into the block-io layer as well, then, so as to be able to optimise for (perhaps) conflicting goals depending on what the app does; for the IO system to be able to read the app's mind it needs to have some knowledge of what the app is / needs / wants, and I get the impression that this awareness needs to go deeper than the FS only.

--Tim

(But you might have time to rewrite Linux/BSD as a PG-OS? just kidding!)

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Matt Clark
Sent: Thursday, October 21, 2004 9:58 AM
To: [EMAIL PROTECTED]
Subject: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

> So there might be (as I said, I'm not an FS guru...):
>
> * great append performance for the WAL?
> * optimised scattered writes for checkpointing?
> * Knowledge that FSYNC is being used for preserving ordering a lot of the time, rather than requiring actual writes to disk (so long as the writes eventually happen in order...)?
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Reiser4?

On Thu, 21 Oct 2004 08:58:01 +0100, Matt Clark [EMAIL PROTECTED] wrote:

> And if there were useful gains to be had, would it need a whole new FS or could an existing one be modified?
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:

> I suppose I'm just idly wondering really. Clearly it's against PG philosophy to build an FS or direct IO management into PG, but now it's so relatively easy to plug filesystems into the main open-source OSes, it struck me that there might be some useful changes to, say, XFS or ext3, that could be made that would help PG out.

This really sounds like a poor replacement for just making PostgreSQL use raw devices to me. (I have no idea why that isn't done already, but presumably it isn't all that easy to get right. :-) )

/* Steinar */
--
Homepage: http://www.sesse.net/
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Hi,

I guess the difference is in 'severe hacking inside PG' vs. 'some unknown amount of hacking that doesn't touch PG code'.

Hacking PG internally to handle raw devices will meet with strong resistance from large portions of the development team. I don't expect (m)any core devs of PG will be excited about rewriting the entire I/O architecture of PG and duplicating large amounts of OS type of code inside the application, just to try to attain an unknown performance benefit.

PG doesn't use one big file, as some databases do, but many small files. Now PG would need to be able to do file-management, if you put the PG database on a raw disk partition! That's icky stuff, and you'll find much resistance against putting such code inside PG.

So why not try to have the external FS know a bit about PG and its directory-layout, and its IO requirements? Then such type of code can at least be maintained outside the application, and will not be as much of a burden to the rest of the application.

(I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but it's certainly gonna be easier than getting FS code inside of PG)

cheers,

--Tim

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Steinar H. Gunderson
Sent: Thursday, October 21, 2004 12:27 PM
To: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?

> This really sounds like a poor replacement for just making PostgreSQL use raw devices to me. (I have no idea why that isn't done already, but presumably it isn't all that easy to get right. :-) )
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
The intuitive thing would be to put pg into a file system.

/Aaron

On Thu, 21 Oct 2004 12:44:10 +0200, Leeuw van der, Tim [EMAIL PROTECTED] wrote:

> (I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but it's certainly gonna be easier than getting FS code inside of PG)

--
Regards, /Aaron
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Matt Clark wrote:

> I'm thinking along the lines of an FS that's aware of PG's strategies and requirements and therefore optimised to make those activities as efficient as possible - possibly even being aware of PG's disk layout and treating files differently on that basis.

As someone else noted, this doesn't belong in the filesystem (rather the kernel's block I/O layer/buffer cache). But I agree, an API by which we can tell the kernel what kind of I/O behavior to expect would be good. The kernel needs to provide good behavior for a wide range of applications, but the DBMS can take advantage of a lot of domain-specific information. In theory, being able to pass that domain-specific information on to the kernel would mean we could get better performance without needing to reimplement large chunks of functionality that really ought to be done by the kernel anyway (as implementing raw I/O would require, for example). On the other hand, it would probably mean adding a fair bit of OS-specific hackery, which we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is posix_fadvise(). While that is technically-speaking a POSIX standard, it is not widely implemented (I know Linux 2.6 implements it; based on some quick googling, it looks like AIX does too). Using posix_fadvise() has been discussed in the past, so you might want to search the archives. We could use FADV_SEQUENTIAL to request more aggressive readahead on a file that we know we're about to sequentially scan. We might be able to use FADV_NOREUSE on the WAL. We might be able to get away with specifying FADV_RANDOM for indexes all of the time, or at least most of the time. One question is how this would interact with concurrent access (AFAICS there is no way to fetch the current advice on an fd...)

Also, I would imagine Win32 provides some means to inform the kernel about your expected I/O pattern, but I haven't checked. Does anyone know of any other relevant APIs?

-Neil
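The three uses suggested above (FADV_SEQUENTIAL for sequential scans, FADV_RANDOM for indexes, FADV_NOREUSE for the WAL) could be wired up along these lines. This is a sketch only: the pattern names, `ADVICE_BY_PATTERN` and `advise_for` are all hypothetical, and it uses Python's `os.posix_fadvise` wrapper rather than the C call for brevity.

```python
import os

# Hypothetical mapping from a backend's access pattern to the advice
# constants mentioned in the message above.
ADVICE_BY_PATTERN = {
    "seq_scan": "POSIX_FADV_SEQUENTIAL",  # ask for more aggressive readahead
    "index_scan": "POSIX_FADV_RANDOM",    # readahead is unlikely to help
    "wal": "POSIX_FADV_NOREUSE",          # data is touched once
}


def advise_for(fd, pattern):
    """Apply the advice for pattern to the whole file behind fd.

    Returns the name of the advice applied, or None on platforms
    without posix_fadvise. Raises KeyError for an unknown pattern.
    """
    name = ADVICE_BY_PATTERN[pattern]
    if not hasattr(os, "posix_fadvise"):
        return None
    # offset 0 with length 0 covers the file from start to end.
    os.posix_fadvise(fd, 0, 0, getattr(os, name))
    return name
```

On the concurrent-access question: since there is no way to read the current advice back, one possible approach is for each scan to open its own descriptor and set the advice it wants on that. Whether that isolates scans from each other depends on how the platform tracks the advice, which this sketch does not attempt to detect.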
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, Oct 21, 2004 at 12:44:10PM +0200, Leeuw van der, Tim wrote:

> Hacking PG internally to handle raw devices will meet with strong resistance from large portions of the development team. I don't expect (m)any core devs of PG will be excited about rewriting the entire I/O architecture of PG and duplicating large amounts of OS type of code inside the application, just to try to attain an unknown performance benefit.

Well, at least I see people claiming 30% difference between different file systems, but no, I'm not shouting "bah, you'd better do this or I'll warez Oracle" :-) I have no idea how much you can improve over the best filesystems out there, but having two layers of journalling (both WAL _and_ FS journalling) on top of each other doesn't make all that much sense to me. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Steinar H. Gunderson [EMAIL PROTECTED] writes:

> ... I have no idea how much you can improve over the best filesystems out there, but having two layers of journalling (both WAL _and_ FS journalling) on top of each other doesn't make all that much sense to me.

Which is why setting the FS to journal metadata but not file contents is often suggested as best practice for a PG-only filesystem.

regards, tom lane
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Neil Conway wrote:

> Also, I would imagine Win32 provides some means to inform the kernel about your expected I/O pattern, but I haven't checked. Does anyone know of any other relevant APIs?

See CreateFile, parameter dwFlagsAndAttributes:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createfile.asp

There is FILE_FLAG_NO_BUFFERING, FILE_FLAG_OPEN_NO_RECALL, FILE_FLAG_RANDOM_ACCESS and even FILE_FLAG_POSIX_SEMANTICS.

Jan
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
On Thu, Oct 21, 2004 at 10:20:55AM -0400, Tom Lane wrote:

> > ... I have no idea how much you can improve over the best filesystems out there, but having two layers of journalling (both WAL _and_ FS journalling) on top of each other doesn't make all that much sense to me.
>
> Which is why setting the FS to journal metadata but not file contents is often suggested as best practice for a PG-only filesystem.

Mm, but you still journal the metadata. Oh well, noatime etc.. :-)

By the way, I'm probably hitting a FAQ here, but would O_DIRECT help PostgreSQL any, given large enough shared_buffers?

/* Steinar */
--
Homepage: http://www.sesse.net/
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
> As someone else noted, this doesn't belong in the filesystem (rather the kernel's block I/O layer/buffer cache). But I agree, an API by which we can tell the kernel what kind of I/O behavior to expect would be good.
>
> [snip]
>
> The closest API to what you're describing that I'm aware of is posix_fadvise(). While that is technically-speaking a POSIX standard, it is not widely implemented (I know Linux 2.6 implements it; based on some quick googling, it looks like AIX does too).

Don't forget about the useful and widely implemented madvise(2)/posix_madvise(2) calls, which can give the OS the following hints: MADV_NORMAL, MADV_SEQUENTIAL, MADV_RANDOM, MADV_WILLNEED, MADV_DONTNEED, and MADV_FREE. :)

-sc

--
Sean Chittenden
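For comparison with the posix_fadvise() examples, here is what the madvise-style hint looks like on a memory-mapped file. It is a sketch in Python, whose `mmap` objects expose `madvise()` from version 3.8 on platforms that support it. `sum_mapped` is a hypothetical helper. Note that madvise only applies to mapped memory, so it is relevant to a database only to the extent that it maps its files instead of using read()/write().

```python
import mmap


def sum_mapped(path):
    """Sum the bytes of a non-empty file through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Hint that the mapping will be walked front to back. Both
            # checks guard against older Pythons and other platforms.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for off in range(0, len(m), mmap.PAGESIZE):
                total += sum(m[off:off + mmap.PAGESIZE])
            return total
```

The same pattern would apply to the other hints in the list, for example MADV_RANDOM for an index-like access pattern, where the corresponding constant exists on the platform.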
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Note that most people are now moving away from raw devices for databases in most applications. The relatively small performance gain isn't worth the hassles.

On Thu, Oct 21, 2004 at 12:27:27PM +0200, Steinar H. Gunderson wrote:

> This really sounds like a poor replacement for just making PostgreSQL use raw devices to me. (I have no idea why that isn't done already, but presumably it isn't all that easy to get right. :-) )

--
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"