Re: reiserfs & databases.
I'd like to thank Russell Coker for taking the time to spell his thinking out in detail. I now know more than I did five minutes ago! cheers, BM
Re: reiserfs & databases.
On Wed, 30 Aug 2000, Bulent Murtezaoglu wrote:
>[...]
>RC> The idea is that the database vendor knows their data storage
>RC> better than the OS can guess it, and that knowledge allows
>RC> them to implement better caching algorithms than the OS can
>RC> use.  The fact that benchmark results show that raw partition
>RC> access is slower indicates that the databases aren't written
>RC> as well as they are supposed to be.
>
>I am not convinced that this conclusion is warranted, though I admit I
>have not seen those benchmarks.  The DB vendor's raw disk driver might

I have to admit that I have not seen the benchmarks either. However, one reason I believe the results are likely to be correct is the problem of determining the cache size for the database.

If the database does raw access then it must manage its own cache, and for the sake of sanity it must mlock() the cache memory (having disk cache swapped out is stupid, and doubly stupid when swap is slower than the file system holding the database, as is often the case). This means the cache memory is not available to the OS. If the machine does nothing but database work then this is probably OK, but such dedicated database servers are quite rare. If we assume that every database server will run tasks other than the database itself (if only cron jobs that manage backups, tripwire, reporting, etc.) then you are hit by two problems: one is an idle database mlock()ing all your memory so that active programs run very slowly; the other is the database being the only active program but being configured not to use all the memory. If the OS does the caching, it dynamically allocates system memory to whichever process needs it.

>be doing things like synchronous writes for maintaining its own
>invariants, while a [non-journalling] file system will care about fs
>meta-data consistency at best.  While it is possible that the general

Journalling will make sure that the file system doesn't get trashed by a crash. The database can call fdatasync() to make sure its own data is correctly synchronised. If there is a need to sync only part of a file, you can memory-map it and use msync() to synchronise one page while leaving other data in the write-back cache.

>purpose file system with more man-hours behind it is better written,
>the benchmarks might be omitting crucial criteria like crash
>protection and such.  Do you guys have references to benchmarking
>data?

If the database correctly calls fsync(), fdatasync(), and msync() at the appropriate times, and the file system and OS correctly implement these system calls, then the crash protection should be as good as it is going to get. It should also reduce the code paths in the database: if the database writes everything synchronously (as it will want to do with a raw device) then it has to implement its own write-back cache, which involves a lot of inter-process or inter-thread communication and other overhead.

--
My current location - X marks the spot.
X
X
X
Re: reiserfs & databases.
[...]
RC> The idea is that the database vendor knows their data storage
RC> better than the OS can guess it, and that knowledge allows
RC> them to implement better caching algorithms than the OS can
RC> use.  The fact that benchmark results show that raw partition
RC> access is slower indicates that the databases aren't written
RC> as well as they are supposed to be.

I am not convinced that this conclusion is warranted, though I admit I have not seen those benchmarks. The DB vendor's raw disk driver might be doing things like synchronous writes for maintaining its own invariants, while a [non-journalling] file system will care about fs meta-data consistency at best. While it is possible that the general-purpose file system with more man-hours behind it is better written, the benchmarks might be omitting crucial criteria like crash protection and such. Do you guys have references to benchmarking data?

RC> ... One of
RC> which was someone who did tests with IBM's HPFS386 file system
RC> for server versions of OS/2.  He tried using 2M of cache with
RC> HPFS386 and 16M of physical cache in a caching hard drive
RC> controller, and using 18M of HPFS386 cache with no cache on the
RC> controller.  The results were surprisingly close on real-world
RC> tests such as compiling large projects.  It seemed that 2M of
RC> cache was enough to cache directory entries and other
RC> file-system meta-data, and cache apart from that worked on an
RC> LRU basis anyway.

This I would buy; as you point out, the controller and the FS code are doing the same thing (if they are giving the same write guarantees).

BM
Re: reiserfs & databases.
To sum things up:
- my idea to use reiserfs as a database store ain't that stupid.
- modern fs's do a better job than the commercial database designers expected.

Well, actually I'm using PostgreSQL, which can't use raw partitions anyway. Thanks for the responses.
Re: reiserfs & databases.
On Wed, 30 Aug 2000, Nathan E Norman wrote:
>On Tue, Aug 29, 2000 at 04:36:23PM +0200, Dariush Pietrzak wrote:
>> but, there are some commercial databases which keep their data directly
>> on partitions (this should be much better than any *fs, including
>> reiserfs), and the weird part is that the direct-partition installation
>> scheme seems to be a little bit slower than the fs-based one in benchmarks.
>> And this means that I'm missing something here; what is it that I haven't
>> thought about?  Anyone, any comments on this?
>
>If I understand your question, you're saying that RDBMSs do benchmark
>faster using a native filesystem rather than rolling their own on
>a partition, and you're wondering why ... I would have to hazard a
>guess that the operating system disk cache and buffers are coming
>into play when you're using a native filesystem, but there's no
>caching when a "raw" partition is used.

The idea is that the database vendor knows their data storage better than the OS can guess it, and that knowledge allows them to implement better caching algorithms than the OS can use. The fact that benchmark results show that raw partition access is slower indicates that the databases aren't written as well as they are supposed to be.

The concept of the database being able to cache better than the OS sounds reasonable, but seems not to work in practice. I have seen other examples of similar principles. One of them was someone who did tests with IBM's HPFS386 file system for server versions of OS/2. He tried using 2M of cache with HPFS386 and 16M of physical cache in a caching hard drive controller, and using 18M of HPFS386 cache with no cache on the controller. The results were surprisingly close on real-world tests such as compiling large projects. It seemed that 2M of cache was enough to hold directory entries and other file-system meta-data, and cache apart from that worked on an LRU basis anyway.

Russell Coker
Re: reiserfs & databases.
On Tue, Aug 29, 2000 at 04:36:23PM +0200, Dariush Pietrzak wrote:
> but, there are some commercial databases which keep their data directly
> on partitions (this should be much better than any *fs, including
> reiserfs), and the weird part is that the direct-partition installation
> scheme seems to be a little bit slower than the fs-based one in benchmarks.
> And this means that I'm missing something here; what is it that I haven't
> thought about?  Anyone, any comments on this?

If I understand your question, you're saying that RDBMSs do benchmark faster using a native filesystem rather than rolling their own on a partition, and you're wondering why ... I would have to hazard a guess that the operating system disk cache and buffers are coming into play when you're using a native filesystem, but there's no caching when a "raw" partition is used.

--
"Eschew Obfuscation"
email: [EMAIL PROTECTED]  http://incanus.net/~nnorman