Re: reiserfs databases.

2000-09-01 Thread Russell Coker

On Wed, 30 Aug 2000, Bulent Murtezaoglu wrote:
[...]
RC The idea is that the database vendor knows their data storage
RC better than the OS can guess it, and that knowledge allows
RC them to implement better caching algorithms than the OS can
RC use.  The fact that benchmark results show that raw partition
RC access is slower indicates that the databases aren't written
RC as well as they are supposed to be.

I am not convinced that this conclusion is warranted, though I admit I
have not seen those benchmarks.  The DB vendor's raw disk driver might

I have to admit that I have not seen the benchmarks either.  However one
reason that I believe the results are likely to be correct is the issue of
determining the cache size for the database.  If the database does raw access
then it must manage its own cache, and for the sake of sanity it must
mlock() the cache memory (having disk cache being swapped is stupid, and
doubly stupid when swap is slower than the database storage file system as is
often the case).  This means that the cache memory is not available for the
OS.  If the machine does nothing but database access then this is probably
OK, however such dedicated database servers are quite rare.
If we assume that every database server will be running tasks other than the
database itself (if only cron jobs that manage backups, tripwire, reporting,
etc) then you will be hit by two problems.  One is an idle database
mlock()ing all your memory so that active programs run very slowly.  The
other is the database being the only active program but being configured
not to use all the memory.  If the OS does the caching then it will
dynamically allocate the system memory to whichever process needs it.
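
To make the mlock() point concrete, here is a rough sketch of how a
self-managed cache might be pinned in RAM.  It is only an illustration, not
code from any real database: the helper names db_cache_alloc and
db_cache_free are made up, and a POSIX system with anonymous mmap() is
assumed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of page-aligned anonymous memory
 * and try to pin it in RAM.  Returns NULL if the mapping fails.  *locked is
 * set to 1 if mlock() succeeded and 0 otherwise (for example when
 * RLIMIT_MEMLOCK is too low for an unprivileged process). */
static void *db_cache_alloc(size_t size, int *locked)
{
    assert(size > 0 && locked != NULL);

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Once locked, these pages are unavailable to the rest of the system
     * whether or not the database is busy. */
    *locked = (mlock(buf, size) == 0);
    return buf;
}

/* Hypothetical helper: release the cache.  munmap() also drops any lock
 * held on the range. */
static int db_cache_free(void *buf, size_t size)
{
    return munmap(buf, size);
}
```

The size passed to db_cache_alloc() is fixed by configuration, which is
exactly the tuning problem described above: too large starves other
programs, too small wastes memory the database could have used.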

be doing things like synchronous writes for maintaining its own
invariants, while a [non-journalling] file system will care about fs
meta-data consistency at best.  While it is possible that the general

The journalling will make sure that the file system doesn't get trashed after
a crash.  The database can call fdatasync() to make sure that its own data
is correctly synchronised.  If there is a need to sync only part of a file
then you can memory map it and use msync() to synchronise one page while
leaving other data in the write-back cache.
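
A minimal sketch of the mmap()/msync() idea, assuming a POSIX system and a
file mapped with MAP_SHARED.  The helper names map_file and update_page_sync
are invented for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) `path`, size it to `npages` pages
 * and map it shared.  Returns NULL on failure; on success the descriptor
 * is stored in *fd_out. */
static char *map_file(const char *path, size_t npages, int *fd_out)
{
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return map;
}

/* Hypothetical helper: copy `len` bytes to the start of page number `page`
 * of the mapping, then synchronise only that page. */
static int update_page_sync(char *map, size_t page, const void *data, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0 && len <= (size_t)psz);

    char *addr = map + page * (size_t)psz;  /* page-aligned by construction */
    memcpy(addr, data, len);

    /* MS_SYNC blocks until this one page has been written out; dirty pages
     * elsewhere in the mapping stay in the write-back cache. */
    return msync(addr, (size_t)psz, MS_SYNC);
}
```

The address given to msync() has to be page-aligned, which is why the sketch
works in whole pages rather than arbitrary byte ranges.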

purpose file system with more man-hours behind it is better written,
the benchmarks might be omitting crucial criteria like crash
protection and such.  Do you guys have references to benchmarking
data?

If the database correctly calls fsync(), fdatasync(), and msync() at
appropriate times and the file system and OS correctly implement these system
calls then the crash protection should be as good as it is going to get.
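
As an illustration of "calling fdatasync() at appropriate times", here is a
sketch of a durable append, assuming a POSIX system that provides
fdatasync().  The helper names open_log and append_durable are hypothetical
and not from any particular database.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) an append-only log file. */
static int open_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Hypothetical helper: append one record and return 0 only after
 * fdatasync() reports that the file data has reached stable storage.
 * Returns -1 if the write or the sync fails. */
static int append_durable(int fd, const void *rec, size_t len)
{
    assert(rec != NULL);
    const char *p = rec;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }

    /* Ordinary writes before this point may sit in the OS write-back
     * cache; only the commit point pays for a synchronous flush. */
    return fdatasync(fd);
}
```

The database would report a transaction as committed only after
append_durable() returns 0.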

Also it should reduce the code paths in the database.  If the database is
writing everything synchronously (as it will want to do with a raw device)
then it will have to use its own write-back cache which will involve lots of
inter-process or inter-thread communication and other overhead.

-- 
My current location - X marks the spot.
X
X
X


--  
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]




Re: reiserfs databases.

2000-09-01 Thread Bulent Murtezaoglu

I'd like to thank Russell Coker for taking the time to spell his
thinking out in detail.  I now know more than I did five minutes 
ago!  

cheers,

BM




Re: reiserfs databases.

2000-08-30 Thread Bulent Murtezaoglu

[...]
RC The idea is that the database vendor knows their data storage
RC better than the OS can guess it, and that knowledge allows
RC them to implement better caching algorithms than the OS can
RC use.  The fact that benchmark results show that raw partition
RC access is slower indicates that the databases aren't written
RC as well as they are supposed to be.

I am not convinced that this conclusion is warranted, though I admit I
have not seen those benchmarks.  The DB vendor's raw disk driver might
be doing things like synchronous writes for maintaining its own
invariants, while a [non-journalling] file system will care about fs
meta-data consistency at best.  While it is possible that the general
purpose file system with more man-hours behind it is better written,
the benchmarks might be omitting crucial criteria like crash
protection and such.  Do you guys have references to benchmarking
data?

RC ... One of
RC which was someone who did tests with IBM's HPFS386 file system
RC for server versions of OS/2.  He tried using 2M of cache with
RC HPFS386 and 16M of physical cache in a caching hard drive
RC controller and using 18M of HPFS386 cache with no cache on the
RC controller.  The results were surprisingly close on real-world
RC tests such as compiling large projects.  It seemed that 2M of
RC cache was enough to cache directory entries and other
RC file-system meta-data and cache apart from that worked on a
RC LRU basis anyway.

This I would buy; as you point out, the controller and the FS code
are doing the same thing (if they are giving the same write guarantees).

BM






Re: reiserfs databases.

2000-08-30 Thread Russell Coker
On Wed, 30 Aug 2000, Nathan E Norman wrote:
On Tue, Aug 29, 2000 at 04:36:23PM +0200, Dariush Pietrzak wrote:
 but,  there are some commercial databases which keep their data directly
 on partitions ( this should be much better than any *fs including
 reiserfs) and the weird part is that that direct-partition installation
 scheme seems to be a little bit slower than fs-based in benchmarks.
 And this means that I'm missing something here, what is it that I haven't
 thought about, anyone, any comments on this?

If I understand your question, you're saying that RDBMSs do benchmark
faster using a native filesystem rather than rolling their own on
a partition, and you're wondering why ... I would have to hazard a
guess that the operating system disk cache and buffers are coming
into play when you're using a native filesystem, but there's no
caching when a raw partition is used.

The idea is that the database vendor knows their data storage better than the
OS can guess it, and that knowledge allows them to implement better caching
algorithms than the OS can use.
The fact that benchmark results show that raw partition access is slower
indicates that the databases aren't written as well as they are supposed to
be.

The concept of the database being able to cache better than the OS sounds
reasonable, but seems to not work in practise.  I have seen other examples of
similar principles.  One of which was someone who did tests with IBM's
HPFS386 file system for server versions of OS/2.  He tried using 2M of cache
with HPFS386 and 16M of physical cache in a caching hard drive controller and
using 18M of HPFS386 cache with no cache on the controller.  The results were
surprisingly close on real-world tests such as compiling large projects.  It
seemed that 2M of cache was enough to cache directory entries and other
file-system meta-data and cache apart from that worked on a LRU basis anyway.


Russell Coker




Re: reiserfs databases.

2000-08-30 Thread Dariush Pietrzak

to sum things up 
 - my idea to use reiserfs as database placeholder ain't that stupid.
 - modern fs's do a better job than commercial database designers
well, actually I'm using postgresql which can't use raw
partitions anyway.

thanks for the response.




reiserfs databases.

2000-08-29 Thread Dariush Pietrzak

AFAIK reiserfs is about keeping files (blocks) in b-trees,
and DBMS keep their data in a bunch of files, which are accessed directly
(non-sequential access).
So I figured that reiserfs would be great for keeping DBMS's data on it.

but,  there are some commercial databases which keep their data directly
on partitions ( this should be much better than any *fs including
reiserfs) and the weird part is that that direct-partition installation
scheme seems to be a little bit slower than fs-based in benchmarks.
And this means that I'm missing something here, what is it that I haven't
thought about, anyone, any comments on this?

regards, Eyck







Re: reiserfs databases.

2000-08-29 Thread Nathan E Norman

On Tue, Aug 29, 2000 at 04:36:23PM +0200, Dariush Pietrzak wrote:
 but,  there are some commercial databases which keep their data directly
 on partitions ( this should be much better than any *fs including
 reiserfs) and the weird part is that that direct-partition installation
 scheme seems to be a little bit slower than fs-based in benchmarks.
 And this means that I'm missing something here, what is it that I haven't
 thought about, anyone, any comments on this?

If I understand your question, you're saying that RDBMSs do benchmark
faster using a native filesystem rather than rolling their own on
a partition, and you're wondering why ... I would have to hazard a
guess that the operating system disk cache and buffers are coming
into play when you're using a native filesystem, but there's no
caching when a "raw" partition is used.

-- 
 "Eschew Obfuscation"
 email:[EMAIL PROTECTED]  http://incanus.net/~nnorman


