Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-11 Thread Mark Wong
Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
I'm not sure if I'm showing anything insightful though.  I've learned
that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
Linux, which you guys may have already known.  There appears to be a
fair performance decrease in using open_sync.  Just to double check, am
I correct in understanding only open_sync uses O_DIRECT?

fdatasync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
5462 notpm

open_sync
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
4860 notpm

Mark

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-11 Thread Bruce Momjian
Mark Wong wrote:
 Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
 I'm not sure if I'm showing anything insightful though.  I've learned
 that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
 Linux, which you guys may have already known.  There appears to be a

That is not what we thought for Linux, but many other OS's behave that
way.

 fair performance decrease in using open_sync.  Just to double check, am
 I correct in understanding only open_sync uses O_DIRECT?

Right.

 fdatasync
 http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/
 5462 notpm
 
 open_sync
 http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/
 4860 notpm

Right now open_sync is our last choice, which seems to still be valid
for Linux, at least.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-09 Thread Bruce Momjian
Mark Wong wrote:
 O_DIRECT + fsync() can make sense.  It avoids the copying of data
 to the page cache before being written and will also guarantee
 that the file's metadata is also written to disk.  It also
 prevents the page cache from filling up with write data that
 will never be read (I assume it is only read if a recovery
 is necessary - which should be rare).  It can also
 help disks with write back cache when using a journaling
 file system that uses i/o barriers.  You would want to use
 large writes, since the kernel page cache won't be writing
 multiple pages for you.

Right, but it seems O_DIRECT is pretty much the same as O_DIRECT with
O_DSYNC because the data is always written to disk on write().  Our
logic is that there is nothing for fdatasync to do in most cases after
using O_DIRECT, so the O_DIRECT/fdatasync() combination doesn't make
sense.

And FreeBSD, and perhaps others, need O_SYNC or fdatasync with O_DIRECT
because O_DIRECT doesn't force stuff to disk in all cases.

 I need to look at the kernel code more to comment on O_DIRECT with
 O_SYNC.
 
 Questions:
 
 Does the database transaction logger preallocate the log file?

Yes.

 Does the logger care about the order in which each write hits the disk?

Not really.
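
Since the log file is preallocated, O_DIRECT writes to it should never land in a hole, which is the case where Linux falls back to buffered i/o. As a rough illustration only, the following sketch preallocates a file by writing zeroed blocks and fsync()ing it. It is not PostgreSQL's code; the function name and SKETCH_BLOCK_SIZE are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_BLOCK_SIZE 8192      /* hypothetical block size for this sketch */

/*
 * Create `path` and fill it with `nblocks` zeroed blocks so that the file
 * contains no holes.  Returns 0 on success, -1 on failure.
 */
static int
preallocate_log_file(const char *path, int nblocks)
{
    char        zeros[SKETCH_BLOCK_SIZE];
    int         fd;
    int         i;

    assert(path != NULL);
    memset(zeros, 0, sizeof(zeros));

    /* O_EXCL: refuse to overwrite an existing file */
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
        {
            close(fd);
            unlink(path);
            return -1;
        }
    }

    /* Force the allocation to disk before the file is used for logging. */
    if (fsync(fd) != 0)
    {
        close(fd);
        unlink(path);
        return -1;
    }
    return close(fd);
}
```

A caller would run this once per log file before opening it for logging, so that later writes only overwrite blocks that already exist on disk.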



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-08-06 Thread Mark Wong
Here are comments that Daniel McNeil made earlier, which I've neglected
to forward until now.  I've cc'ed him and Mark Havercamp, whom some of
you got to meet the other day.

Mark

-

With O_DIRECT on Linux, when the write() returns the i/o has been
transferred to the disk.  

Normally, this i/o will be DMAed directly from user-space to the
device.  The current exception is when doing an O_DIRECT write to a
hole in a file.  (If a program does a truncate() or lseek()/write()
that makes a file larger, the file system does not allocate space
between the old end of file and the new end of file.)  An O_DIRECT
write to a hole like this requires the file system to allocate space,
but there is a race condition between the O_DIRECT write doing the
allocate and then write to initialize the newly allocated data and
any other process that attempts a buffered (page cache) read of the
same area in the file -- it was possible for the read to return data from
the allocated region before the O_DIRECT write().  The fix in Linux
is for the O_DIRECT write() to fall back to buffered i/o to do
the write() and flush the data from the page cache to the disk.

A write() with O_DIRECT only means the data has been transferred to
the disk.   Depending on the file system and mount options, it does
not mean the meta data for the file has been written to disk (see
fsync man page).  Fsync() will guarantee the data and metadata have
been written to disk.

Lastly, if a disk has a write back cache, an O_DIRECT write() does not
guarantee that the disk has put the data on the physical media.
I think some of the journal file systems now support i/o barriers
on commit which will flush the disk write back cache.  (I'm still
looking at the kernel code to see how this is done).

Conclusion:

O_DIRECT + fsync() can make sense.  It avoids the copying of data
to the page cache before being written and will also guarantee
that the file's metadata is also written to disk.  It also
prevents the page cache from filling up with write data that
will never be read (I assume it is only read if a recovery
is necessary - which should be rare).  It can also
help disks with write back cache when using a journaling
file system that uses i/o barriers.  You would want to use
large writes, since the kernel page cache won't be writing
multiple pages for you.
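
To make the O_DIRECT + fsync() combination concrete, here is a minimal sketch, assuming a Linux-like system: one large, aligned write() through an O_DIRECT descriptor, followed by fsync() to cover the metadata. The function name and SKETCH_ALIGN are invented for the example, the 4096-byte alignment is an assumption (the real requirement depends on the file system and device), and the code quietly falls back to a plain open() where O_DIRECT is rejected.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform lacks O_DIRECT: plain writes */
#endif

#define SKETCH_ALIGN 4096       /* assumed buffer/length alignment */

/*
 * Write `nbytes` copies of the byte `fill` to `path` with a single large
 * write(), then fsync().  `nbytes` must be a multiple of SKETCH_ALIGN.
 * Returns the number of bytes written, or -1 on failure.
 */
static ssize_t
direct_write_and_fsync(const char *path, size_t nbytes, int fill)
{
    void       *buf = NULL;
    ssize_t     written;
    int         fd;

    assert(nbytes > 0 && nbytes % SKETCH_ALIGN == 0);

    /* O_DIRECT generally needs an aligned user-space buffer. */
    if (posix_memalign(&buf, SKETCH_ALIGN, nbytes) != 0)
        return -1;
    memset(buf, fill, nbytes);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* fs rejects O_DIRECT */
    if (fd < 0)
    {
        free(buf);
        return -1;
    }

    /* One large write instead of many page-sized ones. */
    written = write(fd, buf, nbytes);

    /* fsync() covers file metadata, which O_DIRECT alone does not. */
    if (written == (ssize_t) nbytes && fsync(fd) != 0)
        written = -1;

    close(fd);
    free(buf);
    return written;
}
```

Note that a write to a freshly created file extends it, so on Linux this particular call would take the buffered fallback path described above; against a preallocated file the same call would go directly to the device.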

I need to look at the kernel code more to comment on O_DIRECT with
O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-28 Thread Bruce Momjian

Patch applied.  Thanks.

---


ITAGAKI Takahiro wrote:
 Thanks for reviewing!
 But the patch does not work on HEAD, because of the changes in 
 BootStrapXLOG().
 I send the patch with a fix for it.
 
 
 Bruce Momjian pgman@candle.pha.pa.us wrote:
 
  If you are doing fsync(), I don't see how O_DIRECT
  makes any sense because O_DIRECT is writing to disk on every write, and
  then what is the fsync() actually doing.
 
 It depends on the OS. The Linux manpage says,
   http://linux.com.hk/PenguinWeb/manpage.jsp?name=open&section=2
 File I/O is done directly to/from user space buffers. The I/O is
 synchronous, i.e., at the completion of the read(2) or write(2) system
 call, data is **guaranteed to have been transferred**.
 But the FreeBSD manpage says,
   http://www.manpages.info/freebsd/open.2.html
 O_DIRECT may be used to minimize or eliminate the cache effects of
 reading and writing.  The system will attempt to avoid caching the data you
 read or write.  If it cannot avoid caching the data,
 it will **minimize the impact the data has on the cache**.
 
 In my understanding, the completion of write() with O_DIRECT does not always
 assure an actual write. So there may be a difference between O_DIRECT+O_SYNC
 and O_DIRECT+fsync(), but I think that does not happen very often.
 
 
  What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
  and O_DSYNC, so it is automatically used in those cases.  And of course,
  if your operating system doesn't support O_DIRECT, it isn't used.
 
 I agree with your way, where O_DIRECT is automatically used. 
 I bet the combination of O_DIRECT and O_SYNC is always better than
 using O_SYNC alone.
 
 ---
 ITAGAKI Takahiro
 NTT Cyber Space Laboratories
 

[ Attachment, skipping... ]

 


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-27 Thread Bruce Momjian
ITAGAKI Takahiro wrote:
 Thanks for reviewing!
 But the patch does not work on HEAD, because of the changes in 
 BootStrapXLOG().
 I send the patch with a fix for it.

Thanks.

  If you are doing fsync(), I don't see how O_DIRECT
  makes any sense because O_DIRECT is writing to disk on every write, and
  then what is the fsync() actually doing.
 
 It depends on the OS. The Linux manpage says,
   http://linux.com.hk/PenguinWeb/manpage.jsp?name=open&section=2
 File I/O is done directly to/from user space buffers. The I/O is
 synchronous, i.e., at the completion of the read(2) or write(2) system
 call, data is **guaranteed to have been transferred**.
 But the FreeBSD manpage says,
   http://www.manpages.info/freebsd/open.2.html
 O_DIRECT may be used to minimize or eliminate the cache effects of
 reading and writing.  The system will attempt to avoid caching the data you
 read or write.  If it cannot avoid caching the data,
 it will **minimize the impact the data has on the cache**.
 
 In my understanding, the completion of write() with O_DIRECT does not always
 assure an actual write. So there may be a difference between O_DIRECT+O_SYNC
 and O_DIRECT+fsync(), but I think that does not happen very often.

Yes, I do remember that.  I know we _need_ fsync when using O_DIRECT,
but the downside of O_DIRECT (force every write to disk) is the same as
O_SYNC, so it seems if we are using O_DIRECT, we might as well use
O_SYNC too and skip the fsync().

I will add a comment mentioning this.

  What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
  and O_DSYNC, so it is automatically used in those cases.  And of course,
  if your operating system doesn't support O_DIRECT, it isn't used.
 
 I agree with your way, where O_DIRECT is automatically used. 
 I bet the combination of O_DIRECT and O_SYNC is always better than
 using O_SYNC alone.

OK.



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-23 Thread Bruce Momjian

I have modified and attached your patch for your review.  I didn't see
any value in adding new fsync_method values because, to me, O_DIRECT is
basically just like O_SYNC except it doesn't keep a copy of the buffer
in the kernel cache.  If you are doing fsync(), I don't see how O_DIRECT
makes any sense because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing?  This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync?

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases.  And of course,
if your operating system doesn't support O_DIRECT, it isn't used.
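
One way to read "used automatically where available" is a compile-time wrapper that collapses to 0 on platforms without O_DIRECT, so OR-ing it into the sync flags is harmless. The sketch below only illustrates that idea; the SKETCH_* names and the helper function are hypothetical and are not the macros from the attached patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>

/* Hypothetical names for illustration; not the patch's own macros. */
#ifdef O_DIRECT
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0       /* OR-ing in 0 is a no-op */
#endif

#if defined(O_SYNC)
#define SKETCH_OPEN_SYNC_FLAG (O_SYNC | SKETCH_O_DIRECT)
#endif

#if defined(O_DSYNC)
#define SKETCH_OPEN_DATASYNC_FLAG (O_DSYNC | SKETCH_O_DIRECT)
#endif

/*
 * Return the open() flags for a log file.  With use_open_sync set, the
 * sync flag (plus O_DIRECT where the platform has it) is added; otherwise
 * the caller is expected to fsync()/fdatasync() explicitly and no
 * O_DIRECT is requested.
 */
static int
sketch_wal_open_flags(int use_open_sync)
{
    int         flags = O_RDWR;

#ifdef SKETCH_OPEN_SYNC_FLAG
    if (use_open_sync)
        flags |= SKETCH_OPEN_SYNC_FLAG;
#endif
    return flags;
}
```

With this shape, callers never test for O_DIRECT themselves; the fsync() and fdatasync() methods keep using plain flags.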

With your posted performance numbers, perhaps we should favor
fsync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DATASYNC.  Perhaps some folks can run tests once the patch is
applied.

---

ITAGAKI Takahiro wrote:
 Tom Lane [EMAIL PROTECTED] wrote:
 
  Yeah, this is about what I was afraid of: if you're actually fsyncing
  then you get at best one commit per disk revolution, and the negotiation
  with the OS is down in the noise.
 
 If we disable writeback-cache and use open_sync, the per-page writing
 behavior in the WAL module will show up as a bad result. O_DIRECT is similar
 to O_DSYNC (at least on Linux), so its benefit will disappear
 behind the slow disk revolution.
 
 In the current source, WAL is written as:
     for (i = 0; i < N; i++) { write(buffers[i], BLCKSZ); }
 Is this intentional? Can we rewrite it as follows?
     write(buffers[0], N * BLCKSZ);
 
 In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
 Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
 These two patches are independent, so either or both can be applied.
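
 The difference between the two write patterns can be sketched as below, assuming the N pages sit contiguously in memory. This is an illustration only, not the code in xlog.gw.diff; the function names and SKETCH_BLCKSZ are invented, and a real implementation would also have to handle short writes and wrap-around of the buffer ring.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* hypothetical page size for this sketch */

/*
 * Per-page pattern: one write() call per page.
 * Returns the number of write() calls issued, or -1 on failure.
 */
static int
write_per_page(int fd, const char *buffers, int npages)
{
    int         i;

    assert(buffers != NULL);
    for (i = 0; i < npages; i++)
    {
        const char *page = buffers + (size_t) i * SKETCH_BLCKSZ;

        if (write(fd, page, SKETCH_BLCKSZ) != (ssize_t) SKETCH_BLCKSZ)
            return -1;
    }
    return npages;
}

/*
 * Gathered pattern: the same pages in a single write() call.
 * Returns the number of write() calls issued (1), or -1 on failure.
 */
static int
write_gathered(int fd, const char *buffers, int npages)
{
    size_t      total = (size_t) npages * SKETCH_BLCKSZ;

    assert(buffers != NULL);
    if (write(fd, buffers, total) != (ssize_t) total)
        return -1;
    return 1;
}
```

 With open_sync or O_DIRECT each write() waits for the disk, so cutting N calls down to one is where the gather-write patch would be expected to help.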
 
 
 I tested them on my machine and the results are as follows. They show that
 direct-io and gather-write together are the best choice when writeback-cache is off.
 Are these two patches worth trying if they are used together?
 
 
              | writeback | fsync= | fdata | open_ | fsync_ | open_
 patch        | cache     |  false |  sync |  sync | direct | direct
 -------------+-----------+--------+-------+-------+--------+--------
 direct io    | off       |  124.2 | 105.7 |  48.3 |   48.3 |   48.2
 direct io    | on        |  129.1 | 112.3 | 114.1 |  142.9 |  144.5
 gather-write | off       |  124.3 | 108.7 | 105.4 |  (N/A) |  (N/A)
 both         | off       |  131.5 | 115.5 | 114.4 |  145.4 |  145.2
 
 - 20 runs * pgbench -s 100 -c 50 -t 200
   - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
 - using 2 ATA disks:
   - hda(reiserfs) includes system and wal.
   - hdc(jfs) includes database files. writeback-cache is always on.
 
 ---
 ITAGAKI Takahiro
 NTT Cyber Space Laboratories
 

[ Attachment, skipping... ]

[ Attachment, skipping... ]

 

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
diff -c -c -r1.210 xlog.c
*** src/backend/access/transam/xlog.c   23 Jul 2005 15:31:16 -  1.210
--- src/backend/access/transam/xlog.c   23 Jul 2005 16:09:12 -
***************
*** 48,77 ****
  
  
  /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
   * default method.  We assume that fsync() is always available, and that
   * configure determined whether fdatasync() is.
   */
  #if defined(O_SYNC)
! #define OPEN_SYNC_FLAG    O_SYNC
  #else
  #if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG    O_FSYNC
  #endif
  #endif
  
  #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG    O_DSYNC
  #endif
  #else /* !defined(OPEN_SYNC_FLAG) */
  /* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG    O_DSYNC
  #endif
  #endif
  
  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR   "open_datasync"
  #define DEFAULT_SYNC_METHOD   SYNC_METHOD_OPEN
--- 48,114 ----
  
  
  /*
+  *  Because O_DIRECT bypasses the kernel buffers, and because we never
+  *read those buffers except during