Re: [HACKERS] WAL fsync scheduling
Added to TODO.detail and TODO list.

There are two parts to transaction commit. The first is writing all dirty buffers or log changes to the kernel, and second is fsync of the log file.

Backend doesn't write any dirty buffer to the kernel at commit time.

Yes, I suspected that.

The first part is writing the commit record into WAL buffers in shmem. This is what XLogInsert does. After that XLogFlush is called to ensure that the entire commit record is on disk. XLogFlush does *both* write() and fsync() (single slock is used for both writing and fsyncing) if it needs to do it at all.

Yes, I realize there are new steps in WAL.

I suggest having a per-backend shared memory byte that has the following values:

    START_LOG_WRITE
    WAIT_ON_FSYNC
    NOT_IN_COMMIT
    backend_number_doing_fsync

I suggest that when each backend starts a commit, it sets its byte to START_LOG_WRITE.

^^^ Isn't START_COMMIT more meaningful?

Yes.

When it gets ready to fsync, it checks all backends.

^^ What do you mean by this? The moment just after XLogInsert?

Just before it calls fsync().

If all are NOT_IN_COMMIT, it does fsync and continues.

1st edition: If one or more are in START_LOG_WRITE, it waits until no one is in START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the lowest backend in WAIT_ON_FSYNC, marks all others with its backend number, and does fsync. It then clears all backends with its number to NOT_IN_COMMIT. Other backends will see they are not the lowest WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT so they can then continue, knowing their data was synced.

2nd edition: I have another idea. If a backend gets to the point that it needs fsync, and there is another backend in START_LOG_WRITE, it can go to an interruptible sleep, knowing another backend will perform the fsync and wake it up. Therefore, there is no busy-wait or timed sleep. Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a race condition.
The 2nd edition is much better. But I'm not sure we really need these per-backend bytes in shmem. Why not just have some counters? We can use a semaphore to wake up all waiters at once.

Yes, that is much better and clearer. My idea was just to say, "if no one is entering commit phase, do the commit. If someone else is coming, sleep and wait for them to do the fsync and wake me up with a signal."

This allows a single backend not to sleep, and allows multiple backends to bunch up only when they are all about to commit.

The reason backend numbers are written is so other backends entering the commit code will not interfere with the backends performing fsync.

A woken-up backend can check what's written/fsynced by calling XLogFlush.

Seems that may not be needed anymore with a counter. The only issue is that other backends may enter commit while fsync() is happening. The process that did the fsync must be sure to wake up only the backends that were waiting for it, and not other backends that may also be doing fsync as a group while the first fsync was happening.

I leave those details to people more experienced. :-)

I am just glad people liked my idea.

--
Bruce Momjian                      |  http://candle.pha.pa.us
                                   |  (610) 853-3000
+ If your life is a hard drive,    |  830 Blythe Avenue
+ Christ can be your backup.       |  Drexel Hill, Pennsylvania 19026
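
The counter idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the PostgreSQL tree: backends are modelled as threads, a pthread mutex and condition variable stand in for the shared-memory lock and the semaphore/signal wake-up, and every name (GroupCommit, gc_write, gc_flush, fake_fsync) is hypothetical. It also simplifies the proposal: a backend sleeps only while another flush is already running, rather than whenever someone else is about to commit. A woken backend rechecks the flushed counter before returning, which is one way to handle backends that entered commit while an earlier fsync was still running.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared state; none of these names come from PostgreSQL. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  flushed_cv;        /* stands in for the semaphore/signal */
    uint64_t        written_upto;      /* highest log position written */
    uint64_t        flushed_upto;      /* highest log position known synced */
    bool            flush_in_progress; /* some backend is inside fsync */
    unsigned        fsync_calls;       /* instrumentation for the sketch */
} GroupCommit;

void gc_init(GroupCommit *gc)
{
    pthread_mutex_init(&gc->lock, NULL);
    pthread_cond_init(&gc->flushed_cv, NULL);
    gc->written_upto = 0;
    gc->flushed_upto = 0;
    gc->flush_in_progress = false;
    gc->fsync_calls = 0;
}

/* Placeholder for the real fsync() of the log file. */
static void fake_fsync(void) { }

/* "Write" a commit record and return its log position. */
uint64_t gc_write(GroupCommit *gc)
{
    pthread_mutex_lock(&gc->lock);
    uint64_t pos = ++gc->written_upto;
    pthread_mutex_unlock(&gc->lock);
    return pos;
}

/* Return once everything up to my_pos has been synced. */
void gc_flush(GroupCommit *gc, uint64_t my_pos)
{
    pthread_mutex_lock(&gc->lock);
    assert(my_pos <= gc->written_upto);
    while (gc->flushed_upto < my_pos) {
        if (gc->flush_in_progress) {
            /* Someone else is syncing: sleep, then recheck the counter. */
            pthread_cond_wait(&gc->flushed_cv, &gc->lock);
            continue;
        }
        /* Become the syncer for everything written so far. */
        uint64_t target = gc->written_upto;
        gc->flush_in_progress = true;
        pthread_mutex_unlock(&gc->lock);
        fake_fsync();
        pthread_mutex_lock(&gc->lock);
        gc->fsync_calls++;
        gc->flushed_upto = target;
        gc->flush_in_progress = false;
        pthread_cond_broadcast(&gc->flushed_cv);
    }
    pthread_mutex_unlock(&gc->lock);
}
```

A backend whose record was written after the syncer sampled written_upto simply finds flushed_upto still below its position when it wakes and loops again, so waking "too many" waiters costs a recheck but never reports unsynced data as synced.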
Re: [HACKERS] WAL fsync scheduling
* Tom Lane [001117 23:21]:

Bruce Momjian writes:

Other backend will see they are not the lowest WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT so they can then continue, knowing their data was synced.

How will they wait? Without a semaphore involved, your answer must be either "timed sleep" or "busy-wait loop", neither of which is attractive ...

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

regards, tom lane

--
Larry Rosenman    http://www.lerctr.org/~ler
Phone: +1 972-414-9812 (voice)
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] WAL fsync scheduling
* Tom Lane [001117 23:21]:

Bruce Momjian writes:

Other backend will see they are not the lowest WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT so they can then continue, knowing their data was synced.

How will they wait? Without a semaphore involved, your answer must be either "timed sleep" or "busy-wait loop", neither of which is attractive ...

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

Looks like a winner.

--
Bruce Momjian
Re: [HACKERS] WAL fsync scheduling
Bruce Momjian writes:

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

Looks like a winner.

sigpause() is a BSD-ism, and not part of any recognized standard according to my HP man pages. How portable do you think it is?

regards, tom lane
Re: [HACKERS] WAL fsync scheduling
Bruce Momjian writes:

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

The standard is sigsuspend:

OK, we can probably assume that at least one of sigsuspend or sigpause is available everywhere.

Now all you need is a free signal number. Unfortunately we're already using both SIGUSR1 and SIGUSR2.

regards, tom lane
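
For reference, a sleep-until-signalled wait built on POSIX sigsuspend() might look roughly like the sketch below. The helper names (on_wakeup, wait_setup, wait_for_wakeup) are made up for illustration, and the signal number is left as a parameter since the thread notes that SIGUSR1 and SIGUSR2 are both taken. The wake-up signal is kept blocked except while inside sigsuspend(), so a signal that arrives between testing the flag and going to sleep stays pending instead of being lost.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not PostgreSQL code. */
static volatile sig_atomic_t woken = 0;
static volatile sig_atomic_t wakeups_seen = 0;

/* Handler: only sets flags, which is safe inside a signal handler. */
static void on_wakeup(int signo)
{
    (void) signo;
    woken = 1;
    wakeups_seen++;
}

/* Install the handler and block the signal for normal execution.
 * Returns 0 on success, -1 on failure. */
int wait_setup(int signo)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_wakeup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signo);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Sleep until the wake-up signal has been delivered at least once. */
void wait_for_wakeup(int signo)
{
    sigset_t wait_mask;

    /* Fetch the current mask and unblock only the wake-up signal in it. */
    sigprocmask(SIG_BLOCK, NULL, &wait_mask);
    sigdelset(&wait_mask, signo);

    /* sigsuspend() swaps in wait_mask and sleeps as one atomic step,
     * then restores the old mask when it returns. */
    while (!woken)
        sigsuspend(&wait_mask);
    woken = 0;
}
```

The waker would send the chosen signal with kill(); a signal sent before the waiter reaches sigsuspend() stays pending and is delivered as soon as the mask is swapped.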
Re: [HACKERS] WAL fsync scheduling
Bruce Momjian writes:

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

The standard is sigsuspend:

OK, we can probably assume that at least one of sigsuspend or sigpause is available everywhere.

Now all you need is a free signal number. Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Oh, I didn't want to hear that one.

--
Bruce Momjian
Re: [HACKERS] WAL fsync scheduling
Peter Eisentraut writes:

Now all you need is a free signal number. Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Maybe you could dump the old meaning of SIGQUIT (externally invoked error), move quickdie() to SIGQUIT, and you've got SIGUSR1 free. (That would even make sense in two ways: 1) SIGQUIT would actually cause the guy to quit; 2) there is a correspondence between postmaster and postgres signals.)

Seems like a plan. The current definition of backend SIGQUIT is really stupid anyway --- what's the value of forcing an error asynchronously? Also, it always bothered me that the postmaster and backend signals weren't consistent, so I'd be inclined to make this change even if we end up not using SIGUSR1 for Bruce's idea ...

regards, tom lane
Re: [HACKERS] WAL fsync scheduling
Larry Rosenman writes:

how about sigpause, and using SIGUSR1/SIGUSR2 to wake them up?

Both of these signals are already used.

--
Peter Eisentraut    http://yi.org/peter-e/
Re: [HACKERS] WAL fsync scheduling
Tom Lane writes:

OK, we can probably assume that at least one of sigsuspend or sigpause is available everywhere.

#ifdef HAVE_POSIX_SIGNALS should tell you.

Now all you need is a free signal number. Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Maybe you could dump the old meaning of SIGQUIT (externally invoked error), move quickdie() to SIGQUIT, and you've got SIGUSR1 free. (That would even make sense in two ways: 1) SIGQUIT would actually cause the guy to quit; 2) there is a correspondence between postmaster and postgres signals.)

--
Peter Eisentraut    http://yi.org/peter-e/
Re: [HACKERS] WAL fsync scheduling
Tom Lane writes:

OK, we can probably assume that at least one of sigsuspend or sigpause is available everywhere.

#ifdef HAVE_POSIX_SIGNALS should tell you.

Now all you need is a free signal number. Unfortunately we're already using both SIGUSR1 and SIGUSR2.

Maybe you could dump the old meaning of SIGQUIT (externally invoked error), move quickdie() to SIGQUIT, and you've got SIGUSR1 free. (That would even make sense in two ways: 1) SIGQUIT would actually cause the guy to quit; 2) there is a correspondence between postmaster and postgres signals.)

Good idea. Of course, this assumes my idea was valid. Was it?

--
Bruce Momjian