Re: [HACKERS] double-buffering page writes
ITAGAKI Takahiro wrote:
> I have some comments about the double-buffering:

Since posting this patch I have realized that this implementation is
bogus. I'm now playing with WAL-logging hint bits though.

As to your questions:

> - Is there any performance degradation because of the additional memcpy?
>   8kB of memcpy seems not to be free.

Of course, it is not free. However, it comes with the benefit that we can
release the io_in_progress lock earlier for the block -- we lock, copy,
unlock; whereas the old code did lock, write(), unlock. Avoiding a system
call in the locked section could be a win. Whether this is a net benefit
is something that I have not measured.

> - Is it ok to allocate dblbuf[BLCKSZ] as a local variable? It might be
>   unaligned. AFAICS we avoid such usages in other places.

I thought about that too. I admit I am not sure whether this really works
portably; however, I don't want to add a palloc() to that routine.

> - It would be best if we could delay double-buffering until locks
>   actually conflict. But we might need to allocate shadow buffers from
>   shared buffers instead of local memory.

The point of double-buffering is that the potential writer (a process
doing concurrent hint-bit setting) is not going to grab any locks.

> - Are there any other modules that could share in the benefits of
>   double-buffering? For example, we could avoid waiting for
>   LockBufferForCleanup(). It would be cool if the double-buffering
>   could be used for multiple purposes.

Not sure on this.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] double-buffering page writes
Alvaro Herrera wrote:
> ITAGAKI Takahiro wrote:
>> I have some comments about the double-buffering:
>
> Since posting this patch I have realized that this implementation is
> bogus. I'm now playing with WAL-logging hint bits though.

Yeah, the torn page + hint bit updates problem is the tough question.

>> - Is it ok to allocate dblbuf[BLCKSZ] as a local variable? It might be
>>   unaligned. AFAICS we avoid such usages in other places.
>
> I thought about that too. I admit I am not sure whether this really works
> portably; however, I don't want to add a palloc() to that routine.

It should work, AFAIK, but unaligned memcpy()s and write()s can be
significantly slower. There can be only one write() happening at any time,
so you could just palloc() a single 8k buffer in TopMemoryContext at
backend startup, and always use that.

>> - Are there any other modules that could share in the benefits of
>>   double-buffering? For example, we could avoid waiting for
>>   LockBufferForCleanup(). It would be cool if the double-buffering
>>   could be used for multiple purposes.
>
> Not sure on this.

You'd need to keep both versions of the buffer simultaneously in the
buffer cache for that.

When we talked about the various designs for HOT, that was one of the
ideas I had to enable more aggressive pruning: if you can't immediately
get a vacuum lock, allocate a new buffer in the buffer cache for the same
block, copy the page to the new buffer, and do the pruning there,
including moving tuples around. Any new ReadBuffer calls would return the
new page version, but old readers would keep referencing the old one.

The intrusive part of that approach, in addition to the obvious changes
required in the buffer manager to keep around multiple copies of the same
block, is that all modifications must be done on the new version, so
anyone who needs to lock the page for modification would need to switch
to the new page version at the LockBuffer call.

As discussed in the other thread with Simon, we also use vacuum locks in
b-tree for waiting out index scans, so avoiding the waiting there would
be just wrong.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] double-buffering page writes
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Alvaro Herrera wrote:
>> I thought about that too. I admit I am not sure whether this really works
>> portably; however, I don't want to add a palloc() to that routine.
>
> It should work, AFAIK, but unaligned memcpy()s and write()s can be
> significantly slower. There can be only one write() happening at any time,
> so you could just palloc() a single 8k buffer in TopMemoryContext at
> backend startup, and always use that.

Some time ago, we arranged for shared buffers to be aligned on *more* than
maxalign boundaries (cf BUFFERALIGN) because on at least some platforms
this makes a very significant difference in the speed of copying to/from
kernel space. If you are going to double-buffer, it is going to be
important to have the intermediate buffer just as well aligned. A local
char array won't be acceptable, and even for a palloc'd one you'll need to
take some extra steps (like wasting ALIGNOF_BUFFER extra bytes so you can
align the pointer palloc gives you).

			regards, tom lane
Re: [HACKERS] double-buffering page writes
Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> I'm trying to see if it makes sense to do the double-buffering of page
> writes before going further ahead with CRC checking. I came up with the
> attached patch; it does the double-buffering unconditionally, because,
> as it was said, it allows releasing the io_in_progress lock (and
> resetting BM_IO_IN_PROGRESS) early.

I have some comments about the double-buffering:

- Is there any performance degradation because of the additional memcpy?
  8kB of memcpy seems not to be free.

- Is it ok to allocate dblbuf[BLCKSZ] as a local variable? It might be
  unaligned. AFAICS we avoid such usages in other places.

- It would be best if we could delay double-buffering until locks
  actually conflict. But we might need to allocate shadow buffers from
  shared buffers instead of local memory.

- Are there any other modules that could share in the benefits of
  double-buffering? For example, we could avoid waiting for
  LockBufferForCleanup(). It would be cool if the double-buffering could
  be used for multiple purposes.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
[HACKERS] double-buffering page writes
Hi,

I'm trying to see if it makes sense to do the double-buffering of page
writes before going further ahead with CRC checking. I came up with the
attached patch; it does the double-buffering unconditionally, because, as
it was said, it allows releasing the io_in_progress lock (and resetting
BM_IO_IN_PROGRESS) early.

So far I have not managed to convince myself that this is a correct change
to make; the io_in_progress bits are pretty convoluted -- for example, I
wonder how releasing the buffer early (before actually sending the write
to the kernel and marking it not dirty) interacts with checkpoints and a
possible full-page image.

Basically the change is to take the unsetting of BM_DIRTY out of
TerminateBufferIO and into its own routine; and in FlushBuffer, release
io_in_progress just after copying the buffer contents elsewhere, and mark
the buffer not dirty after actually doing the write.

Thoughts?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.239
diff -c -p -r1.239 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	20 Oct 2008 21:11:15 -0000	1.239
--- src/backend/storage/buffer/bufmgr.c	21 Oct 2008 15:30:38 -0000
*************** static void BufferSync(int flags);
*** 84,91 ****
  static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
! 				  int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
  static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
  			BlockNumber blockNum,
--- 84,91 ----
  static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(volatile BufferDesc *buf, int set_flag_bits);
! static void MarkBufferNotDirty(volatile BufferDesc *buf);
  static void buffer_write_error_callback(void *arg);
  static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
  			BlockNumber blockNum,
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 395,401 ****
  	else
  	{
  		/* Set BM_VALID, terminate IO, and wake up any waiters */
! 		TerminateBufferIO(bufHdr, false, BM_VALID);
  	}
  
  	if (VacuumCostActive)
--- 395,401 ----
  	else
  	{
  		/* Set BM_VALID, terminate IO, and wake up any waiters */
! 		TerminateBufferIO(bufHdr, BM_VALID);
  	}
  
  	if (VacuumCostActive)
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1792,1797 ****
--- 1792,1798 ----
  {
  	XLogRecPtr	recptr;
  	ErrorContextCallback errcontext;
+ 	char		dblbuf[BLCKSZ];
  
  	/*
  	 * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1834,1856 ****
  	buf->flags &= ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
  
  	smgrwrite(reln, buf->tag.forkNum, buf->tag.blockNum,
! 			  (char *) BufHdrGetBlock(buf), false);
  
  	BufferFlushCount++;
  
  	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(reln->smgr_rnode.spcNode,
  		reln->smgr_rnode.dbNode, reln->smgr_rnode.relNode);
  
- 	/*
- 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
- 	 * end the io_in_progress state.
- 	 */
- 	TerminateBufferIO(buf, true, 0);
- 
  	/* Pop the error context stack */
  	error_context_stack = errcontext.previous;
  }
--- 1835,1863 ----
  	buf->flags &= ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
  
+ 	/*
+ 	 * We make a copy of the buffer to write.  This allows us to release the
+ 	 * io_in_progress lock early, before actually doing the write.
+ 	 */
+ 	memcpy(dblbuf, BufHdrGetBlock(buf), BLCKSZ);
+ 
+ 	/* End the io_in_progress state. */
+ 	TerminateBufferIO(buf, 0);
+ 
  	smgrwrite(reln, buf->tag.forkNum, buf->tag.blockNum,
! 			  (char *) dblbuf, false);
  
+ 	/* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) */
+ 	MarkBufferNotDirty(buf);
+ 
  	BufferFlushCount++;
  
  	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(reln->smgr_rnode.spcNode,
  		reln->smgr_rnode.dbNode, reln->smgr_rnode.relNode);
  
  	/* Pop the error context stack */
  	error_context_stack = errcontext.previous;
  }
*************** StartBufferIO(volatile BufferDesc *buf,
*** 2578,2595 ****
   *	We hold the buffer's io_in_progress lock
   *	The buffer is Pinned
   *
-  * If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
-  * buffer's BM_DIRTY flag.  This is appropriate when terminating a
-  * successful write.  The check on BM_JUST_DIRTIED is necessary to avoid
-  * marking the buffer clean if it was