On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>
>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>
>>> operation. Now why OS couldn't find the corresponding block in
>>> memory is that, while closing the WAL file, we use
>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>>> lead to this problem. So with this experiment, the conclusion is that
>>> though we can avoid re-write of WAL data by doing exact writes, but
>>> it could lead to significant reduction in TPS.
>>>
>>
>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>> robin fashion. In a properly configured system, where the reason for a
>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>> written to had been closed and not touched for about a complete
>> checkpoint_timeout or longer. You must have a really big amount of spare
>> RAM in the machine to still find those blocks in memory. Basically we
>> are talking about the active portion of your database, shared buffers,
>> the sum of all process local memory and the complete pg_xlog directory
>> content fitting into RAM.
>>

I think that could only be a problem if the reads were happening at the
write or fsync call, but that is not the case here. Further investigation
on this point reveals that the reads are not for the fsync operation;
rather, they happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
This behaviour (writing in non-OS-page-cache-size chunks can lead to reads
if followed by a call to posix_fadvise(,,POSIX_FADV_DONTNEED)) is not very
clearly documented. The reason for it is that the fadvise() call maps the
specified data range (in our case the whole file) into a list of pages and
then invalidates them, which removes them from the OS cache; any writes
done while writing/fsyncing the file that are misaligned with respect to
the OS page size can then cause additional reads, because not everything we
wrote will end on an OS-page boundary. This theory is based on the code of
fadvise [1] and some googling [2], which suggest that misaligned writes
followed by POSIX_FADV_DONTNEED can cause this kind of problem. A colleague
of mine, Dilip Kumar, has verified it as well by writing a simple
open/write/fsync/fadvise/close program.

> But that's only going to matter when the segment is newly recycled. My
> impression from Amit's email is that the OS was repeatedly reading even in
> the same segment?

As explained above, the reads only happen during file close.

> Either way, I would think it wouldn't be hard to work around this by
> spewing out a bunch of zeros to the OS in advance of where we actually need
> to write, preventing the need for reading back from disk.

I think we can simply prohibit setting wal_chunk_size to a value other than
the OS page size or XLOG_BLCKSZ (whichever is lesser) if wal_level is less
than 'archive'. This avoids the extra reads for misaligned writes, because
we won't call fadvise(). We could even choose to always write on an OS-page
or XLOG_BLCKSZ boundary (whichever is lesser); since in many cases the OS
page size is 4K, that can also save significant re-writes.

> Amit, did you do performance testing with archiving enabled and a no-op
> archive_command?

No, but what kind of advantage are you expecting from such tests?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com