On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation. Now why OS couldn't find the corresponding block in
>>>> memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>>>> lead to this problem. So with this experiment, the conclusion is that
>>>> though we can avoid re-write of WAL data by doing exact writes, but
>>>> it could lead to significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>>> robin fashion. In a properly configured system, where the reason for a
>>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>>> written to had been closed and not touched for about a complete
>>> checkpoint_timeout or longer. You must have a really big amount of spare
>>> RAM in the machine to still find those blocks in memory. Basically we
>>> are talking about the active portion of your database, shared buffers,
>>> the sum of all process local memory and the complete pg_xlog directory
>>> content fitting into RAM.
>
>
> I think that could only be problem if reads were happening at write or
> fsync call, but that is not the case here. Further investigation on this
> point reveals that the reads are not for fsync operation, rather they
> happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks could
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, but the
> reason for the same is that fadvise() call maps the specified data range
> (which in our case is whole file) into the list of pages and then invalidate
> them which will further lead to removing them from OS cache, now any
> misaligned (w.r.t OS page-size) writes done during writing/fsyncing to file
> could cause additional reads as everything written by us will not be on
> OS-page-boundary.
>
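To make the effect easier to see, below is a small standalone test program
(not PostgreSQL code; the file name and sizes are made up) that fills a file
with aligned 8K writes, drops it from the OS cache with POSIX_FADV_DONTNEED,
and then rewrites part of it in 512-byte chunks. While the rewrite loop runs,
iostat should show reads against the device, because the first chunk touching
an uncached 4K page typically makes the kernel read that page back in before
applying the write.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE	(16 * 1024 * 1024)	/* pretend WAL segment */
#define INIT_CHUNK	8192				/* aligned fill, like XLOG_BLCKSZ */
#define WRITE_CHUNK	512					/* misaligned w.r.t. 4K OS pages */

int
main(void)
{
	char		zeros[INIT_CHUNK];
	char		data[WRITE_CHUNK];
	off_t		off;
	int			fd;

	memset(zeros, 0, sizeof(zeros));
	memset(data, 'x', sizeof(data));

	fd = open("walseg.demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	/* Fully allocate the segment with aligned writes, as at segment init. */
	for (off = 0; off < SEG_SIZE; off += INIT_CHUNK)
		if (pwrite(fd, zeros, INIT_CHUNK, off) != INIT_CHUNK)
		{
			perror("pwrite");
			exit(1);
		}
	fsync(fd);

	/* Drop the file from the OS cache, as we do for wal_level < archive. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	/*
	 * Rewrite the first 1MB in 512-byte chunks.  The first chunk touching
	 * each uncached 4K page typically forces the kernel to read the page
	 * back in (read-modify-write), even though we overwrite all of it
	 * shortly afterwards.
	 */
	for (off = 0; off < 1024 * 1024; off += WRITE_CHUNK)
		if (pwrite(fd, data, WRITE_CHUNK, off) != WRITE_CHUNK)
		{
			perror("pwrite");
			exit(1);
		}

	fsync(fd);
	close(fd);
	return 0;
}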
On further testing, it has been observed that misaligned writes can cause
reads even when the blocks related to the file are not in memory, so I think
what Jan is describing is right. The only case with absolutely zero chance of
reads is when we write on OS-page boundaries, which are generally 4K.

However, I still think it is okay to provide an option to write WAL in
smaller chunks (512 bytes, 1024 bytes, etc.) for the cases where that is
beneficial, such as when wal_level is greater than or equal to 'archive', and
to keep the default at the OS page size when it is smaller than 8K.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
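P.S. To make the chunk-size idea a bit more concrete, here is a rough,
untested sketch of how the write size could be picked and rounded. The names
(wal_write_chunk, choose_wal_write_chunk, round_up_to_chunk) are made up for
illustration; this is not an actual patch.

#include <stdio.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192

/* Hypothetical setting: 0 means "pick a default at startup". */
static long wal_write_chunk = 0;

/*
 * Default to the OS page size when it is smaller than 8K, otherwise fall
 * back to writing full 8K blocks as today.
 */
static long
choose_wal_write_chunk(void)
{
	long		pagesize = sysconf(_SC_PAGESIZE);

	if (wal_write_chunk > 0)
		return wal_write_chunk; /* user asked for 512, 1024, ... */
	if (pagesize > 0 && pagesize < XLOG_BLCKSZ)
		return pagesize;		/* usually 4K */
	return XLOG_BLCKSZ;
}

/* Round the pending WAL bytes up to the chunk boundary before writing. */
static long
round_up_to_chunk(long nbytes, long chunk)
{
	return ((nbytes + chunk - 1) / chunk) * chunk;
}

int
main(void)
{
	long		chunk = choose_wal_write_chunk();

	printf("chunk = %ld, 300 pending bytes -> write %ld bytes\n",
		   chunk, round_up_to_chunk(300, chunk));
	return 0;
}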