Re: PostgreSQL 17 Release Management Team & Feature Freeze
On Mon, 8 Apr 2024 at 16:26, Robert Haas wrote:
> And maybe we need to think of a way to further mitigate this crush of
> last minute commits. e.g. In the last week, you can't have more
> feature commits, or more lines of insertions in your commits, than you
> did in the prior 3 weeks combined. I don't know. I think this mad rush
> of last-minute commits is bad for the project.

I think some part of this rush of commits could also be explained as a form of entrainment[1]. Only patches reasonably close to commit will get picked up with extra attention to get them ready before the deadline. After the release hammer drops, the pool of remaining patches will have few that are close to commit. And to make matters worse, the attention of working on them will be spread thinner. When repeated, this pattern can be self-reinforcing.

If this hypothesis is true, maybe some forces could be introduced to counteract this natural tendency. I don't have any bright ideas on how exactly yet.

Ants

[1] Emergent synchronization of interacting oscillators, see:
https://en.wikipedia.org/wiki/Injection_locking#Entrainment
https://en.wikipedia.org/wiki/Entrainment_(biomusicology)
Re: Popcount optimization using AVX512
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart wrote:
> Here is an updated patch set. IMHO this is in decent shape and is
> approaching committable.

I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and clang-5. The main issue I saw was that clang was able to peel off the first iteration of the loop and then eliminate the mask assignment and replace the masked load with a memory operand for vpopcnt. I was not able to convince gcc to do that regardless of optimization options.

Generated code for the inner loop:

clang:
  50: add      rdx, 64
  54: cmp      rdx, rdi
  57: jae
  59: vpopcntq zmm1, zmmword ptr [rdx]
  5f: vpaddq   zmm0, zmm1, zmm0
  65: jmp

gcc:
  38: kmovq    k1, rdx
  3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
  43: add      rax, 64
  47: mov      rdx, -1
  4e: vpopcntq zmm0, zmm0
  54: vpaddq   zmm0, zmm0, zmm1
  5a: vmovdqa64 zmm1, zmm0
  60: cmp      rax, rsi
  63: jb

I'm not sure how much that matters in practice. Attached is a patch to do this manually, giving essentially the same result in gcc. As most distro packages are built using gcc, I think it would make sense to have the extra code if it gives a noticeable benefit for large cases.

The visibility map patch has the same issue, otherwise looks good.

Regards,
Ants Aasma

diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dacc7553d29..f6e718b86e9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes)
 	 * Iterate through all but the final iteration. Starting from second
 	 * iteration, the start index mask is ignored.
 	 */
-	for (; buf < final; buf += sizeof(__m512i))
+	if (buf < final)
 	{
 		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
 		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
+		buf += sizeof(__m512i);
 		mask = ~UINT64CONST(0);
+
+		for (; buf < final; buf += sizeof(__m512i))
+		{
+			val = _mm512_load_si512((const __m512i *) buf);
+			cnt = _mm512_popcnt_epi64(val);
+			accum = _mm512_add_epi64(accum, cnt);
+		}
 	}
 
 	/* Final iteration needs to ignore bytes that are not within the length */
Re: Popcount optimization using AVX512
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart wrote:
> If we can verify this approach won't cause segfaults and can stomach the
> regression between 8 and 16 bytes, I'd happily pivot to this approach so
> that we can avoid the function call dance that I have in v25.

The approach I posted does not rely on masking to perform page fault suppression. All loads are 64-byte aligned and always contain at least one byte of the buffer, and therefore are guaranteed to be within a valid page.

I personally don't mind it being slower for the very small cases, because when performance on those sizes really matters it makes much more sense to shoot for an inlined version instead.

Speaking of which, what does bumping the inlined version threshold up to 16 do with and without AVX-512 available? Linearly extrapolating the 2- and 4-byte numbers, it might just come out ahead in both cases, making the choice easy.

Regards,
Ants Aasma
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising. IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions. I
> searched around a little bit but haven't found anything that seemed
> definitive.

After sleeping on the problem, I think we can avoid this question altogether while making the code faster by using aligned accesses. Loads that straddle cache line boundaries run internally as 2 load operations. Gut feel says that there are enough out-of-order resources available to make it not matter in most cases. But even so, not doing the extra work is surely better.

Attached is another approach that does aligned accesses, and thereby avoids going outside bounds. Would be interesting to see how well that fares in the small use case. Anything that fits into one aligned cache line should be constant speed, and there is only one branch, but the mask setup and folding the separate popcounts together should add up to about 20-ish cycles of overhead.
Regards,
Ants Aasma

diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
 	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int			tail_idx;
+	__mmask64	mask = -1;
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
-	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration
+	 * is not empty.
+	 */
+	mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
+
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	for (; buf < final; buf += sizeof(__m512i))
+	{
+		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
-		buf += sizeof(__m512i);
+
+		mask = -1;
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	/* Final iteration needs to ignore bytes that are not within the length */
+	mask &= ((~0ULL) >> (64 - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */
Re: Popcount optimization using AVX512
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising. IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions. I
> searched around a little bit but haven't found anything that seemed
> definitive.

Interestingly, the Intel software developer manual is not exactly crystal clear on how memory faults with masks work, but volume 2A chapter 2.8 [1] does specify that VMOVDQU8 is of exception class E4.nb, which supports memory fault suppression on page fault.

Regards,
Ants Aasma

[1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf
Re: Popcount optimization using AVX512
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote:
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +	popcnt = _mm512_reduce_add_epi64(accum);
> >> +	return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here? Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that. (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does. I did another benchmark on very small arrays and can see the
> overhead. This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
>   size (bytes)      HEAD    AVX512-POPCNT
>   1             1707.685         3480.424
>   2             1926.694         4606.182
>   4             3210.412         5284.506
>   8             1920.703         3640.968
>   16            2936.91          4045.586
>   32            3627.956         5538.418
>   64            5347.213         3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.

What about using the masking capabilities of AVX-512 to handle the tail in the same code path? Masked-out portions of a load instruction will not generate an exception. To allow byte level granularity masking, -mavx512bw is needed. Based on Wikipedia this will only disable this fast path on Knights Mill (Xeon Phi); in all other cases VPOPCNTQ implies availability of BW.

Attached is an example of what I mean. I did not have a machine to test it with, but the code generated looks sane. I added the clang pragma because it insisted on unrolling otherwise, and based on how the instruction dependencies look that is probably not too helpful even for large cases (needs to be tested). The configure check and compile flags of course need to be amended for BW.
Regards,
Ants Aasma

diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const __m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const __m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif							/* TRY_POPCNT_FAST */
Re: Infinite loop in XLogPageRead() on standby
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi wrote:
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part.)
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainty
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.

This sounds very similar to the problem described in [1]. And I think both will be resolved by that change.

[1] https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com
Re: Change GUC hashtable to use simplehash?
On Tue, 30 Jan 2024 at 12:04, John Naylor wrote:
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain. Since you didn't attach
> the test harness, would you like to run this and see how it fares?
> (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan
> to test myself as well, but since your test tries to model true
> latency, I'm more interested in that one.

It didn't calculate the same result because the if (mask) condition was incorrect. I changed it to if (chunk & 0xFF) and removed the right shift from the mask. It seems to be half a nanosecond faster, but as I don't have a machine set up for microbenchmarking it's quite close to measurement noise.

I didn't post the harness as it's currently too messy to be of use to others. But if you'd like to play around, I can tidy it up a bit and post it.

> > Not sure if the second one is worth the extra code.
>
> I'd say it's not worth optimizing the case we think won't be taken
> anyway. I also like having a simple path to assert against.

Agreed.

As an addendum, I couldn't resist trying out using 256-bit vectors with two parallel AES hashes running, with unaligned loads and special casing for loads that straddle a page boundary. Requires -march=x86-64-v3 -maes. About 20% faster than fasthash on short strings, 2.2x faster on 4k strings.
Right now it requires 4-byte alignment (uses vpmaskmovd), but could be made to work with any alignment.

Regards,
Ants Aasma

#include <stdint.h>
#include <immintrin.h>

#define PAGE_SIZE 0x1000

uint64_t
fast_vec_hash_cstring_avx2(char *buf)
{
    __m128i hash0 = {0, 0};
    __m128i hash1 = {0, 0};
    __m128i k0 = {0x0807060504030201, 0x100F0E0D0C0B0A09};
    __m128i k1 = {0x1117161514131211, 0x201F1E1D1C1B1A19};
    char *cur = buf;
    int mask;
    __m256i chunk;
    int offset = (uintptr_t) buf & (sizeof(chunk) - 1);
    int endpos;

    do
    {
        char *end_of_page = (char *) ((((uintptr_t) cur) | (PAGE_SIZE - 1)) + 1);

        for (; cur + sizeof(chunk) <= end_of_page; cur += sizeof(chunk))
        {
            chunk = _mm256_loadu_si256((__m256i *) cur);
            __m256i ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
        }

        if (offset)
        {
            __m256i load_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(offset / 4),
                                                   _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
            chunk = _mm256_maskload_epi32((const int *) cur, load_mask);
            __m256i ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
            chunk |= _mm256_maskload_epi32((const int *) cur, load_mask);
            ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
            cur += sizeof(chunk);
        }
    } while (1);

last_iteration:
    /* chunk contains data, mask contains location of end of line */
    endpos = _tzcnt_u32(mask);
    /* keep only the bytes before the terminator */
    chunk &= _mm256_cmpgt_epi8(_mm256_set1_epi8(endpos),
                               _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                                8, 9, 10, 11, 12, 13, 14, 15,
                                                16, 17, 18, 19, 20, 21, 22, 23,
                                                24, 25, 26, 27, 28, 29, 30, 31));
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
    hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, k1);
    hash1 = _mm_aesenc_si128(hash1, k0);
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    __m128i intermediate = hash1 ^ hash0;
    return intermediate[1] ^ intermediate[0];
}
Re: Change GUC hashtable to use simplehash?
On Sun, 21 Jan 2024 at 03:06, Jeff Davis wrote:
> Yes, thank you. I don't think we need to change the algorithm.

Jumping in here at a random point just to share my findings from poking around this on and off. I am concentrating here on cstring hashing as that is the most complicated one.

One thing that caught my eye in testing was that the unaligned cstring code was unexpectedly faster for short strings (3-18B uniform distribution). Looking into it, the cause was fasthash_accum() called in the final iteration. In the unaligned case the compiler (clang-15) unrolled the inner loop, which allowed it to jump directly into the correct place in the switch. In the aligned case clang decided to use a data-dependent jump, which then mispredicts all of the time.

But given that we know the data length and we have it in a register already, it's easy enough to just mask out data past the end with a shift. See patch 1. Performance benefit is about 1.5x. Measured on a small test harness that just hashes and finalizes an array of strings, with a data dependency between consecutive hashes (next address depends on the previous hash output).

The unaligned case can actually take advantage of the same trick as the aligned case, it just has to shuffle the data from two consecutive words before applying the combine function. Patch 2 implements this. It makes the unaligned case almost as fast as the aligned one, both on short and long strings. 10% benefit on short strings, 50% on long ones. Not sure if the second one is worth the extra code.

A different approach would be to use the simple word at a time hashing for the unaligned case too and handle word accesses that straddle a page boundary as a special case. Obviously this only makes sense for platforms that support unaligned access. On x86 unaligned access within a cache line is basically free, and across cache lines is only slightly more expensive.
On benchmarks, calling the aligned code on unaligned strings only has a 5% penalty on long strings; short ones are indistinguishable.

I also took a look at using SIMD for implementing the hash using the same aligned access + shuffle trick. The good news is that the shuffling works well enough that neither it nor checking for string end are the longest chain. The bad news is that the data load, alignment, zero finding and masking form a big dependency chain on the first iteration. Mixing and finalization is even worse: fasthash uses the 64-bit imul instruction that has a 3 cycle latency; the iteration-to-iteration chain is imul + xor, for 4 cycles, or 2 B/cycle (in practice a bit less due to ALU port contention). In SIMD registers there is no 64-bit multiply, and 32-bit multiply has a terrible 10 cycle latency on Intel. AES instructions are an interesting option, but it seems that 2 are needed for good enough mixing; at 4 cycles each, we again end up at 2 B/cycle. Finalization needs another 3 AES instructions, a shuffle and a xor fold to pass SMHasher, for 17 cycles. The mix latency issue could be worked around by doing more mixing in parallel, potentially up to 8x faster, but this does not help short strings at all and would make the code way bigger. SIMD code does use fewer instructions, so it interleaves better with nearby code that is not dependent on it; not sure if that matters anywhere.

The short version is that for very long (4k+) strings the attached SIMD code is 35% faster, for short strings it is 35% slower, and this is very much x86-64-v3 only and would need a fallback when AVX and AES-NI are not available. Basically a dead end for the use cases this hash function is used for.
Regards,
Ants Aasma

From 912f46be12536985dda7bcfb669d4ec13e79d073 Mon Sep 17 00:00:00 2001
From: Ants Aasma
Date: Mon, 29 Jan 2024 21:07:44 +0200
Subject: [PATCH 2/2] Unaligned fasthash word at a time hashing

About 10% performance benefit on short strings, 50% on long ones,
making the performance almost identical to the aligned case.
---
 src/include/common/hashfn_unstable.h | 156 +++
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8ee1b99a204..1e44814d84a 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -189,6 +189,38 @@ first_byte_nonzero(uint64 v)
 #endif
 }
 
+/*
+ * Selects first n bits in memory order and masks the rest with NUL.
+ * Using value 0 for n results in undefined behavior.
+ */
+static inline uint64
+first_n64(uint64 v, uint64 n)
+{
+	Assert(0 < n && n <= 64);
+#ifdef WORDS_BIGENDIAN
+	return v & ((~0ULL) << (64 - n));
+#else
+	return v & ((~0ULL) >> (64 - n));
+#endif
+}
+
+/*
+ * Does the equivalent of an unaligned word access into two consecutive
+ * words, taking the last 8 - offset bytes from first and adding first
+ * offset bytes from second word. offset must be in range [1..7]
+ */
+static inline uint64
+align_n64(uint64
Re: add AVX2 support to simd.h
On Tue, 9 Jan 2024 at 18:20, Nathan Bossart wrote:
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea. If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors. But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this. Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

For some operations (min, max, = any) processing the same elements multiple times doesn't change the result. So the vectors for the first and/or last iterations can overlap with the main loop. In other cases it's possible to mask out the invalid elements and replace them with zeroes.
Something along the lines of:

static inline Vector8
vector8_mask_right(int num_valid)
{
	__m256i		seq = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
									  23, 22, 21, 20, 19, 18, 17, 16,
									  15, 14, 13, 12, 11, 10, 9, 8,
									  7, 6, 5, 4, 3, 2, 1, 0);

	return _mm256_cmpgt_epi8(_mm256_set1_epi8(num_valid), seq);
}

/* final incomplete iteration */
Vector8 mask = vector8_mask_right(end - cur);
final_vec = vector8_and(*(Vector8 *) (end - sizeof(Vector8)), mask);
accum = vector8_add(accum, final_vec);

It helps that on any halfway recent x86, unaligned loads only have a minor performance penalty, and only when straddling cache line boundaries. Not sure what the state on ARM is. If we don't care about unaligned loads then we only need to care about the load not crossing page boundaries, which could cause segfaults. Though I'm sure memory sanitizer tools will have plenty to complain about around such hacks.
Re: add AVX2 support to simd.h
On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> >   writers    sse2    avx2      %
> >   256        1195    1188     -1
> >   512         928    1054    +14
> >   1024        633     716    +13
> >   2048        332     420    +27
> >   4096        162     203    +25
> >   8192        162     182    +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality. One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me. Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register. On most current processors the throughput of the instructions in question is the same on 256-bit vectors as on 128-bit vectors. Basically, the chip has AVX2's worth of machinery, and using SSE2 leaves half of it unused. Notable exceptions are efficiency cores on recent Intel desktop CPUs and AMD CPUs pre-Zen 2, where AVX2 instructions are internally split up into two 128-bit wide instructions.

For AVX-512 the picture is much more complicated. Some instructions run at half rate, some at full rate but not on all ALU ports, and some instructions cause aggressive clock rate reduction on some microarchitectures. AVX-512 adds mask registers and masked vector instructions that enable quite a bit simpler code in many cases. Interestingly, I have seen Clang make quite effective use of these masked instructions even when using AVX2 intrinsics but targeting an AVX-512 capable platform.

The vector width independent approach used in the patch is nice for simple cases by not needing a separate implementation for each vector width.
However, for more complicated cases where "horizontal" operations are needed, it's going to be much less useful. But those cases can easily just drop down to using intrinsics directly.
Re: autovectorize page checksum code included elsewhere
On Wed, 22 Nov 2023 at 11:44, John Naylor wrote:
> On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart wrote:
> >
> > Presently, we ask compilers to autovectorize checksum.c and numeric.c. The
> > page checksum code actually lives in checksum_impl.h, and checksum.c just
> > includes it. But checksum_impl.h is also used in pg_upgrade/file.c and
> > pg_checksums.c, and since we don't ask compilers to autovectorize those
> > files, the page checksum code may remain un-vectorized.
>
> Poking in those files a bit, I also see references to building with
> SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
> function call is surely worth it for page-sized data)

For reference, executing the page checksum 10M times on an AMD 3900X CPU:

clang-14 -O2                    4.292s (17.8 GiB/s)
clang-14 -O2 -msse4.1           2.859s (26.7 GiB/s)
clang-14 -O2 -msse4.1 -mavx2    1.378s (55.4 GiB/s)

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin wrote:
> > On 30 Oct 2023, at 09:20, Dilip Kumar wrote:
> >
> > changed the logic of SlruAdjustNSlots() in 0002, such that now it
> > starts with the next power of 2 value of the configured slots and
> > keeps doubling the number of banks until we reach the number of banks
> > to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> > SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too
> > many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks
> and counters were added in later patchsets.
> Having hashtable to find SLRU page in the buffer IMV is too slow. Some
> comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not
> degrade significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not
> find commit when it was changed to linear search.
>
> Maybe we could decouple locks and counters from SLRU banks? Banks were
> meant to be small to exploit performance of local linear search. Lock
> partitions have to be bigger for sure.

Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway; bank-wise locks will increase the number of locks by less than 10%.

I am working on trying out a SIMD based LRU mechanism that uses a 16 entry bank. The data layout is:

struct CacheBank
{
	int		page_numbers[16];
	char	access_age[16];
};

The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit a lwlock and the page_status and page_dirty arrays.

Lookup + LRU maintenance takes 20 instructions / 14 cycles of latency, and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days.

Regards,
Ants Aasma
Re: Lowering the default wal_blocksize to 4K
On Thu, 12 Oct 2023 at 16:36, Robert Haas wrote:
> On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro wrote:
> > That leaves only the segments where a record starts exactly on the
> > first usable byte of a segment, which is why I was trying to think of
> > a way to cover that case too. I suggested we could notice and insert
> > a new record at that place. But Andres suggests it would be too
> > expensive and not worth worrying about.
>
> Hmm. Even in that case, xl_prev has to match. It's not like it's the
> wild west. Sure, it's not nearly as good of a cross-check, but it's
> something. It seems to me that it's not worth worrying very much about
> xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
> you're doing that kind of advanced magic, you need to be careful
> enough to not mess it up, and if we still cross-check once per
> checkpoint cycle that's pretty good. I do worry a bit about the sysid
> changing under us, though. It's not that hard to get your WAL archives
> mixed up, and it'd be nice to catch that right away.

This reminds me that xlp_tli is not being used to its full potential right now either. We only check that it's not going backwards, but there is at least one not-very-hard-to-hit way to get postgres to silently replay on the wrong timeline. [1]

[1] https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: Disabling Heap-Only Tuples
On Fri, 7 Jul 2023 at 13:18, Tomas Vondra wrote: > On 7/7/23 11:55, Matthias van de Meent wrote: > > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar wrote: > >> > >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent > >> wrote: > >>> > >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown wrote: > >>>> > >>>> On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent > >>>> wrote: > >>>>> So what were you thinking of? A session GUC? A table option? > >>>> > >>>> Both. > >>> > >>> Here's a small patch implementing a new table option max_local_update > >>> (name very much bikesheddable). Value is -1 (default, disabled) or the > >>> size of the table in MiB that you still want to allow to update on the > >>> same page. I didn't yet go for a GUC as I think that has too little > >>> control on the impact on the system. > >> > >> So IIUC, this parameter we can control that instead of putting the new > >> version of the tuple on the same page, it should choose using > >> RelationGetBufferForTuple(), and that can reduce the fragmentation > >> because now if there is space then most of the updated tuple will be > >> inserted in same pages. But this still can not truncate the pages > >> from the heap right? because we can not guarantee that the new page > >> selected by RelationGetBufferForTuple() is not from the end of the > >> heap, and until we free the pages from the end of the heap, the vacuum > >> can not truncate any page. Is my understanding correct? > > > > Yes. If you don't have pages with (enough) free space for the updated > > tuples in your table, or if the FSM doesn't accurately reflect the > > actual state of free space in your table, this won't help (which is > > also the reason why I run vacuum in the tests). It also won't help if > > you don't update the tuples physically located at the end of your > > table, but in the targeted workload this would introduce a bias where > > new tuple versions are moved to the front of the table. 
> > > > Something to note is that this may result in very bad bloat when this > > is combined with a low fillfactor: All blocks past max_local_update > > will be unable to use space reserved by fillfactor because FSM lookups > > always take fillfactor into account, and all updates (which ignore > > fillfactor when local) would go through the FSM instead, thus reducing > > the space available on each block to exactly the fillfactor. So, this > > might need some extra code to make sure we don't accidentally blow up > > the table's size with UPDATEs when max_local_update is combined with > > low fillfactors. I'm not sure where that would fit best. > > > > I know the thread started as "let's disable HOT" and this essentially > just proposes to do that using a table option. But I wonder if that's > far too simple to be reliable, because hoping RelationGetBufferForTuple > happens to do the right thing does not seem great. > > I wonder if we should invent some definition of "strategy" that would > tell RelationGetBufferForTuple what it should aim for ... > > I'm imagining either a table option with a couple possible values > (default, non-hot, first-page, ...) or maybe something even more > elaborate (perhaps even a callback?). > > Now, it's not my intention to hijack this thread, but this discussion > reminds me one of the ideas from my "BRIN improvements" talk, about > maybe using BRIN indexes for routing. UPDATEs may be a major issue for > BRIN, making them gradually worse over time. If we could "tell" > RelationGetBufferForTuple() which buffers are more suitable (by looking > at an index, histogram or some approximate mapping), that might help. Just as another point in support of strategy based/extensible tuple placement, I would at some point try out placing INSERT ON CONFLICT tuples on the same page as the preceding key in the index. Use case is in tables with (series, timestamp) primary key to get locality of access range scanning for a single series. 
Placement will always be a tradeoff that is dependent on hardware and workload, and the effect can be pretty large. For the mentioned use case, if placement can maintain some semblance of clustering, there will be a 10-100x reduction in buffers accessed for a relatively minor increase in bloat. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Re: ReadRecentBuffer() doesn't scale well
On Tue, 27 Jun 2023 at 18:40, Andres Freund wrote: > On 2023-06-27 14:49:48 +0300, Ants Aasma wrote: > > If you want to experiment, here is a rebased version of something I > > hacked up a couple of years back on the way to Fosdem Pgday. I didn't > > pursue it further because I didn't have a use case where it showed a > > significant difference. > > Thanks for posting! > > Based on past experiments, anything that requires an atomic op during spinlock > release on x86 will be painful :/. I'm not sure there's a realistic way to > avoid that with futexes though :(. Do you happen to know if a plain xchg instruction counts as an atomic for this? I haven't done atomics stuff in a while, so I might be missing something, but at first glance I think using a plain xchg would be enough for the releasing side. -- Ants
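For context on the xchg question: on x86-64 an XCHG with a memory operand is implicitly locked, so it is itself a full atomic read-modify-write — the potential win is only that the releasing side needs a single such operation and no retry loop. A sketch with C11 atomics, using the same lock states as the futex experiment below (0 = free, 1 = locked, 2 = locked with waiters; names invented for the example):

```c
#include <stdatomic.h>

/* 0 = free, 1 = locked, 2 = locked with waiters queued in the kernel. */
static atomic_uint lock;

/* Releasing side: one exchange both frees the lock and reports whether
 * any waiter was queued, so the caller knows to issue a FUTEX_WAKE.
 * Compilers emit a single XCHG instruction for this on x86-64. */
static int
unlock_had_waiters(void)
{
    unsigned prev = atomic_exchange(&lock, 0);
    return prev == 2;
}
```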
Re: ReadRecentBuffer() doesn't scale well
On Tue, 27 Jun 2023 at 07:09, Andres Freund wrote: > On 2023-06-27 15:33:57 +1200, Thomas Munro wrote: > > On Tue, Jun 27, 2023 at 2:05 PM Andres Freund wrote: > > > Unfortunately it scaled way worse at first. This is not an inherent > > > issue, but > > > due to an implementation choice in ReadRecentBuffer(). Whereas the normal > > > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does > > > LockBufHdr(), checks if the buffer ID is the same and then uses > > > PinBuffer_Locked(). > > > > > > The problem with that is that PinBuffer() takes care to not hold the > > > buffer > > > header spinlock, it uses compare_exchange to atomically acquire the pin, > > > while > > > guaranteing nobody holds the lock. When holding the buffer header > > > spinlock, > > > there obviously is the risk of being scheduled out (or even just not have > > > exclusive access to the cacheline). > > > > Yeah. Aside from inherent nastiness of user-space spinlocks > > I've been wondering about making our backoff path use futexes, after some > adaptive spinning. If you want to experiment, here is a rebased version of something I hacked up a couple of years back on the way to Fosdem Pgday. I didn't pursue it further because I didn't have a use case where it showed a significant difference. 
--
Ants

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c2..67a5e8a0246 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -92,6 +92,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;

 	init_spin_delay(&delayStatus, file, line, func);
@@ -104,6 +105,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 	finish_spin_delay(&delayStatus);

 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }

 #ifdef USE_DEFAULT_S_UNLOCK
@@ -230,6 +233,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }

+#ifdef HAS_FUTEX
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+				   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int i, s;
+
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite likely
+	 * the lock holder is still running.
+	 */
+	if (likely(current < 2))
+	{
+		uint32 expected;
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+				return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2)) {
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 2))
+				return i;
+		}
+	}
+
+	/* At this point lock value is 2 and we will get waken up */
+	while (true)
+	{
+		uint32 expected = 0;
+		s = futex(&(lock->value), FUTEX_WAIT, 2, NULL, NULL, 0);
+		if (s == -1 && errno != EAGAIN)
+			elog(FATAL, "Futex wait failed with error: %m");
+
+		/* Maybe someone else was waiting too, we will try to wake them up. */
+		if (pg_atomic_compare_exchange_u32(lock, &expected, 2))
+			break;
+	}
+
+	return i;
+}
+
+int futex_unlock(volatile slock_t *lock, uint32 current)
+{
+	lock->value = 0;
+	if (futex(&(lock->value), FUTEX_WAKE, 1, NULL, NULL, 0) == -1)
+		elog(FATAL, "Futex wake failed with error: %m");
+
+	return 0;
+}
+
+#endif							/* HAS_FUTEX */

 /*****************************************************************************/
 #if defined(S_LOCK_TEST)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c9fa84cc43c..6351ec0804e 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,6 +205,52 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET

+#if defined(__linux__)
+#define HAS_FUTEX 1		/* TODO: move to configure to check for old kernels */
+#endif
+
+#ifdef HAS_FUTEX
+
+#include "port/atomics.h"
+
+typedef pg_atomic_uint32 slock_t;
+
+#define S_LOCK(lock) \
+	do { \
+		uint32 expected = 0; \
+		if (unlikely(!pg_atomic_compare_exchange_u32((lock), &expected, 1))) \
+			futex_lock((lock), expected, __FILE__, __LINE__, __func__); \
+	} while (0)
+
+#define S_UNLOCK(lock) \
+	do { \
+		uint32 actual = pg_atomic_exchange_u32((lock), 0); \
+		if (unlikely(actual == 2)) \
+			futex_unlock((lock), actual); \
+	} while (0)
+
+extern int futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func);
+extern int futex_unlock(volatile slock_t *lock, uint32 current);
+
+/* TAS only needed for regress */
+#define TAS(lock) tas(lock)
+
+static __inline__ int
+tas(volatile slock_t
Re: Do we want a hashset type?
On Wed, 31 May 2023 at 18:40, Joel Jacobson wrote: > > On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote: > > I think this needs a better explanation - what exactly is a hashset in > > this context? Something like an array with a hash for faster lookup of > > unique elements, or what? > > In this context, by "hashset" I am indeed referring to a data structure > similar > to an array, where each element would be unique, and lookups would be faster > than arrays for larger number of elements due to hash-based lookups. > > This data structure would store identifiers (IDs) of the nodes, not the > complete > nodes themselves. Have you looked at roaring bitmaps? There is a pg_roaringbitmap extension [1] already available that offers very fast unions, intersections and membership tests over integer sets. I used it to get some pretty impressive performance results for faceting search on large document sets. [2] Depending on the graph fan-outs and operations it might make sense in the graph use case. For small sets it's probably not too different from the intarray extension in contrib. But for finding intersections over large sets (i.e. a join) it's very-very fast. If the workload is traversal heavy it might make sense to even cache materialized transitive closures up to some depth (a friend-of-a-friend list). Roaring bitmaps only support int4 right now, but that is easily fixable. And they need a relatively dense ID space to get the performance boost, which seems essential to the approach. The latter issue means that it can't be easily dropped into GIN or B-tree indexes for ctid storage. [1] https://github.com/ChenHuajun/pg_roaringbitmap [2] https://github.com/cybertec-postgresql/pgfaceting -- Ants Aasma www.cybertec-postgresql.com
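To illustrate why a dense ID space matters: roaring bitmaps store dense 2^16-entry chunks as plain bitmaps of 1024 64-bit words, so intersecting two such chunks is a word-wise AND plus popcount. A standalone sketch of that inner loop (illustrative only, not pg_roaringbitmap code):

```c
#include <stdint.h>

#define CHUNK_WORDS 1024    /* 1024 x 64 bits = one 65536-entry chunk */

/* Count members common to two dense chunks. This word-at-a-time AND
 * is why set intersections (i.e. joins) over dense IDs are so cheap:
 * 65536 candidate members are processed in 1024 iterations. */
static uint64_t
chunk_intersection_count(const uint64_t *a, const uint64_t *b)
{
    uint64_t count = 0;

    for (int i = 0; i < CHUNK_WORDS; i++)
        count += (uint64_t) __builtin_popcountll(a[i] & b[i]);
    return count;
}
```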
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Mon, 20 Mar 2023 at 00:59, Melanie Plageman wrote: > > On Wed, Mar 15, 2023 at 6:46 AM Ants Aasma wrote: > > > > On Wed, 15 Mar 2023 at 02:29, Melanie Plageman > > wrote: > > > As for routine vacuuming and the other buffer access strategies, I think > > > there is an argument for configurability based on operator knowledge -- > > > perhaps your workload will use the data you are COPYing as soon as the > > > COPY finishes, so you might as well disable a buffer access strategy or > > > use a larger fraction of shared buffers. Also, the ring sizes were > > > selected sixteen years ago and average server memory and data set sizes > > > have changed. > > > > To be clear I'm not at all arguing against configurability. I was > > thinking that dynamic use could make the configuration simpler by self > > tuning to use no more buffers than is useful. > > Yes, but I am struggling with how we would define "useful". For copy and vacuum, the only reason I can see for keeping visited buffers around is to avoid flushing WAL or at least doing it in larger batches. Once the ring is big enough that WAL doesn't need to be flushed on eviction, making it bigger only wastes space that could be used by something that is not going to be evicted soon. > > > StrategyRejectBuffer() will allow bulkreads to, as you say, use more > > > buffers than the original ring size, since it allows them to kick > > > dirty buffers out of the ring and claim new shared buffers. > > > > > > Bulkwrites and vacuums, however, will inevitably dirty buffers and > > > require flushing the buffer (and thus flushing the associated WAL) when > > > reusing them. Bulkwrites and vacuum do not kick dirtied buffers out of > > > the ring, since dirtying buffers is their common case. A dynamic > > > resizing like the one you suggest would likely devolve to vacuum and > > > bulkwrite strategies always using the max size. 
> > > > I think it should self stabilize around the point where the WAL is > > either flushed by other commit activity, WAL writer or WAL buffers > > filling up. Writing out their own dirtied buffers will still happen, > > just the associated WAL flushes will be in larger chunks and possibly > > done by other processes. > > They will have to write out any WAL associated with modifications to the > dirty buffer before flushing it, so I'm not sure I understand how this > would work. By the time the dirty buffer needs eviction the WAL associated with it can already be written out by concurrent commits, WAL writer or by WAL buffers filling up. The bigger the ring is, the higher the chance that one of these will happen before we loop around. > > > As for decreasing the ring size, buffers are only "added" to the ring > > > lazily and, technically, as it is now, buffers which have been added > > > added to the ring can always be reclaimed by the clocksweep (as long as > > > they are not pinned). The buffer access strategy is more of a > > > self-imposed restriction than it is a reservation. Since the ring is > > > small and the buffers are being frequently reused, odds are the usage > > > count will be 1 and we will be the one who set it to 1, but there is no > > > guarantee. If, when attempting to reuse the buffer, its usage count is > > > > 1 (or it is pinned), we also will kick it out of the ring and go look > > > for a replacement buffer. > > > > Right, but while the buffer is actively used by the ring it is > > unlikely that clocksweep will find it at usage 0 as the ring buffer > > should cycle more often than the clocksweep. Whereas if the ring stops > > using a buffer, clocksweep will eventually come and reclaim it. And if > > the ring shrinking decision turns out to be wrong before the > > clocksweep gets around to reusing it, we can bring the same buffer > > back into the ring. 
> > I can see what you mean about excluding a buffer from the ring being a > more effective way of allowing it to be reclaimed. However, I'm not sure > I understand the use case. If the operation, say vacuum, is actively > using the buffer and keeping its usage count at one, then what would be > the criteria for it to decide to stop using it? The criteria for reducing ring size could be that we have cycled the ring buffer n times without having to do any WAL flushes. > Also, if vacuum used the buffer once and then didn't reuse it but, for > some reason, the vacuum isn't over, it isn't any different at that point > than some other buffer with a usage count of one. It isn't any harder > for it to be reclaimed by the clocksweep
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Wed, 15 Mar 2023 at 02:57, Melanie Plageman wrote: > > > Subject: [PATCH v3 3/3] add vacuum option to specify ring_size and guc > > > > > #define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var > > > +#define bufsize_limit_to_nbuffers(bufsize) (bufsize * 1024 / BLCKSZ) > > > > Macros are normally be capitalized > > Yes, there doesn't seem to be a great amount of consistency around > this... See pgstat.c read_chunk_s and bufmgr.c BufHdrGetBlock and > friends. Though there are probably more capitalized than not. Since it > does a bit of math and returns a value, I wanted to convey that it was > more like a function. Also, since the name was long, I thought all-caps > would be hard to read. However, if you or others feel strongly, I am > attached neither to the capitalization nor to the name at all (what do > you think of the name?). A static inline function seems like a less surprising and more type safe solution for this. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
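For illustration, the inline-function form of that macro could look like the following (the BLCKSZ value here is a stand-in so the sketch is self-contained; in PostgreSQL it comes from the build):

```c
#ifndef BLCKSZ
#define BLCKSZ 8192     /* stand-in; defined by configure in PostgreSQL */
#endif

/* Same arithmetic as the bufsize_limit_to_nbuffers macro, but with
 * argument type checking and single evaluation, and it reads as an
 * ordinary function at the call site. */
static inline int
bufsize_limit_to_nbuffers(int bufsize_kb)
{
    return bufsize_kb * 1024 / BLCKSZ;
}
```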
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Wed, 15 Mar 2023 at 02:29, Melanie Plageman wrote: > As for routine vacuuming and the other buffer access strategies, I think > there is an argument for configurability based on operator knowledge -- > perhaps your workload will use the data you are COPYing as soon as the > COPY finishes, so you might as well disable a buffer access strategy or > use a larger fraction of shared buffers. Also, the ring sizes were > selected sixteen years ago and average server memory and data set sizes > have changed. To be clear I'm not at all arguing against configurability. I was thinking that dynamic use could make the configuration simpler by self tuning to use no more buffers than is useful. > StrategyRejectBuffer() will allow bulkreads to, as you say, use more > buffers than the original ring size, since it allows them to kick > dirty buffers out of the ring and claim new shared buffers. > > Bulkwrites and vacuums, however, will inevitably dirty buffers and > require flushing the buffer (and thus flushing the associated WAL) when > reusing them. Bulkwrites and vacuum do not kick dirtied buffers out of > the ring, since dirtying buffers is their common case. A dynamic > resizing like the one you suggest would likely devolve to vacuum and > bulkwrite strategies always using the max size. I think it should self stabilize around the point where the WAL is either flushed by other commit activity, WAL writer or WAL buffers filling up. Writing out their own dirtied buffers will still happen, just the associated WAL flushes will be in larger chunks and possibly done by other processes. > As for decreasing the ring size, buffers are only "added" to the ring > lazily and, technically, as it is now, buffers which have been added > added to the ring can always be reclaimed by the clocksweep (as long as > they are not pinned). The buffer access strategy is more of a > self-imposed restriction than it is a reservation. 
Since the ring is > small and the buffers are being frequently reused, odds are the usage > count will be 1 and we will be the one who set it to 1, but there is no > guarantee. If, when attempting to reuse the buffer, its usage count is > > 1 (or it is pinned), we also will kick it out of the ring and go look > for a replacement buffer. Right, but while the buffer is actively used by the ring it is unlikely that clocksweep will find it at usage 0 as the ring buffer should cycle more often than the clocksweep. Whereas if the ring stops using a buffer, clocksweep will eventually come and reclaim it. And if the ring shrinking decision turns out to be wrong before the clocksweep gets around to reusing it, we can bring the same buffer back into the ring. > I do think that it is a bit unreasonable to expect users to know how > large they would like to make their buffer access strategy ring. What we > want is some way of balancing different kinds of workloads and > maintenance tasks reasonably. If your database has no activity because > it is the middle of the night or it was shutdown because of transaction > id wraparound, there is no reason why vacuum should limit the number of > buffers it uses. I'm sure there are many other such examples. Ideally yes, though I am not hopeful of finding a solution that does this any time soon. Just to take your example, if a nightly maintenance job wipes out the shared buffer contents slightly optimizing its non time-critical work and then causes morning user visible load to have big latency spikes due to cache misses, that's not a good tradeoff either. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode
On Sat, 11 Mar 2023 at 16:55, Melanie Plageman wrote: > > > On Tue, Feb 28, 2023 at 3:16 AM Bharath Rupireddy > > wrote: > > > > > On Thu, Jan 12, 2023 at 6:06 AM Andres Freund wrote: > > > > > > > > On 2023-01-11 17:26:19 -0700, David G. Johnston wrote: > > > > > Should we just add "ring_buffers" to the existing "shared_buffers" and > > > > > "temp_buffers" settings? > > > > > > > > The different types of ring buffers have different sizes, for good > > > > reasons. So > > > > I don't see that working well. I also think it'd be more often useful to > > > > control this on a statement basis - if you have a parallel import tool > > > > that > > > > starts NCPU COPYs you'd want a smaller buffer than a single threaded > > > > COPY. Of > > > > course each session can change the ring buffer settings, but still. > > > > > > How about having GUCs for each ring buffer (bulk_read_ring_buffers, > > > bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)? > > > These options can help especially when statement level controls aren't > > > easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If > > > needed users can also set them at the system level. For instance, one > > > can set bulk_write_ring_buffers to other than 16MB or -1 to disable > > > the ring buffer to use shared_buffers and run a bunch of bulk write > > > queries. > > In attached v3, I've changed the name of the guc from buffer_usage_limit > to vacuum_buffer_usage_limit, since it is only used for vacuum and > autovacuum. Sorry for arriving late to this thread, but what about sizing the ring dynamically? From what I gather the primary motivation for larger ring size is avoiding WAL flushes due to dirty buffer writes. We already catch that event with StrategyRejectBuffer(). So maybe a dynamic sizing algorithm could be applied to the ringbuffer. Make the buffers array in strategy capable of holding up to the limit of buffers, but set ring size conservatively. 
If we have to flush WAL, double the ring size (up to the limit). If we loop around the ring without flushing, decrease the ring size by a small amount to let clock sweep reclaim them for use by other backends. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
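The feedback loop sketched above could look roughly like this in C (all names and constants are invented for the sketch, not actual PostgreSQL code):

```c
#define RING_MIN   32    /* conservative starting ring size, in buffers */
#define RING_MAX 4096    /* hard cap, e.g. from a GUC */

typedef struct RingState
{
    int nbuffers;            /* current ring size */
    int flushes_this_cycle;  /* WAL flushes since the last wrap-around */
} RingState;

/* Called when reusing a ring buffer forced a WAL flush: grow the ring
 * so that next time around the WAL has had more time to be flushed by
 * someone else (commits, WAL writer, WAL buffers filling up). */
static void
ring_on_wal_flush(RingState *ring)
{
    ring->flushes_this_cycle++;
    if (ring->nbuffers < RING_MAX)
        ring->nbuffers = ring->nbuffers * 2 > RING_MAX
            ? RING_MAX : ring->nbuffers * 2;
}

/* Called once per trip around the ring: a whole cycle without a WAL
 * flush means the ring is big enough, so release a buffer and let the
 * clock sweep reclaim it for other backends. */
static void
ring_on_cycle_end(RingState *ring)
{
    if (ring->flushes_this_cycle == 0 && ring->nbuffers > RING_MIN)
        ring->nbuffers--;
    ring->flushes_this_cycle = 0;
}
```

The asymmetry (double on flush, shrink by one on a clean cycle) is deliberate: growing quickly escapes the flush-per-buffer regime, while shrinking slowly avoids oscillation.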
Re: Standby recovers records from wrong timeline
On Fri, 21 Oct 2022 at 11:44, Kyotaro Horiguchi wrote: > > At Fri, 21 Oct 2022 17:12:45 +0900 (JST), Kyotaro Horiguchi > wrote in > > latest works. It doesn't consider the case of explicit target timelines > > so it's just a PoC. (So this doesn't work if recovery_target_timeline > > is set to 2 for the "standby" in the repro.) > > So, finally I noticed that the function XLogFileReadAnyTLI is not > needed at all if we are going this direction. > > Regardless of whether recovery_target_timeline is latest or any explicit > timeline id or checkpoint timeline, what we can do to reach the target > timeline is just to follow the history file's direction. > > If segments are partly gone while reading on a timeline, a segment on > the older timelines is just crap since it should be incompatible. I came to the same conclusion. I adjusted XLogFileReadAnyTLI to not use any timeline that ends within the segment (attached patch). At this point the name of the function becomes really wrong; XLogFileReadCorrectTLI or something to that effect would be much more descriptive, and the code could be simplified. However, I'm not particularly happy with this approach, as it will not use valid WAL even when it is available. Consider the scenario of a cascading failure: node A has a hard failure, then node B promotes, archives the history file, but doesn't see enough traffic to archive a full segment before failing itself. While this is happening we restore node A from backup and start it up as a standby. If node B fails before node A has a chance to connect, then either we continue recovery on the wrong timeline (current behavior) or we will not try to recover the first portion of the archived WAL file (with patch). So I think the correct approach would still be to have ReadRecord() or ApplyWalRecord() determine that switching timelines is needed.
--
Ants Aasma
www.cybertec-postgresql.com

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea6..73bde98b920 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4171,6 +4171,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 	{
 		TimeLineHistoryEntry *hent = (TimeLineHistoryEntry *) lfirst(cell);
 		TimeLineID	tli = hent->tli;
+		XLogSegNo	beginseg = 0;

 		if (tli < curFileTLI)
 			break;				/* don't bother looking at too-old TLIs */
@@ -4181,7 +4182,6 @@
 		 */
 		if (hent->begin != InvalidXLogRecPtr)
 		{
-			XLogSegNo	beginseg = 0;

 			XLByteToSeg(hent->begin, beginseg, wal_segment_size);
@@ -4223,6 +4223,14 @@
 				return fd;
 			}
 		}
+
+		/*
+		 * For segments containing known timeline switches only consider the
+		 * last timeline as redo otherwise doesn't know when to switch
+		 * timelines.
+		 */
+		if (segno == beginseg && beginseg > 0)
+			break;
 	}

 	/* Couldn't find it.  For simplicity, complain about front timeline */
Re: Standby recovers records from wrong timeline
On Thu, 20 Oct 2022 at 11:30, Kyotaro Horiguchi wrote: > > primary_restored did a time-travel to the past a bit because of the > recovery_target=immediate. In other words, the primary_restored and > the replica diverge. I don't think it is legit to connect a diverged > standby to a primary. primary_restored did time-travel to the past; as we're doing PITR on the primary, that's the expected behavior. However, the replica has not diverged — it's a copy of the exact same basebackup. The use case is restoring a cluster from backup using PITR and using the same backup to create a standby. Currently this breaks when the primary has not yet archived any segments. > So, about the behavior in doubt, it is the correct behavior to > seemingly ignore the history file in the archive. Recovery assumes > that the first half of the first segment of the new timeline is the > same with the same segment of the old timeline (.partial) so it is > legit to read the file til the end and that causes the > replica goes beyond the divergence point. What is happening is that primary_restored has a timeline switch at tli 2, lsn 0/2000100, and the next insert record starts in the same segment. The replica is starting on the same backup on timeline 1, tries to find tli 2 seg 2, which is not archived yet, so it falls back to tli 1 seg 2 and replays tli 1 seg 2, continuing to tli 1 seg 3, then connects to the primary and starts applying WAL starting from tli 2 seg 4. To me that seems completely broken. > As you know, when new primary starts a diverged history, the > recommended way is to blow (or stash) away the archive, then take a > new backup from the running primary. My understanding is that backup archives are supposed to remain valid even after PITR or, equivalently, a lagging standby promoting. -- Ants Aasma Senior Database Engineer www.cybertec-postgresql.com
Standby recovers records from wrong timeline
When a standby is recovering to a timeline that doesn't have any segments archived yet, it will just blindly blow past the timeline switch point and keep on recovering on the old timeline. Typically that will eventually result in an error about an incorrect prev-link, but under unhappy circumstances it can result in the standby silently having different contents. Attached is a shell script that reproduces the issue. This goes back to at least v12, probably longer.

I think we should be keeping track of where the current replay timeline is going to end and not read any records past it on the old timeline. Maybe while at it, we should also track that the next record should be a checkpoint record for the timeline switch and error out if not. Thoughts?

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

[Attachment: recoverytest.sh]
Re: storing an explicit nonce
On Wed, 13 Oct 2021 at 02:20, Bruce Momjian wrote:
> On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian wrote:
> > > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > > Page encrypting to all zeros is for all practical purposes
> > > > impossible to hit. Basically an attacker would have to be able to
> > > > arbitrarily set the whole contents of the page and they would then
> > > > achieve that this page gets ignored.
> > >
> > > Uh, how do we know that valid data can't produce an encrypted
> > > all-zero page?
> >
> > Because the chances of that happening by accident are equivalent to
> > making a series of commits to postgres and ending up with the same git
> > commit hash 400 times in a row.
>
> Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> empty page, and if not, an error? Seems easier than checking if each
> page contains all zeros every time.

We already check it anyway, see PageIsVerifiedExtended().

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Wed, 13 Oct 2021 at 00:25, Bruce Momjian wrote:
> On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian wrote:
> > > Well, how do you detect an all-zero page vs a page that encrypted to
> > > all zeros?
> >
> > Page encrypting to all zeros is for all practical purposes impossible
> > to hit. Basically an attacker would have to be able to arbitrarily set
> > the whole contents of the page and they would then achieve that this
> > page gets ignored.
>
> Uh, how do we know that valid data can't produce an encrypted all-zero
> page?

Because the chances of that happening by accident are equivalent to
making a series of commits to postgres and ending up with the same git
commit hash 400 times in a row.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Tue, 12 Oct 2021 at 16:14, Bruce Momjian wrote:
> Well, how do you detect an all-zero page vs a page that encrypted to all
> zeros?

Page encrypting to all zeros is for all practical purposes impossible to
hit. Basically an attacker would have to be able to arbitrarily set the
whole contents of the page and they would then achieve that this page
gets ignored.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Mon, 11 Oct 2021 at 22:15, Bruce Momjian wrote:
> > Yes, that's the direction that I was thinking also and specifically with
> > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > everything else, and to address the concern around the nonce/tweak/etc
> > being the same sometimes across multiple writes. Another thing to
> > consider is if we want to encrypt zero'd pages. There was a point
> > brought up that if we do then we are encrypting a fair bit of very
> > predictable bytes and that's not great (though there's a fair bit about
> > our pages that someone could quite possibly predict anyway based on
> > table structures and such...). I would think that if it's easy enough
> > to not encrypt zero'd pages that we should avoid doing so. Don't recall
> > offhand which way zero'd pages were being handled already but thought it
> > made sense to mention that as part of this discussion.
>
> Yeah, I wanted to mention that. I don't see any security difference
> between fully-zero pages, pages with headers and no tuples, and pages
> with headers and only a few tuples. If any of those are insecure, they
> all are. Therefore, I don't see any reason to treat them differently.

We had to special-case zero pages and not encrypt them because, as far as
I can tell, there is no atomic way to extend a file and initialize it to
Enc(zero) in the same step.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Thu, 7 Oct 2021 at 21:52, Stephen Frost wrote:
> With XTS this isn't actually the case though, is it..? Part of the
> point of XTS is that the last block doesn't have to be a full 16 bytes.
> What you're saying is true for XEX, but that's also why XEX isn't used
> for FDE in a lot of cases, because disk sectors aren't typically
> divisible by 16.
>
> https://en.wikipedia.org/wiki/Disk_encryption_theory
>
> Assuming that's correct, and I don't see any reason to doubt it, then
> perhaps it would make sense to have the LSN be unencrypted and include
> it in the tweak as that would limit the risk from re-use of the same
> tweak over time.

Right, my thought was to leave the first 8 bytes of pages, the LSN,
unencrypted and include the value in the tweak. Just tested that OpenSSL
aes-256-xts handles non multiple-of-16 messages just fine.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Wed, 6 Oct 2021 at 23:08, Bruce Momjian wrote:
> Yes, I would prefer we don't use the LSN. I only mentioned it since
> Ants Aasma mentioned LSN use above.

Is there a particular reason why you would prefer not to use LSN? I
suggested it because in my view having a variable tweak is still better
than not having it, even if we deem the risks of XTS tweak reuse not
important for our use case. The comment was made under the assumption that
requiring wal_log_hints for encryption is acceptable.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: storing an explicit nonce
On Mon, 27 Sept 2021 at 23:34, Bruce Momjian wrote:
> On Sun, Sep 5, 2021 at 10:51:42PM +0800, Sasasu wrote:
> > Hi, community,
> >
> > It looks like we are still considering AES-CBC, AES-XTS, and
> > AES-GCM(-SIV). I want to say something that we don't think about.
> >
> > For AES-CBC, the IV should be not predictable. I think LSN or HASH(LSN,
> > block number or something) is predictable. There are many CVE related to
> > AES-CBC with a predictable IV.
>
> The LSN would change every time the page is modified, so while the LSN
> could be predicted, it would not be reused. However, there is currently
> no work being done on page-level encryption of Postgres.

We are still working on our TDE patch. Right now the focus is on
refactoring temporary file access to make the TDE patch itself smaller.
Reconsidering encryption mode choices given concerns expressed is next.
Currently a viable option seems to be AES-XTS with LSN added into the IV.
XTS doesn't have an issue with predictable IV and isn't totally broken in
case of IV reuse.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: track_planning causing performance regression
On Tue, 30 Jun 2020 at 08:43, Fujii Masao wrote:
> > The problem looks to be that spinlocks are terrible with overloaded
> > CPU and a contended spinlock. A process holding the spinlock might
> > easily get scheduled out leading to excessive spinning by everybody.
> > I think a simple thing to try would be to replace the spinlock with
> > LWLock.
>
> Yes. Attached is the POC patch that replaces per-counter spinlock with
> LWLock.

Great. I think this is the one that should get considered for testing.

> > I did a prototype patch that replaces spinlocks with futexes, but was
> > not able to find a workload where it mattered.
>
> I'm not familiar with futex, but could you tell me why you used futex
> instead of LWLock that we already have? Is futex portable?

Futex is a Linux kernel call that allows one to build a lock whose
uncontended path works fully in user space, almost exactly like a
spinlock, while falling back to syscalls that wait for wakeup in case of
contention. It's not portable, but probably something similar could be
implemented for other operating systems.

I did not pursue this further because it became apparent that every
performance critical spinlock had already been removed. To be clear, I am
not advocating for this patch to get included. I just had the patch
immediately available and it could have confirmed that using a better lock
fixes things.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
Re: track_planning causing performance regression
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud wrote: > On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao > wrote: > > > > On 2020/06/29 16:05, Julien Rouhaud wrote: > > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins > wrote: > > >> > > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 > shows > > > > Thanks for the benchmark! > > > > > > >> ~45% performance drop [2] at high DB connection counts (when compared > with v12.3) > > > > That's bad :( > > > > > > >> > > >> Disabling pg_stat_statements.track_planning (which is 'On' by default) > > >> brings the TPS numbers up to v12.3 levels. > > >> > > >> The inflection point (in this test-case) is 128 Connections, beyond > which the > > >> TPS numbers are consistently low. Looking at the mailing list [1], > this issue > > >> didn't surface earlier possibly since the regression is trivial at > low connection counts. > > >> > > >> It would be great if this could be optimized further, or > track_planning > > >> disabled (by default) so as to not trip users upgrading from v12 with > pg_stat_statement > > >> enabled (but otherwise not particularly interested in track_planning). > > > > Your benchmark result seems to suggest that the cause of the problem is > > the contention of per-query spinlock in pgss_store(). Right? > > This lock contention is likely to happen when multiple sessions run > > the same queries. > > > > One idea to reduce that lock contention is to separate per-query spinlock > > into two; one is for planning, and the other is for execution. > pgss_store() > > determines which lock to use based on the given "kind" argument. > > To make this idea work, also every pgss counters like shared_blks_hit > > need to be separated into two, i.e., for planning and execution. > > This can probably remove some overhead, but won't it eventually hit > the same issue when multiple connections try to plan the same query, > given the number of different queries and very low execution runtime? 
> It'll also quite increase the shared memory consumption.
>
> I'm wondering if we could instead use atomics to store the counters.
> The only downside is that we won't guarantee per-row consistency
> anymore, which may be problematic.

The problem looks to be that spinlocks are terrible with overloaded CPU
and a contended spinlock. A process holding the spinlock might easily get
scheduled out, leading to excessive spinning by everybody. I think a
simple thing to try would be to replace the spinlock with LWLock.

I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered. We have done a great job at
eliminating spinlocks from contended code paths. Robins, perhaps you could
try it to see if it reduces the regression you are observing. The patch is
against v13 stable branch.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 7fac0703419..56d45b7cfce 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -90,6 +90,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;

 	init_spin_delay(&delayStatus, file, line, func);
@@ -102,6 +103,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 	finish_spin_delay(&delayStatus);

 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }

 #ifdef USE_DEFAULT_S_UNLOCK
@@ -218,6 +221,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }

+#ifdef HAS_FUTEX
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+				   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int			i, s;
+
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite
+	 * likely the lock holder is still running.
+	 **/
+	if (likely(current < 2))
+	{
+		uint32		expected;
+
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+				return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2))
+		{
+			if (
Re: what can go in root.crt ?
On Tue, 2 Jun 2020 at 20:14, Bruce Momjian wrote:
> The server certificate should be issued by a certificate authority root
> outside of your organization only if you want people outside of your
> organization to trust your server certificate, but you are then asking
> for the client to only trust an intermediate inside your organization.
> The big question is why bother having the server certificate chain to a
> root certificate you don't trust when you have no intention of having
> clients outside of your organization trust the server certificate.
> Postgres could be made to handle such cases, but is it really a valid
> configuration we should support?

I think the "why" the org cert is not root was already made clear, that is
the company policy. I don't think postgres should take a stance whether
the certificate designated as the root of trust is self-signed or claims
to get its power from somewhere else.

It's pretty easy to conceive of certificate management procedures that
make use of this chain to implement certificate replacement securely. For
example one might trust the global issuer to verify that a CSR is coming
from the O= value that it's claiming to come from to automate replacement
of intermediate certificates, but not trust that every other sub-CA signed
by root and their sub-sub-CA-s are completely honest and secure.

Regards,
Ants Aasma
Re: spin_delay() for ARM
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote: > what I know, pgbench cannot be used for testing spinlocks problems. > > Maybe you can see this issue when a) use higher number clients - hundreds, > thousands. Decrease share memory, so there will be press on related spin lock. There really aren't many spinlocks left that could be tickled by a normal workload. I looked for a way to trigger spinlock contention when I prototyped a patch to replace spinlocks with futexes. The only one that I could figure out a way to make contended was the lock protecting parallel btree scan. A highly parallel index only scan on a fully cached index should create at least some spinlock contention. Regards, Ants Aasma
Re: Parallel copy
On Mon, 13 Apr 2020 at 23:16, Andres Freund wrote: > > Still, if the reader does the splitting, then you don't need as much > > IPC, right? The shared memory data structure is just a ring of bytes, > > and whoever reads from it is responsible for the rest. > > I don't think so. If only one process does the splitting, the > exclusively locked section is just popping off a bunch of offsets of the > ring. And that could fairly easily be done with atomic ops (since what > we need is basically a single producer multiple consumer queue, which > can be done lock free fairly easily ). Whereas in the case of each > process doing the splitting, the exclusively locked part is splitting > along lines - which takes considerably longer than just popping off a > few offsets. I see the benefit of having one process responsible for splitting as being able to run ahead of the workers to queue up work when many of them need new data at the same time. I don't think the locking benefits of a ring are important in this case. At current rather conservative chunk sizes we are looking at ~100k chunks per second at best, normal locking should be perfectly adequate. And chunk size can easily be increased. I see the main value in it being simple. But there is a point that having a layer of indirection instead of a linear buffer allows for some workers to fall behind. Either because the kernel scheduled them out for a time slice, or they need to do I/O or because inserting some tuple hit an unique conflict and needs to wait for a tx to complete or abort to resolve. With a ring buffer reading has to wait on the slowest worker reading its chunk. Having workers copy the data to a local buffer as the first step would reduce the probability of hitting any issues. But still, at GB/s rates, hiding a 10ms timeslice of delay would need 10's of megabytes of buffer. FWIW. 
I think just increasing the buffer is good enough - the CPUs processing this workload are likely to have tens to hundreds of megabytes of cache on board.
Re: Parallel copy
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh wrote: > 1. Each worker scans a distinct fixed sized chunk of the CSV file and > collects the following three stats from the chunk: > a) number of quotes > b) position of the first new line after even number of quotes > c) position of the first new line after odd number of quotes > 2. Once stats from all the chunks are collected, the leader identifies > the adjusted chunk boundaries by iterating over the stats linearly: > - For the k-th chunk, the leader adds the number of quotes in k-1 chunks. > - If the number is even, then the k-th chunk does not start in the > middle of a quoted field, and the first newline after an even number > of quotes (the second collected information) is the first record > delimiter in this chunk. > - Otherwise, if the number is odd, the first newline after an odd > number of quotes (the third collected information) is the first record > delimiter. > - The end position of the adjusted chunk is obtained based on the > starting position of the next adjusted chunk. The trouble is that, at least with current coding, the number of quotes in a chunk can depend on whether the chunk started in a quote or not. That's because escape characters only count inside quotes. See for example the following csv: foo,\"bar baz",\"xyz" This currently parses as one line and the number of parsed quotes doesn't change if you add a quote in front. But the general approach of doing the tokenization in parallel and then a serial pass over the tokenization would still work. The quote counting and new line finding just has to be done for both starting in quote and not starting in quote case. Using phases doesn't look like the correct approach - the tokenization can be prepared just in time for the serial pass and processing the chunk can proceed immediately after. 
This could all be done by having the data in a single ringbuffer with a processing pipeline where one process does the reading, then workers grab tokenization chunks as they become available, then one process handles determining the chunk boundaries, after which the chunks are processed. But I still don't think this is something to worry about for the first version. Just a better line splitting algorithm should go a looong way in feeding a large number of workers, even when inserting to an unindexed unlogged table. If we get the SIMD line splitting in, it will be enough to overwhelm most I/O subsystems available today. Regards, Ants Aasma
Re: Parallel copy
On Wed, 8 Apr 2020 at 22:30, Robert Haas wrote: > - If we're unable to supply data to the COPY process as fast as the > workers could load it, then speed will be limited at that point. We > know reading the file from disk is pretty fast compared to what a > single process can do. I'm not sure we've tested what happens with a > network socket. It will depend on the network speed some, but it might > be useful to know how many MB/s we can pump through over a UNIX > socket. This raises a good point. If at some point we want to minimize the amount of memory copies then we might want to allow for RDMA to directly write incoming network traffic into a distributing ring buffer, which would include the protocol level headers. But at this point we are so far off from network reception becoming a bottleneck I don't think it's worth holding anything up for not allowing for zero copy transfers. > - The portion of the time that is used to split the lines is not > easily parallelizable. That seems to be a fairly small percentage for > a reasonably wide table, but it looks significant (13-18%) for a > narrow table. Such cases will gain less performance and be limited to > a smaller number of workers. I think we also need to be careful about > files whose lines are longer than the size of the buffer. If we're not > careful, we could get a significant performance drop-off in such > cases. We should make sure to pick an algorithm that seems like it > will handle such cases without serious regressions and check that a > file composed entirely of such long lines is handled reasonably > efficiently. I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass. The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. 
I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones. For rows that are consistently wider than the input buffer I think parallelism will still give a win - the serial phase is just memcpy through a ringbuffer, after which a worker goes away to perform the actual insert, letting the next worker read the data. The memcpy is already happening today, CopyReadLineText() copies the input buffer into a StringInfo, so the only extra work is synchronization between leader and worker. > - There could be index contention. Let's suppose that we can read data > super fast and break it up into lines super fast. Maybe the file we're > reading is fully RAM-cached and the lines are long. Now all of the > backends are inserting into the indexes at the same time, and they > might be trying to insert into the same pages. If so, lock contention > could become a factor that hinders performance. Different data distribution strategies can have an effect on that. Dealing out input data in larger or smaller chunks will have a considerable effect on contention, btree page splits and all kinds of things. I think the common theme would be a push to increase chunk size to reduce contention.. > - There could also be similar contention on the heap. Say the tuples > are narrow, and many backends are trying to insert tuples into the > same heap page at the same time. This would lead to many lock/unlock > cycles. This could be avoided if the backends avoid targeting the same > heap pages, but I'm not sure there's any reason to expect that they > would do so unless we make some special provision for it. I thought there already was a provision for that. Am I mis-remembering? > - What else? I bet the above list is not comprehensive. I think parallel copy patch needs to concentrate on splitting input data to workers. After that any performance issues would be basically the same as a normal parallel insert workload. 
There may well be bottlenecks there, but those could be tackled independently. Regards, Ants Aasma Cybertec
Re: Parallel copy
On Tue, 7 Apr 2020 at 08:24, vignesh C wrote: > Leader will create a circular queue > and share it across the workers. The circular queue will be present in > DSM. Leader will be using a fixed size queue to share the contents > between the leader and the workers. Currently we will have 100 > elements present in the queue. This will be created before the workers > are started and shared with the workers. The data structures that are > required by the parallel workers will be initialized by the leader, > the size required in dsm will be calculated and the necessary keys > will be loaded in the DSM. The specified number of workers will then > be launched. Leader will read the table data from the file and copy > the contents to the queue element by element. Each element in the > queue will have 64K size DSA. This DSA will be used to store tuple > contents from the file. The leader will try to copy as much content as > possible within one 64K DSA queue element. We intend to store at least > one tuple in each queue element. There are some cases where the 64K > space may not be enough to store a single tuple. Mostly in cases where > the table has toast data present and the single tuple can be more than > 64K size. In these scenarios we will extend the DSA space accordingly. > We cannot change the size of the dsm once the workers are launched. > Whereas in case of DSA we can free the dsa pointer and reallocate the > dsa pointer based on the memory size required. This is the very reason > for choosing DSA over DSM for storing the data that must be inserted > into the relation. I think the element based approach and requirement that all tuples fit into the queue makes things unnecessarily complex. The approach I detailed earlier allows for tuples to be bigger than the buffer. In that case a worker will claim the long tuple from the ring queue of tuple start positions, and starts copying it into its local line_buf. 
This can wrap around the buffer multiple times until the next start position shows up. At that point this worker can proceed with inserting the tuple and the next worker will claim the next tuple. This way nothing needs to be resized, there is no risk of a file with huge tuples running the system out of memory because each element will be reallocated to be huge and the number of elements is not something that has to be tuned. > We had a couple of options for the way in which queue elements can be stored. > Option 1: Each element (DSA chunk) will contain tuples such that each > tuple will be preceded by the length of the tuple. So the tuples will > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, > tuple-2), Or Option 2: Each element (DSA chunk) will contain only > tuples (tuple-1), (tuple-2), . And we will have a second > ring-buffer which contains a start-offset or length of each tuple. The > old design used to generate one tuple of data and process tuple by > tuple. In the new design, the server will generate multiple tuples of > data per queue element. The worker will then process data tuple by > tuple. As we are processing the data tuple by tuple, I felt both of > the options are almost the same. However Design1 was chosen over > Design 2 as we can save up on some space that was required by another > variable in each element of the queue. With option 1 it's not possible to read input data into shared memory and there needs to be an extra memcpy in the time critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently. Regards, Ants Aasma Cybertec
Re: Parallel copy
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including things
like narrow integer only tables, strings with newlines and escapes in
them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like. Grabbed
a random open data set from the web, about 800MB of narrow rows CSV [1].

Script:

CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:

# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object  Symbol
# ........  ........  .............  ......
    18.24%  postgres  postgres       [.] CopyReadLine
     9.23%  postgres  postgres       [.] NextCopyFrom
     8.87%  postgres  postgres       [.] NextCopyFromRawFields
     5.82%  postgres  postgres       [.] pg_verify_mbstr_len
     5.45%  postgres  postgres       [.] pg_strtoint32
     4.16%  postgres  postgres       [.] heap_fill_tuple
     4.03%  postgres  postgres       [.] heap_compute_data_size
     3.83%  postgres  postgres       [.] CopyFrom
     3.78%  postgres  postgres       [.] AllocSetAlloc
     3.53%  postgres  postgres       [.] heap_form_tuple
     2.96%  postgres  postgres       [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so   [.] __strlen_avx2
     1.72%  postgres  postgres       [.] AllocSetReset
     1.72%  postgres  postgres       [.] RelationPutHeapTuple
     1.47%  postgres  postgres       [.] heap_prepare_insert
     1.31%  postgres  postgres       [.] heap_multi_insert
     1.25%  postgres  postgres       [.] textin
     1.24%  postgres  postgres       [.] int4in
     1.05%  postgres  postgres       [.] tts_buffer_heap_clear
     0.85%  postgres  postgres       [.] pg_any_to_server
     0.80%  postgres  postgres       [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres       [.] cstring_to_text_with_len
     0.69%  postgres  postgres       [.] AllocSetFree
     0.60%  postgres  postgres       [.] appendBinaryStringInfo
     0.55%  postgres  postgres       [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres       [.] palloc
     0.54%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned
     0.51%  postgres  postgres       [.] palloc0
     0.51%  postgres  postgres       [.] pg_encoding_max_length
     0.48%  postgres  postgres       [.] enlargeStringInfo
     0.47%  postgres  postgres       [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres       [.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of overhead with current
splitting into lines being the largest portion. Amdahl's law says that
splitting into tuples needs to be made fast before parallelizing makes any
sense.

Regards,
Ants Aasma

[1] https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip
Re: Parallel copy
On Thu, 20 Feb 2020 at 18:43, David Fetter wrote:
> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think the wc2 is showing that maybe instead of parallelizing the
> > parsing, we might instead try using a different tokenizer/parser and
> > make the implementation more efficient instead of just throwing more
> > CPUs on it.
>
> That was what I had in mind.
>
> > I don't know if our code is similar to what wc does, maybe parsing
> > csv is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state machine based approach is that the state
transitions form a dependency chain, which means that at best the
processing rate will be 4-5 cycles per byte (L1 latency to fetch the next
state). I whipped together a quick prototype that uses SIMD and bitmap
manipulations to do the equivalent of CopyReadLineText() in csv mode
including quotes and escape handling, this runs at 0.25-0.5 cycles per
byte.

Regards,
Ants Aasma

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <immintrin.h>

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

/*
 * Create a bitmap of matching characters in the next 64 bytes
 **/
static inline uint64_t
find_chars(__m256i *data, char c)
{
	const __m256i mask = _mm256_set1_epi8(c);
	uint64_t result = (uint32_t)
		_mm256_movemask_epi8(_mm256_cmpeq_epi8(data[0], mask));
	result |= ((uint64_t)
		_mm256_movemask_epi8(_mm256_cmpeq_epi8(data[1], mask))) << 32;
	return result;
}

/*
 * Creates a bitmap of unpaired escape characters
 **/
static inline uint64_t
find_unpaired_escapes(uint64_t escapes)
{
	// TODO: handle unpaired escape from end of last iteration
	uint64_t p, e, r;
	p = escapes;
	e = escapes;
	r = escapes;
	while (e)
	{
		p = e;
		e = (e << 1) & escapes;
		r ^= e;
	}
	return r & p;
}

/*
 * Creates a bitmap mask of quoted sections given locations of
 * quote characters.
 **/
static inline uint64_t
find_quote_mask(uint64_t quote_bits, uint64_t *prev_inside_quote)
{
	uint64_t mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(
		_mm_set_epi64x(0ULL, quote_bits), _mm_set1_epi8(0xFF), 0));
	mask ^= *prev_inside_quote;
	*prev_inside_quote = ((int64_t) mask) >> 63;
	return mask;
}

/*
 * Parses len bytes from buf according to csv rules and writes start
 * positions of records to output. Returns number of rows found.
 **/
int64_t
parseIntoLines(char *buf, size_t len, size_t *output)
{
	__m256i    *input = (__m256i *) buf;
	uint64_t	prev_inside_quote = 0;
	size_t		pos = 0;
	uint64_t	numfound = 0;

	*output++ = 0;
	numfound++;
	while (pos < len - 64)
	{
		uint64_t quotes = find_chars(input, '"');
		uint64_t escapes = find_chars(input, '\\');
		uint64_t unpaired_escapes = find_unpaired_escapes(escapes);
		uint64_t unescaped_quotes = quotes & ~(unpaired_escapes << 1);
		uint64_t newlines = find_chars(input, '\n');
		uint64_t quote_mask = find_quote_mask(unescaped_quotes, &prev_inside_quote);
		uint64_t tokenpositions = newlines & ~quote_mask;
		uint64_t carriages = find_chars(input, '\r') & ~quote_mask;

		if (unlikely(carriages != 0))
			exit(1);

		uint64_t offset = 0;
		while (tokenpositions > 0)
		{
			int numchars = __builtin_ctzll(tokenpositions);
			tokenpositions >>= numchars;
			tokenpositions >>= 1;
			offset += numchars + 1;
			*output++ = pos + offset;
			numfound++;
		}
		pos += 64;
		input += 2;
	}
	// TODO: handle tail
	return numfound;
}

int
main(int argc, char *argv[])
{
	char	   *buf;
	size_t	   *lines;
	uint64_t	iters = 1;

	if (argc < 2)
	{
		printf("Usage: simdcopy csvfile [iterations]\n");
		return 1;
	}
	if (argc > 2)
	{
		iters = atol(argv[2]);
	}

#define BUFSIZE (1024*1024*1024)
	buf = aligned_alloc(64, BUFSIZE);
	lines = aligned_alloc(8, 128*1024*1024*sizeof(uint64_t));
	if (!buf || !lines)
		return 1;

	FILE *f = fopen(argv[1], "r");
	if (!f)
		return 1;

#define READBLOCK (1024*1024)
	size_t len = 0;
	while (len < BUFSIZE - READBLOCK)
	{
		size_t result = fread(buf + len, 1, READBLOCK, f);
		if (!result)
			break;
		len += result;
	}
	fclose(f);

	struct timespec start;
	struct timespec end;

	printf("Parsing %lu bytes, %lu times\n", len, iters);

	uint64_t numfound;
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (uint64_t i = 0; i < iters; i++)
	{
		numfound = parseIntoLines(buf, len, lines);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double delta = (end.tv_sec - start.tv_sec) + (1.e-9)*(end.tv_nsec - start.tv_nsec);
	printf("Found %lu rows in %lu bytes in %f milliseconds\n",
		   numfound, len*iters, delta*1000);
	printf(" Speed: %0.3f GB/s\n", len/delta/1e9*iters);
	return 0;
}
Re: Parallel copy
On Wed, 19 Feb 2020 at 06:22, Amit Kapila wrote: > > On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma wrote: > > > > On Tue, 18 Feb 2020 at 15:21, Amit Kapila wrote: > > > > > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma wrote: > > > > > > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila > > > > wrote: > > > > > This is something similar to what I had also in mind for this idea. I > > > > > had thought of handing over complete chunk (64K or whatever we > > > > > decide). The one thing that slightly bothers me is that we will add > > > > > some additional overhead of copying to and from shared memory which > > > > > was earlier from local process memory. And, the tokenization (finding > > > > > line boundaries) would be serial. I think that tokenization should be > > > > > a small part of the overall work we do during the copy operation, but > > > > > will do some measurements to ascertain the same. > > > > > > > > I don't think any extra copying is needed. > > > > > > > > > > I am talking about access to shared memory instead of the process > > > local memory. I understand that an extra copy won't be required. > > > > > > > The reader can directly > > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run > > > > CopyReadLineText() inner loop directly off of the buffer in shared > > > > memory. > > > > > > > > > > I am slightly confused here. AFAIU, the for(;;) loop in > > > CopyReadLineText is about finding the line endings which we thought > > > that the reader process will do. > > > > Indeed, I somehow misread the code while scanning over it. So > > CopyReadLineText > > currently copies data from cstate->raw_buf to the StringInfo in > > cstate->line_buf. In parallel mode it would copy it from the shared data > > buffer > > to local line_buf until it hits the line end found by the data reader. The > > amount of copying done is still exactly the same as it is now. 
> > > > Yeah, on a broader level it will be something like that, but actual > details might vary during implementation. BTW, have you given any > thoughts on one other approach I have shared above [1]? We might not > go with that idea, but it is better to discuss different ideas and > evaluate their pros and cons. > > [1] - > https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com It seems to me that at least for the general CSV case the tokenization to tuples is an inherently serial task. Adding thread synchronization to that path for coordinating between multiple workers is only going to make it slower. It may be possible to enforce limitations on the input (e.g. no quotes allowed) or do some speculative tokenization (e.g. if we encounter a quote before a newline, assume the chunk started in a quoted section) to make it possible to do the tokenization in parallel. But given that the simpler and more fully featured approach of handling it in a single reader process looks to be fast enough, I don't see the point. I rather think that the next big step would be to overlap reading input and tokenization, hopefully by utilizing Andres's work on async I/O. Regards, Ants Aasma
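The speculative-tokenization idea mentioned above can be sketched as follows. This is a hypothetical illustration, not code from any patch: the scanner records line ends under an assumed starting quote state, so a chunk can be processed before the previous chunk's final state is known, and rescanned under the other assumption if the guess turns out wrong. The quote handling is deliberately simplistic (no escaped quotes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Scan a chunk for line ends that fall outside quoted sections, given an
 * assumed starting quote state.  Returns the number of line ends found and
 * reports the quote state at the end of the chunk, so the next chunk's
 * speculation can be validated.  Names and conventions are invented.
 */
static size_t
find_line_ends(const char *buf, size_t len, bool start_in_quote,
               size_t *ends, size_t max_ends, bool *end_in_quote)
{
    bool        in_quote = start_in_quote;
    size_t      n = 0;

    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            in_quote = !in_quote;   /* simplistic: ignores escaped quotes */
        else if (buf[i] == '\n' && !in_quote && n < max_ends)
            ends[n++] = i;
    }
    *end_in_quote = in_quote;
    return n;
}
```

If the previous chunk ends in a different quote state than was assumed, the chunk is simply rescanned with the other starting state, which is the cost of a wrong speculation.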
Re: Parallel copy
On Tue, 18 Feb 2020 at 15:21, Amit Kapila wrote: > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma wrote: > > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila wrote: > > > This is something similar to what I had also in mind for this idea. I > > > had thought of handing over complete chunk (64K or whatever we > > > decide). The one thing that slightly bothers me is that we will add > > > some additional overhead of copying to and from shared memory which > > > was earlier from local process memory. And, the tokenization (finding > > > line boundaries) would be serial. I think that tokenization should be > > > a small part of the overall work we do during the copy operation, but > > > will do some measurements to ascertain the same. > > > > I don't think any extra copying is needed. > > > > I am talking about access to shared memory instead of the process > local memory. I understand that an extra copy won't be required. > > > The reader can directly > > fread()/pq_copymsgbytes() into shared memory, and the workers can run > > CopyReadLineText() inner loop directly off of the buffer in shared memory. > > > > I am slightly confused here. AFAIU, the for(;;) loop in > CopyReadLineText is about finding the line endings which we thought > that the reader process will do. Indeed, I somehow misread the code while scanning over it. So CopyReadLineText currently copies data from cstate->raw_buf to the StringInfo in cstate->line_buf. In parallel mode it would copy it from the shared data buffer to local line_buf until it hits the line end found by the data reader. The amount of copying done is still exactly the same as it is now. Regards, Ants Aasma
Re: Parallel copy
On Tue, 18 Feb 2020 at 12:20, Amit Kapila wrote: > This is something similar to what I had also in mind for this idea. I > had thought of handing over complete chunk (64K or whatever we > decide). The one thing that slightly bothers me is that we will add > some additional overhead of copying to and from shared memory which > was earlier from local process memory. And, the tokenization (finding > line boundaries) would be serial. I think that tokenization should be > a small part of the overall work we do during the copy operation, but > will do some measurements to ascertain the same. I don't think any extra copying is needed. The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run CopyReadLineText() inner loop directly off of the buffer in shared memory. For serial performance of tokenization into lines, I really think a SIMD based approach will be fast enough for quite some time. I hacked up the code in the simdcsv project to only tokenize on line endings and it was able to tokenize a CSV file with short lines at 8+ GB/s. There are going to be many other bottlenecks before this one starts limiting. Patch attached if you'd like to try that out. 
Regards, Ants Aasma diff --git a/src/main.cpp b/src/main.cpp index 9d33a85..2cf775c 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -185,7 +185,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) { #endif simd_input in = fill_input(buf+internal_idx); uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote); -uint64_t sep = cmp_mask_against_input(in, ','); #ifdef CRLF uint64_t cr = cmp_mask_against_input(in, 0x0d); uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end; @@ -195,7 +194,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) { #else uint64_t end = cmp_mask_against_input(in, 0x0a); #endif -fields[b] = (end | sep) & ~quote_mask; +fields[b] = (end) & ~quote_mask; } for(size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++){ size_t internal_idx = 64 * b + idx; @@ -211,7 +210,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) { #endif simd_input in = fill_input(buf+idx); uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote); - uint64_t sep = cmp_mask_against_input(in, ','); #ifdef CRLF uint64_t cr = cmp_mask_against_input(in, 0x0d); uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end; @@ -226,7 +224,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) { // then outside the quotes with LF so it's OK to "and off" // the quoted bits here. Some other quote convention would // need to be thought about carefully - uint64_t field_sep = (end | sep) & ~quote_mask; + uint64_t field_sep = (end) & ~quote_mask; flatten_bits(base_ptr, base, idx, field_sep); } #undef SIMDCSV_BUFFERSIZE
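For readers without the SIMD background, the byte-matching kernel that the patch above strips down (cmp_mask_against_input producing one bit per input byte) can be illustrated with portable 64-bit SWAR instead of SSE/AVX intrinsics. This sketch is not part of the patch; it handles 8 bytes per step where the real simdcsv code handles 64, and the asserted constant in the usage below assumes a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Produce a 64-bit mask with the high bit of each byte lane set exactly
 * where the corresponding input byte equals '\n'.  The arithmetic is
 * exact: per-byte sums below never carry across lanes.
 */
static uint64_t
newline_mask64(const char *buf)   /* buf must have at least 8 bytes */
{
    const uint64_t ones7f = 0x7F7F7F7F7F7F7F7FULL;
    uint64_t    pat = 0x0101010101010101ULL * (uint8_t) '\n';
    uint64_t    v, x, t;

    memcpy(&v, buf, sizeof(v));
    x = v ^ pat;                /* bytes equal to '\n' become zero */
    t = (x & ones7f) + ones7f;  /* high bit set iff low 7 bits nonzero */
    return ~(t | x | ones7f);   /* high bit set iff the whole byte was zero */
}
```

A scalar loop over these masks plus a bit-scan per set bit is the essence of what the hacked-up simdcsv tokenizer does at 8+ GB/s with wider registers.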
Re: Parallel copy
On Tue, 18 Feb 2020 at 04:40, Thomas Munro wrote: > +1. That sort of two-queue scheme is exactly how I sketched out a > multi-consumer queue for a hypothetical Parallel Scatter node. It > probably gets a bit trickier when the payload has to be broken up into > fragments to wrap around the "data" buffer N times. At least for copy it should be easy enough - it already has to handle reading data block by block. If the worker updates its position while doing so, the reader can wrap around the data buffer. There will be no parallelism while one worker is buffering up a line larger than the data buffer, but that doesn't seem like a major issue. Once the line is buffered and the worker begins inserting it, the next worker can start buffering the next tuple. Regards, Ants Aasma
Re: Parallel copy
On Sat, 15 Feb 2020 at 14:32, Amit Kapila wrote: > Good point and I agree with you that having a single process would > avoid any such stuff. However, I will think some more on it and if > you/anyone else gets some idea on how to deal with this in a > multi-worker system (where we can allow each worker to read and > process the chunk) then feel free to share your thoughts. I think having a single process handle splitting the input into tuples makes the most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task. My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there and writes those to the secondary buffer. Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk. Buffer size, maximum and minimum chunk size could be tunable. Ideally the buffers would be at least big enough to absorb one of the workers getting scheduled out for a timeslice, which could be up to tens of megabytes. Regards, Ants Aasma [1] https://github.com/geofflangdale/simdcsv/
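A single-process toy model of the two-ring-buffer handoff described above might look like this. All names and sizes are invented for illustration; a real implementation would place the structure in shared memory and protect the read/write positions with atomics or latches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_SZ   1024          /* data ring size */
#define STARTS_SZ 128           /* tuple-start ring size */

typedef struct CopyRings
{
    char        data[DATA_SZ];
    uint64_t    data_write;     /* bytes produced by the reader */
    uint64_t    starts[STARTS_SZ];  /* tuple start offsets into data */
    uint64_t    starts_write;   /* entries produced by the reader */
    uint64_t    starts_read;    /* entries claimed by workers */
} CopyRings;

/* Reader side: append one line's bytes and publish its start offset. */
static void
reader_push_line(CopyRings *r, const char *line, size_t len)
{
    r->starts[r->starts_write % STARTS_SZ] = r->data_write;
    r->starts_write++;
    for (size_t i = 0; i < len; i++)
        r->data[(r->data_write + i) % DATA_SZ] = line[i];
    r->data_write += len;
}

/* Worker side: claim up to chunk_size tuple starts, return how many. */
static int
worker_claim_chunk(CopyRings *r, int chunk_size, uint64_t *out)
{
    int         n = 0;

    while (n < chunk_size && r->starts_read < r->starts_write)
        out[n++] = r->starts[r->starts_read++ % STARTS_SZ];
    return n;
}
```

The missing pieces relative to the description above are precisely the shared-memory concurrency control and the per-worker "keep this data around" position that stops the reader from overwriting unconsumed data.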
Re: Do we need to handle orphaned prepared transactions in the server?
On Wed, 22 Jan 2020 at 09:02, Hamid Akhtar wrote: > > At this stage, I'm not sure of the scale of changes this will require, > however, I wanted to get an understanding and consensus on whether (a) this > is something we should work on, and (b) whether an approach to implementing a > timeout makes sense. > > Please feel free to share your thoughts here. The intended use case of two-phase transactions is ensuring atomic durability of transactions across multiple database systems. This necessarily means that there needs to be a failure-tolerant agent that ensures there is consensus about the status of the transaction and then executes that consensus across all systems. In other words, there needs to be a transaction manager for prepared transactions to actually fulfil their purpose. Therefore I think that unilaterally timing out prepared transactions is just shifting the consequences of a broken client from availability to durability. But if durability was never a concern, why is the client even using prepared transactions? Citing the documentation: > PREPARE TRANSACTION is not intended for use in applications or interactive > sessions. Its purpose is to allow an external transaction manager to perform > atomic global transactions across multiple databases or other transactional > resources. Unless you're writing a transaction manager, you probably > shouldn't be using PREPARE TRANSACTION. Regards, Ants Aasma
Re: Remove size limitations of vacuums dead_tuples array
On Thu, 10 Oct 2019 at 17:05, Tomas Vondra wrote: > There already was an attempt to make this improvement, see [1]. There was > a fairly long discussion about how to best do that (using other data > structure, not just a simple array). It kinda died about a year ago, but > I suppose there's a lot of relevant info in that thread. > > [1] > https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com Thanks for the pointer, wow that's a long thread. For some reason it did not consider lifting the INT_MAX tuples/12GB limitation. I'll see if I can pick up where that thread left off and push it along. Regards, Ants Aasma Web: https://www.cybertec-postgresql.com
Remove size limitations of vacuums dead_tuples array
When dealing with a case where a 2TB table had 3 billion dead tuples I discovered that vacuum currently can't make use of more than 1GB of maintenance_work_mem - 179M tuples. This caused excessive amounts of index scanning even though there was plenty of memory available. I didn't see any good reason for having this limit, so here is a patch that makes use of MemoryContextAllocHuge, and converts the array indexing to use size_t to lift a second limit at 12GB. One potential problem with allowing larger arrays is that bsearch might no longer be the best way of determining if a ctid was marked dead. It might pay off to convert the dead tuples array to a hash table to avoid O(n log n) runtime when scanning indexes. I haven't done any profiling yet to see how big of a problem this is. A second issue I noticed is that the dead_tuples array is always allocated at the maximum allowed size, unless the table can't possibly have that many tuples. It may make sense to allocate it based on the estimated number of dead tuples and resize if needed. Regards, Ants Aasma Web: https://www.cybertec-postgresql.com From 6101b360ea85a66aba093f98a83ae335983aa4a5 Mon Sep 17 00:00:00 2001 From: Ants Aasma Date: Wed, 2 Oct 2019 20:11:20 +0300 Subject: [PATCH] Allow vacuum to use more than 1GB of memory Use huge allocation for vacuum dead tuples list and lift the 1GB limitation that caps maximum number of dead tuples to approximately 179M rows. Now that huge allocations are supported INT_MAX limitation of array indexing can plausibly be hit (at maintenance_work_mem 12GB). Use size_t to index the dead tuples array.
--- src/backend/access/heap/vacuumlazy.c | 34 +--- 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index a3c4a1df3b4..612b2f51cd7 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -130,8 +130,8 @@ typedef struct LVRelStats BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */ /* List of TIDs of tuples we intend to delete */ /* NB: this list is ordered by TID address */ - int num_dead_tuples; /* current # of entries */ - int max_dead_tuples; /* # slots allocated in array */ + size_t num_dead_tuples; /* current # of entries */ + size_t max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; TransactionId latestRemovedXid; @@ -161,8 +161,8 @@ static void lazy_vacuum_index(Relation indrel, static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); -static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, - int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); +static size_t lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, + size_t tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer); static bool should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats); static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats); @@ -1525,7 +1525,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) { - int tupindex; + size_t tupindex; int npages; PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; @@ -1571,7 +1571,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) } ereport(elevel, - (errmsg("\"%s\": removed %d row versions in %d pages", + (errmsg("\"%s\": removed %zu row versions in %d pages", RelationGetRelationName(onerel), tupindex, npages), 
errdetail_internal("%s", pg_rusage_show(&ru0)))); @@ -1587,9 +1587,9 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) * tuple for this page. We assume the rest follow sequentially. * The return value is the first tupindex after the tuples of this page. */ -static int +static size_t lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, - int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer) + size_t tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer) { Page page = BufferGetPage(buffer); OffsetNumber unused[MaxOffsetNumber]; @@ -1762,7 +1762,7 @@ lazy_vacuum_index(Relation indrel, lazy_tid_reaped, (void *) vacrelstats); ereport(elevel, - (errmsg("scanned index \"%s\" to remove %d row versions", + (errmsg("scanned index \"%s\" to remove %zu row versions", RelationGetRelationName(indrel), vacrelstats->num_dead_tuples), errdetail_internal("%s", pg_rusage_show(&ru0)))); @@ -2141,7 +2141,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks) { - long maxtuples
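The bsearch-versus-hash-table trade-off raised in the message above can be illustrated with TIDs packed into uint64 values. This is not vacuum's actual code: it only contrasts an O(log n)-per-probe lookup over a sorted array with an O(1) open-addressing hash probe, which trades extra memory for cheaper index-scan callbacks. Zero is used as the empty-slot marker, so a real version would need a different encoding for TID (0,0).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_u64(const void *a, const void *b)
{
    uint64_t    x = *(const uint64_t *) a;
    uint64_t    y = *(const uint64_t *) b;

    return (x > y) - (x < y);
}

/* bsearch flavour: O(log n) per probe over a sorted dead-TID array */
static int
tid_reaped_bsearch(const uint64_t *dead, size_t n, uint64_t tid)
{
    return bsearch(&tid, dead, n, sizeof(uint64_t), cmp_u64) != NULL;
}

/* hash flavour: open addressing, table size a power of two, 0 = empty */
typedef struct TidHash
{
    uint64_t   *slots;
    size_t      mask;           /* table size - 1 */
} TidHash;

static uint64_t
hash_u64(uint64_t x)
{
    x *= 0x9E3779B97F4A7C15ULL; /* Fibonacci hashing multiplier */
    return x ^ (x >> 32);
}

static void
tidhash_insert(TidHash *h, uint64_t tid)
{
    size_t      i = hash_u64(tid) & h->mask;

    while (h->slots[i] != 0 && h->slots[i] != tid)
        i = (i + 1) & h->mask;
    h->slots[i] = tid;
}

static int
tid_reaped_hash(const TidHash *h, uint64_t tid)
{
    size_t      i = hash_u64(tid) & h->mask;

    while (h->slots[i] != 0)
    {
        if (h->slots[i] == tid)
            return 1;
        i = (i + 1) & h->mask;
    }
    return 0;
}
```

With billions of dead tuples the hash table roughly doubles memory per entry at typical fill factors, which is exactly the kind of trade-off profiling would need to settle.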
Re: Transparent Data Encryption (TDE) and encrypted files
On Mon, 7 Oct 2019 at 18:02, Bruce Momjian wrote: > Well, to do encryption properly, there is the requirement of the nonce. > If you ever rewrite a bit, you technically have to have a new nonce. > For WAL, since it is append-only, you can use the WAL file name. For > heap/index files, we change the LSN on every rewrite (with > wal_log_hints=on), and we never use the same LSN for writing multiple > relations, so LSN+page-offset is a sufficient nonce. > > For clog, it is not append-only, and bytes are rewritten (from zero to > non-zero), so there would have to be a new nonce for every clog file > write to the file system. We can store the nonce in a separate file, > but the clog contents and nonce would have to be always synchronized or > the file could not be properly read. Basically every file we want to > encrypt, needs this kind of study. > Yes. That is the reason why our current version doesn't encrypt SLRUs. There is some security in encrypting without a nonce when considering an attack vector that only sees one version of the encrypted page. But I think to make headway on this we need to figure out if the TDE feature is useful without SLRU encryption (I think yes), and how hard it would be to properly encrypt SLRUs. Would the solution be acceptable for inclusion? I can think of 3 options:
a) A separate nonce storage. Seems pretty bad complexity-wise. New data structures would need to be created, and SLRU writes would need to be WAL logged with a full page image.
b) Inline nonces: the number of items per SLRU page is variable depending on whether encryption is enabled or not.
c) Inline nonces: we reserve a header structure on all SLRU pages; pg_upgrade needs to rewrite persistent SLRUs.
None of the options seem great, but c) has the benefit of also carving out the space for SLRU checksums.
> As I also said to Stephen, the people who are discussing this here > should *really really really* be looking at the Cybertec patch instead > of trying to invent everything from scratch - unless that patch has, > Someone from Cybertec is on the voice calls we have, and is actively > involved. > As far as I can tell no one from us is on the call. I personally missed the invitation when it was sent out. I would gladly share our learnings; a lot of what I see here is retreading what we already went through with our patch. However, I think that at the very least the conclusions, problems to work on, and WIP patch should be shared on list. It's hard for anybody outside to have any input if there are no concrete design proposals or code to review. Moreover, I think e-mail is a much better medium for having a reasoned discussion about technical design decisions. > > In other words: maybe I'm wrong here, but it looks to me like we're > laboriously reinventing the wheel when we could be working on > improving the working prototype. > The work being done is building on that prototype. > We would like to help on that front. Regards, Ants Aasma Web: https://www.cybertec-postgresql.com
Re: Enable data checksums by default
On Thu, Mar 28, 2019 at 10:38 AM Christoph Berg wrote: > Re: Ants Aasma 2019-03-27 < > ca+csw_twxdrzdn2xsszbxej63dez+f6_hs3qf7hmxfenxsq...@mail.gmail.com> > > Can you try with postgres compiled with CFLAGS="-O2 -march=native"? > There's > > a bit of low hanging fruit there to use a runtime CPU check to pick a > > better optimized checksum function. > > Frankly, no. This is with the apt.pg.o packages which are supposed to > be usable by everyone. If there is a better per-CPU checksum function, > PG should pick it at runtime. Special compiler flags are a no-go here. > I went ahead and tested it on the count(*) test, same settings as upthread. Median of 5 runs of 20txs on Intel i5-2500k @ 4GHz. No checksum: 344ms Checksums: 384ms (+12%) No checksum march=native: 344ms Checksums march=native: 369ms (+7%) The checksum code was written to be easily auto-vectorized by the compiler. So if we just compile the same function with different compiler flags and pick between them at runtime the overhead can be approximately halved. Not saying that this needs to be done before enabling checksums by default, just that when considering overhead, we can foresee it being much lower in future versions. Regards, Ants Aasma
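The runtime-dispatch idea mentioned above can be sketched like this: compile the same auto-vectorizable loop twice with different target options and pick an implementation once at startup. The checksum loop below is a stand-in with made-up constants, not PostgreSQL's actual pg_checksum_block(); the dispatch pattern is the point. Each of the eight lanes carries its own dependency chain, so the lanes can be computed in parallel by wider vector units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn) (const uint32_t *data, size_t nwords);

static uint32_t
checksum_generic(const uint32_t *data, size_t nwords)
{
    uint32_t    sums[8] = {0};

    /* eight independent lanes; the compiler may vectorize across them */
    for (size_t i = 0; i < nwords; i++)
        sums[i % 8] = (sums[i % 8] ^ data[i]) * 0x5BD1E995;
    return sums[0] ^ sums[1] ^ sums[2] ^ sums[3] ^
           sums[4] ^ sums[5] ^ sums[6] ^ sums[7];
}

#if defined(__x86_64__) && defined(__GNUC__)
/* identical source; the target attribute lets gcc/clang emit AVX2 code */
__attribute__((target("avx2")))
static uint32_t
checksum_avx2(const uint32_t *data, size_t nwords)
{
    uint32_t    sums[8] = {0};

    for (size_t i = 0; i < nwords; i++)
        sums[i % 8] = (sums[i % 8] ^ data[i]) * 0x5BD1E995;
    return sums[0] ^ sums[1] ^ sums[2] ^ sums[3] ^
           sums[4] ^ sums[5] ^ sums[6] ^ sums[7];
}
#endif

/* Resolve once at startup; both variants compute the same function. */
static checksum_fn
resolve_checksum(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return checksum_avx2;
#endif
    return checksum_generic;
}
```

This way a generic distro binary still runs everywhere, but CPUs with the wider instructions get the better-optimized loop without special compiler flags for the whole build.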
Re: Enable data checksums by default
On Wed, Mar 27, 2019, 15:57 Christoph Berg wrote: > Re: To Tom Lane 2019-03-26 <20190326151446.gg3...@msg.df7cb.de> > > I run a benchmark with checksums disabled/enabled. shared_buffers is > > 512kB to make sure almost any read will fetch the page from the OS > > cache; scale factor is 50 (~750MB) to make sure the whole cluster fits > > into RAM. > [...] > > So the cost is 5% in this very contrived case. In almost any other > > setting, the cost would be lower, I'd think. > > (That was on 12devel, btw.) > > That was about the most extreme OLTP read-only workload. After > thinking about it some more, I realized that exercising large seqscans > might be an even better way to test it because of less per-query > overhead. > > Same setup again, shared_buffers = 16 (128kB), jit = off, > max_parallel_workers_per_gather = 0: > > select count(bid) from pgbench_accounts; > > no checksums: ~456ms > with checksums: ~489ms > > 456.0/489 = 0.9325 > > The cost of checksums is about 6.75% here. > Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's a bit of low hanging fruit there to use a runtime CPU check to pick a better optimized checksum function. Regards, Ants Aasma >
Re: CPU costs of random_zipfian in pgbench
On Sun, Feb 17, 2019 at 10:52 AM Fabien COELHO wrote: > > I'm trying to use random_zipfian() for benchmarking of skewed data sets, > > and I ran head-first into an issue with rather excessive CPU costs. > > [...] This happens because generalizedHarmonicNumber() does this: > > > > for (i = n; i > 1; i--) > > ans += pow(i, -s); > > > > where n happens to be 10 (range passed to random_zipfian), so > > the loop takes quite a bit of time. > > If you find a better formula for the harmonic number, you are welcome > and probably get your name on it:-) > There are pretty good approximations for s > 1.0 using the Riemann zeta function, and Euler derived a formula for the s = 1 case. I also noticed that i is int in this function, but n is int64. That seems like an oversight. Regards, Ants Aasma
Re: WAL insert delay settings
On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost wrote: > > Rate limit in front of WAL insertion would allow for allocating the > > throughput between foreground and background tasks, and even allow for > > priority inheritance to alleviate priority inversion due to locks. > > I'm not sure how much we have to worry about priority inversion here as > you need to have conflicts for that and if there's actually a conflict, > then it seems like we should just press on. > > That is, a non-concurrent REINDEX is going to prevent an UPDATE from > modifying anything in the table, which if the UPDATE is a higher > priority than the REINDEX would be priority inversion, but that doesn't > mean we should slow down the REINDEX to allow the UPDATE to happen > because the UPDATE simply can't happen until the REINDEX is complete. > Now, we might slow down the REINDEX because there's UPDATEs against > *other* tables that aren't conflicting and we want those UPDATEs to be > prioritized over the REINDEX but then that isn't priority inversion. > I was thinking along the lines that each backend gets a budget of WAL insertion credits per time interval, and when the credits run out the process sleeps. With this type of scheme it would be reasonably straightforward to let UPDATEs being blocked by REINDEX to transfer their WAL insertion budgets to the REINDEX, making it get a larger piece of the total throughput pie. Regards, Ants Aasma
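A toy model of the per-backend credit scheme with donation might look as follows. All names and units are invented for illustration; a real version would live in shared memory and integrate with the lock manager to find the blocker.

```c
#include <assert.h>
#include <stdint.h>

typedef struct WalBudget
{
    int64_t     credits;        /* bytes this backend may insert now */
    int64_t     refill;         /* bytes granted per time interval */
} WalBudget;

/* Returns 1 if the insert may proceed, 0 if the backend should sleep. */
static int
wal_budget_consume(WalBudget *b, int64_t bytes)
{
    if (b->credits < bytes)
        return 0;
    b->credits -= bytes;
    return 1;
}

/* Called once per interval by a timer or the next insert attempt. */
static void
wal_budget_refill(WalBudget *b)
{
    b->credits = b->refill;
}

/* A waiter blocked on the holder's lock donates its remaining budget. */
static void
wal_budget_donate(WalBudget *from, WalBudget *to)
{
    to->credits += from->credits;
    from->credits = 0;
}
```

The donation step is the priority-inheritance analogue: a blocked UPDATE hands its share of the throughput pie to the REINDEX holding the lock, so the REINDEX finishes sooner instead of being throttled while others wait on it.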
Re: WAL insert delay settings
On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2019-02-20 18:46:09 -0500, Stephen Frost wrote: > > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote: > > > > On 2/20/19 10:43 PM, Stephen Frost wrote: > > > > > Just to share a few additional thoughts after pondering this for a > > > > > while, but the comment Andres made up-thread really struck a > chord- we > > > > > don't necessairly want to throttle anything, what we'd really > rather do > > > > > is *prioritize* things, whereby foreground work (regular queries > and > > > > > such) have a higher priority than background/bulk work (VACUUM, > REINDEX, > > > > > etc) but otherwise we use the system to its full capacity. We > don't > > > > > actually want to throttle a VACUUM run any more than a CREATE > INDEX, we > > > > > just don't want those to hurt the performance of regular queries > that > > > > > are happening. > > > > > > > > I think you're forgetting the motivation of this very patch was to > > > > prevent replication lag caused by a command generating large amounts > of > > > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing > to > > > > do with prioritization or foreground/background split. > > > > > > > > I'm not arguing against ability to prioritize stuff, but I disagree > it > > > > somehow replaces throttling. > > > > > > Why is replication lag an issue though? I would contend it's an issue > > > because with sync replication, it makes foreground processes wait, and > > > with async replication, it makes the actions of foreground processes > > > show up late on the replicas. > > > > I think reaching the bandwidth limit of either the replication stream, > > or of the startup process is actually more common than these. And for > > that prioritization doesn't help, unless it somehow reduces the total > > amount of WAL. 
> > The issue with hitting those bandwidth limits is that you end up with > queues outside of your control and therefore are unable to prioritize > the data going through them. I agree, that's an issue and it might be > necessary to ask the admin to provide what the bandwidth limit is, so > that we could then avoid running into issues with downstream queues that > are outside of our control causing unexpected/unacceptable lag. > If there is a global rate limit on WAL throughput it could be adjusted by a control loop, measuring replication queue length and/or apply delay. I don't see any sane way how one would tune a per command rate limit, or even worse, a cost-delay parameter. It would have the same problems as work_mem settings. Rate limit in front of WAL insertion would allow for allocating the throughput between foreground and background tasks, and even allow for priority inheritance to alleviate priority inversion due to locks. There is also an implicit assumption here that a maintenance command is a background task and a normal DML query is a foreground task. This is not true for all cases, users may want to throttle transactions doing lots of DML to keep synchronous commit latencies for smaller transactions within reasonable limits. As a wild idea for how to handle the throttling, what if when all our wal insertion credits are used up XLogInsert() sets InterruptPending and the actual sleep is done inside ProcessInterrupts()? Regards, Ants Aasma
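The control-loop idea for the global limit can be sketched as a simple proportional controller; the gain and the clamping bounds here are arbitrary illustration values, not a proposal for actual defaults.

```c
#include <assert.h>

/*
 * Adjust the global WAL rate limit from observed replication lag:
 * lag above target shrinks the limit, lag below target grows it,
 * clamped to an admin-provided range.
 */
static double
adjust_wal_rate(double rate, double lag_s, double target_lag_s,
                double min_rate, double max_rate)
{
    double      gain = 0.5;     /* illustration value */
    double      error = (target_lag_s - lag_s) / target_lag_s;
    double      next = rate * (1.0 + gain * error);

    if (next < min_rate)
        next = min_rate;
    if (next > max_rate)
        next = max_rate;
    return next;
}
```

Running this once per measurement interval keeps the limit tracking whatever the downstream bandwidth actually is, instead of requiring the admin to guess a static per-command setting.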
Re: Checkpoint start logging is done inside critical section
On Thu, Oct 18, 2018 at 9:02 AM Amit Kapila wrote: > > On Thu, Oct 18, 2018 at 10:27 AM Andres Freund wrote: > > (that's why we mark the ctx as being ok with that). > > > > Yeah, as the palloc for log message would be called in an ErrorContext > where it is safe to do the allocation, so ideally this shouldn't be a > problem. So, it seems to me that this is not a problem, Ants, do you > see any problem in any particular scenario or was this based on > theoretical analysis? This was purely theoretical, as also evidenced by the lack of complaints even though the code has been like that for a very long time. I was actually mostly worried about extension code run by a logging hook causing the panic. Regards, Ants Aasma
Checkpoint start logging is done inside critical section
The LogCheckpointStart() call inside CreateCheckPoint() is done while inside a critical section. The elog call could trigger errors due to memory allocations or from a logging hook, resulting in a panic. It seems better to postpone the logging until after the critical section is done. It's only a few lwlock acquisitions away and shouldn't make any material difference. Patch to do so is attached. Regards, Ants Aasma

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffc..faa9690e48 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8907,15 +8907,6 @@ CreateCheckPoint(int flags)
 	XLogCtl->RedoRecPtr = checkPoint.redo;
 	SpinLockRelease(&XLogCtl->info_lck);
 
-	/*
-	 * If enabled, log checkpoint start. We postpone this until now so as not
-	 * to log anything if we decided to skip the checkpoint.
-	 */
-	if (log_checkpoints)
-		LogCheckpointStart(flags, false);
-
-	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
-
 	/*
 	 * Get the other info we need for the checkpoint record.
 	 *
@@ -8962,6 +8953,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * If enabled, log checkpoint start. We postpone this until now so as not
+	 * to log anything if we decided to skip the checkpoint.
+	 */
+	if (log_checkpoints)
+		LogCheckpointStart(flags, false);
+
+	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
+
 	/*
 	 * In some cases there are groups of actions that must all occur on one
 	 * side or the other of a checkpoint record. Before flushing the
Re: Skylake-S warning
On Thu, Oct 4, 2018 at 9:50 AM Adrien Nayrat wrote: > > On 10/3/18 11:29 PM, Daniel Wood wrote: > > If running benchmarks or you are a customer which is currently impacted by > > GetSnapshotData() on high end multisocket systems be wary of Skylake-S. > > > > > > Performance differences of nearly 2X can be seen on select only pgbench due > > to > > nothing else but unlucky choices for max_connections. Scale 1000, 192 local > > clients on a 2 socket 48 core Skylake-S(Xeon Platinum 8175M @ 2.50-GHz) > > system. > > pgbench -S > > Could it be related to : > https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net > ? Unlikely. I understood from Daniel's email that profiling shows a different hot-spot. In the cited .NET issue the problem was mostly due to issuing PAUSE in a loop without attempting to grab the lock. In PostgreSQL it's called only once per retry attempt. Regards, Ants Aasma -- PostgreSQL Senior Consultant www.cybertec-postgresql.com Austria (HQ), Wiener Neustadt | Switzerland, Zürich | Estonia, Tallinn | Uruguay, Montevideo Facebook: www.fb.com/cybertec.postgresql Twitter: www.twitter.com/PostgresSupport
Re: Recovery performance of standby for multiple concurrent truncates on large tables
On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk wrote: > Since in the current implementation, the replay of each TRUNCATE/DROP > TABLE scans the whole shared buffer. > > One approach (though idea is not really developed yet) is to improve the > recovery by delaying the shared buffer scan and invalidation > (DropRelFileNodeBuffers) and to put it after the next checkpoint (after > failover completion). The replay of TRUNCATE/DROP TABLE just make the > checkpointer process remember what relations should be invalidated in the > shared buffer during subsequent checkpoint. The checkpointer then scans the > shared buffer only once to invalidate the buffers of relations that was > dropped and truncated. > How about using the background writer for this? It seems to me that the main reason to invalidate buffers would be to free them up for buffer allocation, which is precisely the task of background writer. When adding a filenode to be invalidated, take note of bgwriter position and add it to a queue. When bgwriter is advancing, check each buffer tag against a hash table of filenodes being invalidated. When background writer has completed a loop it can remove the invalidated filenode. When bgwriter falls behind the clock sweep and there are filenodes to invalidate it should run the invalidation scan instead of skipping ahead. If there are already too many filenodes being invalidated, then whoever is trying to add a new one gets to run the invalidation scan until something can be evicted. -- Ants Aasma Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com/
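The data-structure side of this idea might look as follows; a plain array stands in for the hash table for brevity, all names are invented, and real code would live in shared memory with appropriate locking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING_RELS 64

typedef struct PendingInval
{
    uint64_t    rels[MAX_PENDING_RELS]; /* packed relfilenode ids */
    int         nrels;
} PendingInval;

/* Replay of TRUNCATE/DROP registers the filenode instead of scanning. */
static int
pending_add(PendingInval *p, uint64_t rel)
{
    if (p->nrels >= MAX_PENDING_RELS)
        return 0;               /* full: caller runs the scan itself */
    p->rels[p->nrels++] = rel;
    return 1;
}

/* Probed per buffer by the bgwriter scan: drop this buffer? */
static int
buffer_should_invalidate(const PendingInval *p, uint64_t buf_rel)
{
    for (int i = 0; i < p->nrels; i++)
        if (p->rels[i] == buf_rel)
            return 1;
    return 0;
}

/* Called when the bgwriter has completed a full clock-sweep loop. */
static void
pending_remove(PendingInval *p, uint64_t rel)
{
    for (int i = 0; i < p->nrels; i++)
        if (p->rels[i] == rel)
        {
            p->rels[i] = p->rels[--p->nrels];
            return;
        }
}
```

The full-table fallback in pending_add() mirrors the last point above: when too many filenodes are pending, the process registering a new one runs the invalidation scan itself until an entry can be retired.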
Re: WAL prefetch
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra wrote: > Right. My point is that while spawning bgworkers probably helps, I don't > expect it to be enough to fill the I/O queues on modern storage systems. > Even if you start say 16 prefetch bgworkers, that's not going to be > enough for large arrays or SSDs. Those typically need way more than 16 > requests in the queue. > > Consider for example [1] from 2014 where Merlin reported how S3500 > (Intel SATA SSD) behaves with different effective_io_concurrency values: > > [1] > > https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com > > Clearly, you need to prefetch 32/64 blocks or so. Consider you may have > multiple such devices in a single RAID array, and that this device is > from 2014 (and newer flash devices likely need even deeper queues).' > For reference, a typical datacenter SSD needs a queue depth of 128 to saturate a single device. [1] Multiply that appropriately for RAID arrays. Regards, Ants Aasma [1] https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability/3
Re: All Taxi Services need Index Clustered Heap Append
On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski <m...@komzpa.net> wrote:
>> This approach mixes well with hash partitioning. It would be neat indeed
>> if PostgreSQL could do something equivalent on its own, and the pluggable
>> storage work being done could enable index organized tables that would
>> help. But you probably need something right now.
>
> Fixing glaring issues (no vacuum and thus no Index-Only Scan on
> append-only tables, vacuum processing all of the eternity of btree) by 11
> will get most of the spike-nails out of the microservice code, and we can
> probably live with them until 11 gets to RDS.
>
> I also don't see why pluggable storage is a must for clustered writes.
> Postgres does have a mechanism for selecting the next page to write a
> tuple to; right now it just looks at the FSM - but what if it also peeked
> at an existing index that already has enough data to route the tuple to
> the correct page on write?

The mechanism you outlined would likely work for your use case, but it has many issues that prevent it from being universally useful. Off the top of my head:

* One extra index descent per insertion (the I/O for this is necessary anyway, but the CPU work is duplicated).
* We don't currently track the amount of bloat. A mechanism that does this would need to be added.
* If the table hits the bloat limit there will be a sudden change in behavior. This is pretty nasty from an operations point of view.
* With your (id, ts) clustering and data coming in mostly ordered by timestamp, after the initial warmup each page will contain rows from a single id, but different ids are arbitrarily interleaved. This is better than the current state, but people might want an interleaving step bigger than 8kB to better utilize storage hardware.
* It seems that with a common (ts) clustering and the age of the timestamp coming from an exponential distribution, this will quickly bloat to the threshold and then insert data in a rather arbitrary order.
This is much worse than the default behavior.

At least in my opinion, these problems make it a special-case optimization that is hard to justify in core. A decent alternative would be a plugin mechanism for locating free space for a tuple, so you could write an extension that finds a suitable location for the row.

>> I guess I don't have to tell you that it looks like your needs have
>> outgrown what RDS works well with and you are in for a painful move
>> sooner or later.
>
> Painful move where to? If we just run a Postgres instance without RDS
> we'll get the pain of setting up Postgres and replication and backups and
> autofailover, with no visible gain except if we get some private /
> unaccepted patches applied to it. If we can get these things right
> upstream why would we want to switch?

EC2, for example. Mainly because I3 instances with ephemeral storage provide an order of magnitude or two of performance improvement while costing less. Being able to run custom extensions and patches if necessary is a nice bonus. Yes, setting up replication, autofailover and backups is extra work that you have to weigh against the benefits. But don't overestimate the effort - there are some pretty nice tools available that make a proper cluster relatively simple to set up.

> Per my colleagues, MySQL offers clustered indexes, and MySQL is available
> on RDS without the need for a "painful move": that is doable by writing
> to two locations for a day and then pointing readers to the new DB. But
> if we can instead do no move and be sure the issues are gone upstream
> before we hit the limit of the spike-nails we're running on currently,
> wouldn't that be better? :)

The move off of RDS is painful because getting data out of RDS involves either downtime or building an ad-hoc logical replication solution. You need to solve that regardless of where you move to.
Providing an out-of-the-box solution in core PostgreSQL would of course be best, but realistically you will be waiting at least two years to get it on RDS. In the meantime, either the buffer partition approach I described or a buffering microservice in front of PostgreSQL, as Aleksander recommended, should fix data locality for you. If you weren't running on RDS I would even propose using Redis as the buffer, with one key per driver and redis_fdw to make the data accessible from within PostgreSQL.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
Re: All Taxi Services need Index Clustered Heap Append
On Sat, Mar 3, 2018 at 4:53 PM, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 3 March 2018 at 05:30, Darafei "Komяpa" Praliaskouski <m...@komzpa.net>
> wrote:
>> Our options were:
>>
>> - partitioning. Not entirely trivial when your id is a uuid. To get
>> visible gains, we need to make sure each driver gets their own
>> partition. That would leave us with 50,000(+) tables, and rumor has it
>> that's what is done in some bigger taxi service, and relcache then eats
>> up all the RAM and the system OOMs.
>
> It's a good job someone invented HASH partitioning then.
>
> It would be interesting to hear how your benchmarks go using current
> master + the faster partition pruning patchset [1]. Currently, HASH
> partitioning does exist in master, just there's no partition pruning
> for the non-matching partitions, which is why you need [1].
>
> I think trying with something like 500-1000 partitions might be a good
> place to start.

I don't think that will actually help much. 1000 partitions means each partition gets data from ~50 vehicles. At 60 tuples per page, each page in the partitioned table will contain on average 1.2 interesting tuples. So you still have almost one page read per row.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
Re: All Taxi Services need Index Clustered Heap Append
On Fri, Mar 2, 2018 at 6:30 PM, Darafei "Komяpa" Praliaskouski <m...@komzpa.net> wrote:
> I gave this all some thought and it looks like it all could have not
> happened if Postgres was able to cluster heap insertions by an (id, ts)
> index. We're ok with synchronous_commit=off, so the amplified write won't
> immediately hit disk and can get cooled down in progress. Clustering
> doesn't require perfect sorting: we need to minimize the number of pages
> fetched; it's ok if the pages are not consecutive on disk.

Data locality is indeed the key here, specifically for non-cached data.

It is possible to manually implement some approximation of clustering at the SQL level with current PostgreSQL features. Insert incoming data into a new-data partition, and have a background job that swaps input over to a fresh partition and then inserts the data from the previous new-data partition into main storage, sorting it by vehicle in the process. If you do this every few minutes or so, you should be able to tune the system so that the new partition's data isn't even written to disk; you only have to pay the cost of double WAL for insertion and the CPU work to perform the move. This approach mixes well with hash partitioning.

It would be neat indeed if PostgreSQL could do something equivalent on its own, and the pluggable storage work being done could enable index organized tables that would help. But you probably need something right now.

I guess I don't have to tell you that it looks like your needs have outgrown what RDS works well with and you are in for a painful move sooner or later.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
Re: RTLD_GLOBAL (& JIT inlining)
On Mon, Feb 26, 2018 at 11:28 PM, Andres Freund <and...@anarazel.de> wrote: > So RTLD_LOCAL is out of the question, but I think we can get a good bit > of the benefit by either specifying -Wl,-Bsymbolic at shlib build time, > or RTLD_DEEPBIND at dlopen() time. Either leads to the opened shared > library effectively being put at the beginning of the search path, > therefore avoiding the issue that an earlier loaded shared library or > symbols from the main binary can accidentally overwrite things in the > shared library itself. Which incidentally also makes loading a bit > faster. I think this would also fix oracle_fdw crashing when postgres is compiled with --with-ldap. At least RTLD_DEEPBIND helped. [1] [1] https://www.postgresql.org/message-id/CA%2BCSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w%40mail.gmail.com Ants Aasma