Re: [HACKERS] make async slave to wait for lsn to be replayed
On Mon, Oct 30, 2017 at 7:25 PM, Ivan Kartyshov wrote:
> It sounds reasonable. I can offer the following version.
>
> WAIT LSN lsn_number;
> WAIT LSN lsn_number TIMEOUT delay;
> WAIT LSN lsn_number INFINITE;
> WAIT LSN lsn_number NOWAIT;
>
> WAIT [token] wait_value [option];
>
> token - [LSN, TIME | TIMESTAMP | CSN | XID]
> option - [TIMEOUT | INFINITE | NOWAIT]
>
> Ready to listen to your suggestions.

Making the interface more specific about the mechanism is not what I had in mind, quite the opposite. I would like to see the interface describe the desired effect of the wait.

Stepping back for a while: from what I understand, the reason we want to wait is to prevent observing database state going backwards. To limit the amount of waiting needed, we tell the database what we have observed. For example, "I just observed my transaction commit", or "the last time I observed state was X", and then have the database provide us with a snapshot that is causally dependent on those states. This does not give us linearizability; for that we still need at the very least serializable transactions on the standby. But it seems to provide a form of sequential consistency, which (if it can be proved to hold) makes reasoning about concurrency a lot nicer.

For lack of a better proposal I would like something along the lines of:

    WAIT FOR state_id[, state_id] [ OPTIONS (..)]

And to get the tokens maybe a function pg_last_commit_state(). Optionally, to provide read-to-read causality, pg_snapshot_state() could return for example replay_lsn at the start of the current transaction. This makes sure that things don't just appear and disappear when load balancing across many standby servers.

The WAIT FOR semantics is to ensure that the next snapshot is causally dependent on (happens after) each of the specified observed states. The state_id could simply be an LSN, or, to allow for future expansion, something like 'lsn:/1234'. An example extension would be to allow waiting on xids: on the master that would be just a ShareLock on the transactionid; on the standby it would wait for the commit or rollback record for that transaction to be replayed.

Robert made a good point that people will still rely on the token being an LSN, but perhaps they will be slightly less angry when we explicitly tell them that this might change in the future.

Regards,
Ants Aasma

[1] https://www.postgresql.org/docs/devel/static/functions-admin.html#functions-snapshot-synchronization

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Mon, Oct 23, 2017 at 12:29 PM, Ivan Kartyshov wrote:
> Ants Aasma wrote 2017-09-26 13:00:
>> Exposing this interface as WAITLSN will encode that visibility order
>> matches LSN order. This removes any chance of fixing for example
>> visibility order of async vs sync transactions. Just renaming it so
>> the token is an opaque commit visibility token that just happens to be
>> an LSN would still allow for progress in transaction management. For
>> example, making PostgreSQL distributed will likely want timestamp
>> and/or vector clock based visibility rules.
>
> I'm sorry I did not understand exactly what you meant.
> Please explain this in more detail.

Currently transactions on the master become visible when the xid is removed from the proc array. This depends on the order of acquiring ProcArrayLock, which can happen in a different order from inserting the commit record into the WAL. On the standby, by contrast, transactions become visible in the same order that commit records appear in the WAL. The difference can be quite large when transactions use different values for synchronous_commit.

Fixing this is not easy, but we may want to do it someday; IIRC the CSN threads contained more discussion on this topic. If we do fix it, it seems likely that what we need to wait on is not an LSN but some other kind of value. If the UI were something like "WAITVISIBILITY token", then we could later change the token to something other than an LSN.

Regards,
Ants Aasma
Re: [HACKERS] Discussion on missing optimizations
On Sun, Oct 8, 2017 at 7:24 PM, Tom Lane wrote:
> Petr Jelinek writes:
>> Okay, that makes sense, thanks for explanation. Your patch is the way to
>> go then.
>
> Hearing no further comment, pushed. Thanks for reviewing it.

The tautological col = col comparison is occasionally used as a planner "hint" to correct for row count overestimates. Not a great solution, but PostgreSQL doesn't really have a better way to guide the planner. Those queries will now have to do something else, like col = col + 0, which still works.

Regards,
Ants Aasma
Re: [HACKERS] JIT compiling - v4.0
On Wed, Oct 4, 2017 at 9:48 AM, Andres Freund wrote:
> Here's an updated version of the patchset. There's some substantial
> changes here, but it's still very obviously very far from committable as
> a whole. There's some helper commits that are simple and independent
> enough to be committable earlier on.

Looks pretty impressive already. I wanted to take it for a spin, but got errors about the following symbols being missing:

    LLVMOrcUnregisterPerf
    LLVMOrcRegisterGDB
    LLVMOrcRegisterPerf
    LLVMOrcGetSymbolAddressIn
    LLVMLinkModules2Needed

As far as I can tell these are not in mainline LLVM. Is there a branch or patchset of LLVM available somewhere that I need to use this?

Regards,
Ants Aasma
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Tue, Aug 15, 2017 at 5:00 AM, Craig Ringer wrote:
> On 22 March 2017 at 01:17, Robert Haas wrote:
>> On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro wrote:
>>> Maybe someone can think of a clever way for an extension to insert a
>>> wait for a user-supplied LSN *before* acquiring a snapshot so it can
>>> work for the higher levels, or maybe the hooks should go into core
>>> PostgreSQL so that the extension can exist as an external project not
>>> requiring a patched PostgreSQL installation, or maybe this should be
>>> done with new core syntax that extends transaction commands. Do other
>>> people have views on this?
>>
>> IMHO, trying to do this using a function-based interface is a really
>> bad idea for exactly the reasons you mention. I don't see why we'd
>> resist the idea of core syntax here; transactions are a core part of
>> PostgreSQL.
>>
>> There is, of course, the question of whether making LSNs such a
>> user-visible thing is a good idea in the first place, but that's a
>> separate question from the issue of what syntax for such a thing is best.
>
> (I know this is old, but):
>
> That ship sailed a long time ago unfortunately, they're all over
> pg_stat_replication and pg_replication_slots and so on. They're already
> routinely used for monitoring replication lag in bytes, waiting for a peer
> to catch up, etc.

(continuing the trend of resurrecting old topics)

Exposing this interface as WAITLSN will encode that visibility order matches LSN order. This removes any chance of fixing, for example, the visibility order of async vs sync transactions. Just renaming it so the token is an opaque commit visibility token that just happens to be an LSN would still allow for progress in transaction management. For example, making PostgreSQL distributed will likely want timestamp and/or vector clock based visibility rules.

Regards,
Ants Aasma
Re: [HACKERS] Hooks to track changed pages for backup purposes
On Thu, Aug 31, 2017 at 9:02 AM, Andrey Borodin wrote:
> When we have accumulated diff blocknumbers for most of segments we can
> significantly speed up method of WAL scanning. If we have blocknumbers for
> all segments we can skip WAL scanning at all.

Have you measured that the WAL scanning is actually a significant issue? As a quick experiment I hacked up pg_waldump to just dump block references to stdout in binary format. It scanned 2.8GB of WAL in 3.17 seconds, outputting 9.3M block refs per second. The WAL was generated with pgbench, synchronous commit off, using 4 cores for 10 minutes, making the ratio of work between generating WAL and parsing it about 750:1.

Regards,
Ants Aasma
Re: [HACKERS] emergency outage requiring database restart
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure wrote:
>>> Still getting checksum failures. Over the last 30 days, I see the
>>> following. Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction. So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out the filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do. Main challenge is getting hand compiled server to swap in
> so that libdir continues to work. Getting access to the server is
> difficult as is getting a maintenance window. I'll post back ASAP.

As a new datapoint, we just had a customer with an issue that I think might be related. The issue was reasonably repeatable by running a report on the standby system. It manifested itself first as "could not open relation" and/or "column is not in index" errors, followed a few minutes later by a PANIC from the startup process due to "specified item offset is too large", "invalid max offset number" or "page X of relation base/16384/1259 is uninitialized".

I took a look at the xlog dump and it was completely fine. For instance, in the "specified item offset is too large" case there was an INSERT_LEAF redo record inserting the preceding offset just a couple hundred kilobytes back. Restarting the server sometimes successfully applied the offending WAL, sometimes it failed with other corruption errors. The offending relations were always pg_class or pg_class_oid_index. Replacing plsh functions with dummy plpgsql functions made the problem go away; reintroducing the plsh functions made it reappear.

The only explanation I came up with that is consistent with the symptoms is that a page got thrown out of shared_buffers between the two xlog records referencing it (shared_buffers was the default 128MB), and was then read back by a backend process, where in the time between the FileSeek and FileRead calls in mdread a subprocess mucked with the fd's offset so that a different page than intended got read in. Or basically the same race condition, but on the write side. Maybe somebody else has a better imagination than me...

Regards,
Ants Aasma
Re: [HACKERS] Reducing pg_ctl's reaction time
On Wed, Jun 28, 2017 at 8:31 PM, Tom Lane wrote:
> Andres Freund writes:
>> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>>> However, it's certainly arguable that this is too much change for an
>>> optional post-beta patch.
>>
>> Yea, I think there's a valid case to be made for that. I'm still
>> inclined to go along with this, it seems we're otherwise just adding
>> complexity we're going to remove in a bit again.
>
> I'm not hearing anyone speaking against doing this now, so I'm going
> to go ahead with it.

+1 for going for it. Besides pg_ctl it would also help cluster management software. In Patroni we basically reimplement pg_ctl to get at the started postmaster pid, to detect a crash during postmaster startup earlier, and to be able to reliably cancel a start. Getting the current state from the pid file sounds better than having a loop poke the server with pg_isready.

To introduce feature creep, I would like to see more detailed states than proposed here. Specifically I'm interested in knowing when PM_WAIT_BACKENDS ends. When we lose quorum we restart PostgreSQL as a standby. We use a regular fast shutdown, but that can take a long time due to the shutdown checkpoint. The leader lease may run out during this time, so we would have to escalate to immediate shutdown or have a watchdog fence the node. If we knew that no user backends are left, we could let the shutdown checkpoint complete at leisure without risk of split brain.

Regards,
Ants Aasma
Re: [HACKERS] WIP: Data at rest encryption
On Wed, Jun 14, 2017 at 6:26 PM, Bruce Momjian wrote:
> Are you checking the CPU type or if AES instructions are enabled on the
> CPU? I ask this because I just realized in researching my new TLS talk
> that my BIOS defaults to AES instructions disabled, and I had to
> manually enable it.

There is zero code for that now, but the plan was to check the CPUID instruction. My understanding is that it should report what is currently enabled on the CPU. I will double check when actually writing the code for the check.

>>> I anticipate that one of the trickier problems here will be handling
>>> encryption of the write-ahead log. Suppose you encrypt WAL a block at
>>> a time. In the current system, once you've written and flushed a
>>> block, you can consider it durably committed, but if that block is
>>> encrypted, this is no longer true. A crash might tear the block,
>>> making it impossible to decrypt. Replay will therefore stop at the
>>> end of the previous block, not at the last record actually flushed as
>>> would happen today.
>>
>> My patch is currently doing a block at a time for WAL. The XTS mode
>
> Uh, how are you writing partial writes to the WAL. I assume you are
> doing a streaming cipher so you can write in increments, right?

We were doing 8kB page aligned writes to WAL anyway. I just encrypt the block before it gets written out.

>> I think we need to require wal_log_hints=on when encryption is
>> enabled. Currently I have not considered tearing on CLOG bits. Other
>> SLRUs probably have similar issues. I need to think a bit about how to
>> solve that.
>
> I am not sure if clog even needs to be encrypted.

Me neither, but it currently is, and it looks like that's broken in a "silently corrupts your data" way in the face of torn writes. Using OFB mode (xor the plaintext with a pseudorandom keystream produced by the cipher) looks like it might help here, if other approaches fail.
Regards,
Ants Aasma
Re: [HACKERS] WIP: Data at rest encryption
On Tue, Jun 13, 2017 at 6:35 PM, Robert Haas wrote:
> Of course, what would be even more useful is fine-grained encryption -
> encrypt these tables (and the corresponding indexes, toast tables, and
> WAL records related to any of that) with this key, encrypt these other
> tables (and the same list of associated stuff) with this other key,
> and leave the rest unencrypted. The problem with that is that you
> probably can't run recovery without all of the keys, and even on a
> clean startup there would be a good deal of engineering work involved
> in refusing access to tables whose key hadn't been provided yet.

That's pretty much the reason why I decided to skip anything more complicated for now. Anything that would be able to run recovery without knowing the encryption key looks like an order of magnitude more complicated to implement.

> Performance is likely to be poor on large databases,
> because every time a page transits between shared_buffers and the
> buffer cache we've got to en/decrypt, but as long as it's only poor
> for the people who opt into the feature I don't see a big problem with
> that.

It would make sense to tune the database with large shared buffers if encryption is enabled. That should make sure that most shared buffers traffic is going to disk anyway. As for performance, I have a prototype assembly implementation of AES that does 3GB/s/core on my laptop. If we add that behind a CPUID check the overhead should be quite reasonable.

> I anticipate that one of the trickier problems here will be handling
> encryption of the write-ahead log. Suppose you encrypt WAL a block at
> a time. In the current system, once you've written and flushed a
> block, you can consider it durably committed, but if that block is
> encrypted, this is no longer true. A crash might tear the block,
> making it impossible to decrypt. Replay will therefore stop at the
> end of the previous block, not at the last record actually flushed as
> would happen today.

My patch is currently doing a block at a time for WAL. The XTS mode used to encrypt has the useful property that blocks that share an identical prefix unencrypted also have an identical prefix when encrypted. It requires that the tearing is 16B aligned, but I think that is true for pretty much all storage systems. That property of course has security downsides, but for table/index storage we have a nonce in the form of the LSN in the page header, eliminating the issue.

> So, your synchronous_commit suddenly isn't. A
> similar problem will occur on any other page where we choose not to
> protect against torn pages using full page writes. For instance,
> unless checksums are enabled or wal_log_hints=on, we'll write a data
> page where a single bit has been flipped and assume that the bit will
> either make it to disk or not; the page can't really be torn in any
> way that hurts us. But with encryption that's no longer true, because
> the hint bit will turn into much more than a single bit flip, and
> rereading that page with half old and half new contents will be the
> end of the world (TM). I don't know off-hand whether we're
> protecting, say, CLOG page writes with FPWs: because setting a couple
> of bits is idempotent and doesn't depend on the existing page
> contents, we might not need it currently, but with encryption, every
> bit in the page depends on every other bit in the page, so we
> certainly would. I don't know how many places we've got assumptions
> like this baked into the system, but I'm guessing there are a bunch.

I think we need to require wal_log_hints=on when encryption is enabled. Currently I have not considered tearing on CLOG bits. Other SLRUs probably have similar issues. I need to think a bit about how to solve that.

Regards,
Ants Aasma
Re: [HACKERS] WIP: Data at rest encryption
On Mon, Jun 12, 2017 at 10:38 PM, Robert Haas wrote:
> On Mon, Jun 13, 2016 at 11:07 AM, Peter Eisentraut wrote:
>> On 6/7/16 9:56 AM, Ants Aasma wrote:
>>> Similar things can be achieved with filesystem level encryption.
>>> However this is not always optimal for various reasons. One of the
>>> better reasons is the desire for HSM based encryption in a storage
>>> area network based setup.
>>
>> Could you explain this in more detail?
>
> I don't think Ants ever responded to this point.
>
> I'm curious whether this is something that is likely to be pursued for
> PostgreSQL 11.

Yes, the plan is to pick it up again, Real Soon Now(tm). There are a couple of loose ends for stuff that should be encrypted but in the current state of the patch isn't yet (from the top of my head, logical decoding and pg_stat_statements write some files). The code handling keys could really take better precautions, as Peter pointed out in another e-mail. And I expect there to be a bunch of polishing work to make the APIs as good as they can be.

To answer Peter's question about HSMs: many enterprise deployments are on top of shared storage systems. For regulatory reasons, or to limit the security clearance of storage administrators, the data on shared storage should be encrypted. For there to be any point to this endeavor, the key needs to be stored somewhere else. This is where hardware security modules come in. They are basically hardware key storage appliances that can either output the key when requested or, for higher security, hold onto the key and perform encryption/decryption on behalf of the user. The patch enables the user to use a custom shell command to go and fetch the key from the HSM, for example using the KMIP protocol. Or a motivated person could write an extension that implements the encryption hooks to delegate encryption/decryption of blocks to an HSM.

Fundamentally there doesn't seem to be a big benefit to implementing the encryption at the PostgreSQL level instead of the filesystem. The patch doesn't take any real advantage of the higher level knowledge of the system, nor do I see much possibility for it to do that. The main benefit for us is that it's much easier to get a PostgreSQL based solution deployed.

I'm curious if the community thinks this is a feature worth having, even considering that security experts would classify this kind of encryption as a checkbox feature.

Regards,
Ants Aasma
Re: [HACKERS] WIP: Faster Expression Processing v4
On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund wrote:
>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>> what I see is multiple places jumping to the same indirect jump
>> instruction :-(. It's not a total disaster: as best I can tell, all the
>> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>> instructions that increment the "op" register and then do an indirect
>> jump :-(.
>
> Yea, I see some of that too - "usually" when there's more than just the
> jump in common. I think there's some gcc variables that influence this
> (min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
> worthwhile experimenting with setting them locally via a pragma or such.
> I think Ants wanted to experiment with that, too.

I haven't had the time to research this properly, but initial tests show that with GCC 6.2 adding

    #pragma GCC optimize ("no-crossjumping")

fixes merging of the op tail jumps. Some quick and dirty benchmarking suggests that the benefit for the interpreter is about 15% (5% speedup on a workload that spends 1/3 of its time in ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local vars before the indirect jump is somewhere between a tiny benefit and no effect, certainly not worth introducing extra complexity.

Clang 3.8 does the correct thing out of the box and is a couple of percent faster than GCC with the pragma.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian wrote:
> On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
>> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian wrote:
>>> Oh, that's why we will hopefully eventually change the page checksum
>>> algorithm to use the special CRC32 instruction, and set a new checksum
>>> version --- got it. I assume there is currently no compile-time way to
>>> do this.
>>
>> Using CRC32 as implemented now for the WAL would be significantly
>> slower than what we have now due to instruction latency. Even the best
>> theoretical implementation using the CRC32 instruction would still be
>> about the same speed as what we have now. I haven't seen anybody
>> working on swapping out the current algorithm. And I don't really see
>> a reason to, it would introduce a load of headaches for no real gain.
>
> Uh, I am confused. I thought you said we were leaving some performance
> on the table. What is that? I thought CRC32 was SSE4.1. Why is CRC32
> good for the WAL but bad for the page checksums? What about the WAL
> page images?

The page checksum algorithm was designed to take advantage of CPUs that provide vectorized 32-bit integer multiplication. On x86 this capability was introduced with the SSE4.1 extensions, which means that by default builds can't take advantage of the design. The code is written in a way that compiler auto-vectorization works on it, so only the appropriate compilation flags are needed to compile a version that does use vector instructions. However, to enable it in generic builds, a runtime switch between different levels of vectorization support is needed. This is what is leaving the performance on the table.

The page checksum algorithm we have is extremely fast, memcpy fast. Even without vectorization it is right up there with Murmurhash3a and xxHash; with vectorization it's 4x faster. And it works this fast on most modern CPUs, not only Intel. The downside is that it only works well for large blocks, and only for a fixed power-of-2 size with the current implementation. WAL page images have the page hole removed, so they can't easily take advantage of this.

That said, I haven't really seen either the hardware accelerated CRC32 calculation or the non-vectorized page checksum take a noticeable amount of time on real world workloads. The benchmarks presented in this thread seem to corroborate this observation.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 9:49 PM, Jim Nasby wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
>> In any case, we can't just build x86-64 packages with compile-time
>> SSE4.1 checks.
>
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Just transferring the block to the GPU would be slower than what we have now. Theoretically LLVM could be used to JIT the checksum calculation, but just precompiling a couple of versions and switching between them at runtime would be simpler and would give the same speedup.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian wrote:
> Oh, that's why we will hopefully eventually change the page checksum
> algorithm to use the special CRC32 instruction, and set a new checksum
> version --- got it. I assume there is currently no compile-time way to
> do this.

Using CRC32 as implemented now for the WAL would be significantly slower than what we have now due to instruction latency. Even the best theoretical implementation using the CRC32 instruction would still be about the same speed as what we have now. I haven't seen anybody working on swapping out the current algorithm. And I don't really see a reason to; it would introduce a load of headaches for no real gain.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 10:02 PM, Bruce Momjian wrote:
> Uh, as far as I know, the best you are going to get from llvm is
> standard assembly, while the SSE4.1 instructions use special assembly
> instructions, so they would be faster, and in a way they are a GPU built
> into CPUs.

Both LLVM and GCC are capable of compiling the code that we have into a vectorized loop using SSE4.1 or AVX2 instructions given the proper compilation flags. This is exactly what was giving the speedup in the test I showed in my e-mail.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>>> It might be worth looking into using the CRC CPU instruction to reduce this
>>> overhead, like we do for the WAL checksums. Since that is a different
>>> algorithm it would be a compatibility break and we would need to support
>>> the old algorithm for upgraded clusters.
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low hanging fruit would be to do CPU detection like in the CRC case and
>> enable an SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought we already did compile-time testing for SSE4.1 and used them
> if present. Why do you say "with default compiles there is a lot of
> performance left on table?"

Compile time checks don't help because the compiled binary could be run on a different host that does not have SSE4.1 (as extremely unlikely as that is at this point in time). A runtime check is done for WAL checksums, which use a special CRC32 instruction. Block checksums predate that and use a different algorithm that was picked because it could be accelerated with vectorized execution on non-Intel architectures too. We just never got around to adding runtime checks for the architecture to enable this speedup.

The attached test runs 1M iterations of the checksum about 3x faster when compiled with SSE4.1 and vectorization, 4x if AVX2 is added into the mix.

Test:

    gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum && time ./testchecksum

Results:

    CFLAGS="-O2":                           2.364s
    CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
    CFLAGS="-O2 -mavx2 -ftree-vectorize":   0.552s

That 0.552s is 15GB/s per core on a 3 year old laptop.

Regards,
Ants Aasma

testchecksum.c:

    #include "postgres.h"
    #include "storage/checksum_impl.h"

    int main(void)
    {
        char page[8192] = {0};
        uint32 i, sum = 0;

        for (i = 0; i < N; i++)
            sum ^= pg_checksum_page(page, i);
        return 0;
    }
Re: [HACKERS] Checksums by default?
On Wed, Jan 25, 2017 at 8:18 PM, Robert Haas wrote:
> Also, it's not as if there are no other ways of checking whether your
> disks are failing. SMART, for example, is supposed to tell you about
> incipient hardware failures before PostgreSQL ever sees a bit flip.
> Surely an average user would love to get a heads-up that their
> hardware is failing even when that hardware is not being used to power
> PostgreSQL, yet many people don't bother to configure SMART (or
> similar proprietary systems provided by individual vendors).

You really can't rely on SMART to tell you about hardware failures. 1 in 4 drives fails completely with zero SMART indication [1]. And for the 1 in 1000 annual checksum failure rate, all indicators other than system restarts had only a weak correlation [2]. And this is without counting filesystem and other OS bugs that SMART knows nothing about.

My view may be biased by mostly seeing the cases where things have already gone wrong, but I recommend that support clients turn checksums on unless it's known that write IO is going to be an issue. Especially because I know that if it turns out to be a problem I can go in and quickly hack together a tool to help them turn it off. I do agree that to change the PostgreSQL default, at least some tool to turn it off online should be included.

[1] https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
[2] https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Tue, Jan 24, 2017 at 4:07 AM, Tom Lane wrote:
> Peter Geoghegan writes:
>> I thought that checksums went in in part because we thought that there
>> was some chance that they'd find bugs in Postgres.
>
> Not really. AFAICS the only point is to catch storage-system malfeasance.

This matches my understanding. Actual physical media errors are caught by lower level checksums/error correction codes, and memory errors are caught by ECC RAM. Checksums do very little for PostgreSQL bugs, which leaves only filesystem and storage firmware bugs. However, the latter are still reasonably common faults. I have seen multiple cases where, after reviewing the corruption with a hex editor, the only reasonable conclusion was a bug in the storage system: data shifted around by non-page-size amounts, non-page-aligned extents that are zeroed out, etc. Unfortunately none of those customers had checksums turned on at the time.

I feel that reacting to such errors with a non-cryptic and easily debuggable checksum error is much better than erroring out with huge memory allocations, crashing, or returning bogus data. Timely reaction to data corruption is really important for minimizing data loss.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Sat, Jan 21, 2017 at 10:16 PM, Michael Banck wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table.
>> A low hanging fruit would be to do CPU detection like in the CRC case
>> and enable an SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> I am afraid that won't fly with most end-user packages, cause
> distributions can't just build packages on their newest machine and then
> users get SIGILL or whatever cause their 2014 server doesn't have that
> instruction, or would they still work?
>
> So you would have to do runtime detection of the CPU features, and use
> the appropriate code if they are available. Which also makes regression
> testing harder, as not all codepaths would get exercised all the time.

Runtime detection is exactly what I had in mind. The code path would also be the same, as at least the two most important compilers only need a compilation flag change. And the required instructions were introduced in 2007, so I think anybody who is concerned about performance is covered.
Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Sat, Jan 21, 2017 at 7:39 PM, Petr Jelinek wrote:
> So in summary "postgresql.conf options are easy to change" while "initdb
> options are hard to change", I can see this argument used both for
> enabling or disabling checksums by default. As I said I would be less
> worried if it was easy to turn off, but we are not there afaik. And even
> then I'd still want a benchmark first.

Adding support for disabling checksums is almost trivial, as it only requires flipping a value in the control file. And I have sitting around somewhere a similarly simple tool for turning on checksums while the database is offline. FWIW, based on customers and fellow conference goers I have talked to, most would gladly take the performance hit, but not the downtime to turn it on on an existing database.

Regards,
Ants Aasma
Re: [HACKERS] Checksums by default?
On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson wrote:
> On 01/21/2017 04:48 PM, Stephen Frost wrote:
>> * Fujii Masao (masao.fu...@gmail.com) wrote:
>>> If the performance overhead by the checksums is really negligible,
>>> we may be able to get rid of the wal_log_hints parameter, as well.
>>
>> Prior benchmarks showed it to be on the order of a few percent, as I
>> recall, so I'm not sure that we can say it's negligible (and that's not
>> why Magnus was proposing changing the default).
>
> It might be worth looking into using the CRC CPU instruction to reduce this
> overhead, like we do for the WAL checksums. Since that is a different
> algorithm it would be a compatibility break and we would need to support the
> old algorithm for upgraded clusters..

We looked at that when picking the algorithm. At that point it seemed that CRC CPU instructions were not universal enough to rely on them. The algorithm we ended up on was designed to be fast on SIMD hardware. Unfortunately on x86-64 that required SSE4.1 integer instructions, so with default compiles there is a lot of performance left on the table. A low hanging fruit would be to do CPU detection like in the CRC case and enable an SSE4.1 optimized variant when those instructions are available. IIRC it was actually a lot faster than the naive hardware CRC that is used for WAL and about on par with interleaved CRC.

That said, the actual checksum calculation was not a big issue for performance. The only way to make it really matter was with a larger-than-shared-buffers, smaller-than-RAM workload with tiny per-page execution overhead. My test case was SELECT COUNT(*) on wide rows with a small fill factor. Having to WAL log hint bits is the major issue.

Regards,
Ants Aasma
Re: [HACKERS] Causal reads take II
On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro wrote:
>>> Long term, I think it would be pretty cool if we could develop a set
>>> of features that give you distributed sequential consistency on top of
>>> streaming replication. Something like (this | causality-tokens) +
>>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>>> distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making
>> the transaction visible on the master? I'm asking because the per-tx
>> variable wait time between logging the commit record and making the
>> transaction visible makes it really hard to provide matching
>> visibility order on master and standby.
>
> Yeah, that does seem problematic. Even with async replication or no
> replication, isn't there already a race in CommitTransaction() where
> two backends could reach RecordTransactionCommit() in one order but
> ProcArrayEndTransaction() in the other order? AFAICS using
> synchronous replication in one of the transactions just makes it more
> likely you'll experience such a visibility difference between the DO
> and REDO histories (!), by making RecordTransactionCommit() wait.
> Nothing prevents you getting a snapshot that can see t2 but not t1 in
> the DO history, while someone doing PITR or querying an asynchronous
> standby gets a snapshot that can see t1 but not t2 because those
> replay the REDO history.

Yes, there is a race even with all transactions having the same synchronization level. But nobody will mind if we some day fix that race. :) With different synchronization levels it is much trickier to fix, as either async commits must wait behind sync commits before becoming visible, return without becoming visible, or the visibility order must differ from commit record LSN order. The first makes the async commit feature useless, the second seems a no-go for semantic reasons, and the third requires extra information to be sent to standbys so they know the actual commit order.

>> In CSN based snapshot
>> discussions we came to the conclusion that to make standby visibility
>> order match the master while still allowing async transactions to
>> become visible before they are durable, we need to make the commit
>> sequence a vector clock and transmit extra visibility ordering
>> information to standbys. Having one more level of delay between WAL
>> logging of the commit and making it visible would make the problem
>> even worse.
>
> I'd like to read that... could you please point me at the right bit of
> that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is here:

https://www.postgresql.org/message-id/CA+CSw_vbt=cwluogr7gxdpnho_y4cz7x97+o_bh-rfo7ken...@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would
>> require either keeping some per-client state or a global agreement on
>> what snapshots are safe to provide, both of which you tried to avoid
>> for this feature.
>
> Agreed. You briefly mentioned this problem in the context of pairs of
> read-only transactions a while ago[1]. As you said then, it does seem
> plausible to do that with a token system that gives clients the last
> commit LSN from the snapshot used by a read only query, so that you
> can ask another standby to make sure that LSN has been replayed before
> running another read-only transaction. This could be handled
> explicitly by a motivated client that is talking to multiple nodes. A
> more general problem is client A telling client B to go and run
> queries and expecting B to see all transactions that A has seen; it
> now has to pass the LSN along with that communication, or rely on some
> kind of magic proxy that sees all transactions, or a radically
> different system with a GTM.

If/when we do CSN based snapshots, adding a GTM could be relatively straightforward. It's basically not all that far from what Spanner is doing by using a timestamp as the snapshot. But this is all relatively independent of this patch.

Regards,
Ants Aasma
Re: [HACKERS] Causal reads take II
On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro wrote:
> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set
> of features that give you distributed sequential consistency on top of
> streaming replication. Something like (this | causality-tokens) +
> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
> distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making the transaction visible on the master? I'm asking because the per-tx variable wait time between logging the commit record and making the transaction visible makes it really hard to provide matching visibility order on master and standby. In CSN based snapshot discussions we came to the conclusion that to make standby visibility order match the master, while still allowing async transactions to become visible before they are durable, we need to make the commit sequence a vector clock and transmit extra visibility ordering information to standbys. Having one more level of delay between WAL logging of the commit and making it visible would make the problem even worse.

One other thing that might be an issue for some users is that this patch only ensures that clients observe forward progress of database state after a writing transaction. With two consecutive read-only transactions that go to different servers, a client could still observe database state going backwards. It seems that fixing that would require either keeping some per-client state or a global agreement on what snapshots are safe to provide, both of which you tried to avoid for this feature.
Regards,
Ants Aasma
Re: [HACKERS] emergency outage requiring database restart
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure wrote:
> Still getting checksum failures. Over the last 30 days, I see the
> following. Since enabling checksums FWICT none of the damage is
> permanent and rolls back with the transaction.

So creepy! The checksums still only differ in the least significant digits, which pretty much means that there is a block number mismatch. So if you rule out the filesystem not doing its job correctly and transposing blocks, it could be something else that results in blocks getting read from a location that happens to differ by a small multiple of the page size. Maybe somebody is racily mucking with table fd's between seeking and reading. That would explain the issue disappearing after a retry.

Maybe you can arrange for the RelFileNode and block number to be logged for the checksum failures and check what the actual checksums are in the data file surrounding the failed page. If the requested block number contains something completely else, but the page that follows contains the expected checksum value, then it would support this theory.

Regards,
Ants Aasma
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On 5 Jan 2017 2:54 a.m., "Craig Ringer" wrote:
> Ants, do you think you'll have a chance to convert your shell script
> test into a TAP test in src/test/recovery?
>
> Simon has said he would like to commit this fix. I'd personally be
> happier if it had a test to go with it.
>
> You could probably just add to src/test/recovery/t/001 which now
> contains my additions for hot standby.

Do you feel the test in the attached patch is enough or would you like to see anything else covered?

Regards,
Ants Aasma

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c6b54ec..0ab936f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,10 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot. (If we're not using a slot it's harmless to
+ * send a feedback message explicitly setting InvalidTransactionId).
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,14 +1196,17 @@
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
+	 * If Hot Standby is not yet accepting connections there is nothing to
+	 * send. Check this after the interval has expired to reduce number of
+	 * calls.
+	 *
+	 * Bailing out here also ensures that we don't send feedback until we've
+	 * read our own replication slot state, so we don't tell the master to
+	 * discard needed xmin or catalog_xmin from any slots that may exist
+	 * on this replica.
 	 */
 	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
 		return;
-	}
 
 	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index eef512d..7fb2e9e 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 22;
+use Test::More tests => 24;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -161,3 +161,22 @@ is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback reset'
 ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
 is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
 is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback reset');
+
+diag "re-enabling hot_standby_feedback and disabling while stopped";
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_2->reload;
+
+$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
+replay_check();
+
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_2->stop;
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+isnt($xmin, '', 'cascaded slot xmin non-null with postgres shut down');
+
+# Xmin from a previous run should be cleared on startup.
+$node_standby_2->start;
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin reset after startup with hs feedback reset');
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On 5 Jan 2017 2:54 a.m., "Craig Ringer" wrote:
> On 2 January 2017 at 22:24, Craig Ringer wrote:
>> On 2 Jan. 2017 20:20, "Simon Riggs" wrote:
>>> On 21 December 2016 at 13:23, Simon Riggs wrote:
>>>> Fix it up and I'll commit. Thanks for the report.
>>>
>>> I was hoping for some more effort from Ants to correct this.
>>>
>>> I'll commit Craig's new tests for hs feedback before this, so it can
>>> go in with a TAP test, so I imagine we're about a week away from
>>> commit on this.
>>
>> I posted a revised version of Ants's patch that passes the shell script
>> test.
>>
>> A TAP test would likely make sense though, I agree.
>
> Ants, do you think you'll have a chance to convert your shell script
> test into a TAP test in src/test/recovery?
>
> Simon has said he would like to commit this fix. I'd personally be
> happier if it had a test to go with it.
>
> You could probably just add to src/test/recovery/t/001 which now
> contains my additions for hot standby.

I'm travelling right now, but I should be able to give it a shot next week.

Regards,
Ants Aasma
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On Wed, Dec 21, 2016 at 2:09 PM, Craig Ringer wrote:
> On 21 December 2016 at 15:40, Ants Aasma wrote:
>>> So -1 on this part of the patch, unless there's something I've
>>> misunderstood.
>>
>> Currently there was no feedback sent if hot standby was not active. I
>> was not sure if it was safe to call GetOldestXmin() in that case.
>> However I did not consider cascading replica slots wanting to hold
>> back xmin, where resetting the parent's xmin is indeed wrong. Do you
>> know if GetOldestXmin() is safe at this point and we can just remove
>> the HotStandbyActive() check? Otherwise I think the correct approach
>> is to move the check and return inside the hot_standby_feedback case
>> like this:
>>
>> if (hot_standby_feedback)
>> {
>>     if (!HotStandbyActive())
>>         return;
>
> I feel like I'm missing something obvious here. If we force sending
> hot standby feedback at least once, by assuming
> master_has_standby_xmin = true at startup, why isn't that sufficient?
> We'll send InvalidTransactionId if hot_standby_feedback is off. Isn't
> that the point?
>
> It's safe to call GetOldestXmin pretty much as soon as we load the
> recovery start checkpoint. It won't consider the state of replication
> slots until later in startup, but that's a pre-existing flaw that
> should be addressed separately.
>
> Why do we need to do more than master_has_standby_xmin = true?

There was a !HotStandbyActive() check there previously; I was not sure if it was only there to avoid sending useless messages, or was necessary because something was not initialized otherwise. Looks like it is not needed and can be removed. Revised patch that does so attached.

Regards,
Ants Aasma

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..84ffa91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,16 +1195,6 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
 	 */
Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
On Wed, Dec 21, 2016 at 4:14 AM, Craig Ringer wrote:
> Re the patch, I don't like
>
> -	static bool master_has_standby_xmin = false;
> +	static bool master_has_standby_xmin = true;
>
> without any comment. It's addressed in the comment changes on the
> header func, but the link isn't obvious. Maybe a oneliner to say
> "ensure we always send at least one feedback message"?
>
> I think this part of the patch is correct and useful.
>
> I don't see the point of
>
> +	 *
> +	 * If Hot Standby is not yet active we reset the xmin value. Check this
> +	 * after the interval has expired to reduce number of calls.
> 	 */
> -	if (hot_standby_feedback)
> +	if (hot_standby_feedback && HotStandbyActive())
>
> though. Forcing feedback to send InvalidTransactionId until hot
> standby feedback is actively running seems counterproductive; we want
> to lock in feedback as soon as possible, not wait until we're
> accepting transactions. Simon and I are in fact working on changes to
> do the opposite of what you've got here and ensure that we don't allow
> hot standby connections until we know hot_standby_feedback is in
> effect and a usable xmin is locked in. That way the master won't
> remove tuples we need as soon as we start an xact and cause a conflict
> with recovery.
>
> If there are no running xacts, GetOldestXmin() will return the slot
> xmin if any. We should definitely not be clearing that just because
> we're not accepting hot standby connections yet; we want to make very
> sure it remains in effect unless the user explicitly de-configures hot
> standby.
>
> (There's another pre-existing problem there, where we can start the
> walsender before slots are initialized, that I'll be addressing
> separately).
>
> If there's no slot xmin either, we'll send nextXid. That's a sensible
> starting point for hot standby feedback until we start some
> transactions.
>
> So -1 on this part of the patch, unless there's something I've misunderstood.

Currently there was no feedback sent if hot standby was not active. I was not sure if it was safe to call GetOldestXmin() in that case. However I did not consider cascading replica slots wanting to hold back xmin, where resetting the parent's xmin is indeed wrong. Do you know if GetOldestXmin() is safe at this point and we can just remove the HotStandbyActive() check? Otherwise I think the correct approach is to move the check and return inside the hot_standby_feedback case like this:

    if (hot_standby_feedback)
    {
        if (!HotStandbyActive())
            return;

Regards,
Ants Aasma
[HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down
I just had a client issue with table bloat that I traced back to a stale xmin value in a replication slot. The xmin value from hot standby feedback is stored in the replication slot and used for the vacuum xmin calculation. If hot standby feedback is turned off while the walreceiver is active, then the xmin gets reset by an HS feedback message containing InvalidTransactionId. However, if feedback gets turned off while the standby is shut down, this message never gets sent and a stale value gets left behind in the replication slot, holding back vacuum.

The simple fix seems to be to always send out at least one feedback message on each connect, regardless of the hot_standby_feedback setting. Patch attached. Looks like this goes back to version 9.4. It could conceivably cause issues for replication middleware that does not know how to handle hot standby feedback messages. Not sure if any exist and if that is a concern.

A shell script to reproduce the problem is also attached; adjust the PGPATH variable to your postgres install and run it in an empty directory.

Regards,
Ants Aasma

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..31333ec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,20 +1194,13 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
+	 *
+	 * If Hot Standby is not yet active we reset the xmin value. Check this
+	 * after the interval has expired to reduce number of calls.
 	 */
-	if (hot_standby_feedback)
+	if (hot_standby_feedback && HotStandbyActive())
 		xmin = GetOldestXmin(NULL, false);
 	else
 		xmin = InvalidTransactionId;

slot-xmin-not-reset-reproduce.sh
Description: Bourne shell script
Re: [HACKERS] emergency outage requiring database restart
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure wrote:
> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page
> verification failed, calculated checksum 37251 but expected 37244
> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING: page
> verification failed, calculated checksum 37249 but expected 37244
> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING: page
> verification failed, calculated checksum 44363 but expected 44364
> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING: page
> verification failed, calculated checksum 49525 but expected 49539
> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING: page
> verification failed, calculated checksum 37345 but expected 37340

The checksum values are improbably close. The checksum algorithm has decently good mixing of all bits in the page; having the first byte match in 5 checksums makes this about 1 in 2^40 improbable. What is not mixed in properly is the block number: it only gets xor'ed in before the value is packed into 16 bits using modulo 0xFFFF. So I'm pretty sure different block numbers were used for writing out and reading in the page. Either the blocknum gets corrupted between calculating the checksum and writing the page out (unlikely given the proximity), or the pages are somehow getting transposed in the storage.

Regards,
Ants Aasma
Re: [HACKERS] Pluggable storage
On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund wrote:
> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>> I am somewhat inclined to
>> believe that we need to restructure the executor in a bigger way so
>> that it passes around datums instead of tuples; I'm inclined to
>> believe that the current tuple-centric model is probably not optimal
>> even for the existing storage format.
>
> I actually prototyped that, and it's not an easy win so far. Column
> extraction cost, even after significant optimization, is still often a
> significant portion of the runtime. And e.g. projection only extracting
> all columns, after evaluating a restrictive qual referring to an "early"
> column, can be a significant win. We'd definitely have to give up on
> extracting columns 0..n when accessing later columns...

Hm. What about going even further than [1] in converting the executor to being opcode based, and merging projection and qual evaluation into a single pass? The optimizer would then have some leeway about how to order column extraction and qual evaluation. It might even be worth it to special-case some functions as separate opcodes (e.g. int4eq, timestamp_lt).

Regards,
Ants Aasma

[1] https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de
Re: [HACKERS] Proposal for CSN based snapshots
On Wed, Aug 10, 2016 at 7:54 PM, Alexander Korotkov wrote:
> Oh, I found that I underestimated complexity of async commit... :)
>
> Do I understand right that async commit now works as follows?
> 1) An async transaction confirms commit before flushing WAL.
> 2) Other transactions see the effects of an async transaction only when
> its WAL is flushed.
> 3) In the session which just committed an async transaction, the effect
> of this transaction is visible immediately (before WAL is flushed). Is
> it true?

Current code, simplified:

    XactLogCommitRecord();
    if (synchronous_commit)
        XLogFlush();
    ProcArrayEndTransaction();    /* become visible */

The issue we are discussing is that with CSNs, the "become visible" portion must occur in CSN order. If CSN == LSN, then async transactions that have their commit record after a sync record must wait for the sync record to flush before becoming visible. The simplest solution is to not require CSN == LSN and just assign a CSN value immediately before becoming visible.

Regards,
Ants Aasma
Re: [HACKERS] Proposal for CSN based snapshots
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas wrote: > Hmm. There's one more possible way this could all work. Let's have CSN == > LSN, also for asynchronous commits. A snapshot is the current insert > position, but also make note of the current flush position, when you take a > snapshot. Now, when you use the snapshot, if you ever see an XID that > committed between the snapshot's insert position and the flush position, > wait for the WAL to be flushed up to the snapshot's insert position at that > point. With that scheme, an asynchronous commit could return to the > application without waiting for a flush, but if someone actually looks at > the changes the transaction made, then that transaction would have to wait. > Furthermore, we could probably skip that waiting too, if the reading > transaction is also using synchronous_commit=off. > > That's slightly different from the current behaviour. A transaction that > runs with synchronous_commit=on, and reads data that was modified by an > asynchronous transaction, would take a hit. But I think that would be > acceptable. My proposal of vector clocks would allow for pretty much exactly current behavior. To simplify, there would be lastSyncCommitSeqNo and lastAsyncCommitSeqNo variables in ShmemVariableCache. Transaction commit would choose which one to update based on synchronous_commit setting and embed the value of the setting into CSN log. Snapshots would contain both values, when checking for CSN visibility use the value of the looked up synchronous_commit setting to decide which value to compare against. Standby's replaying commit records would just update both values, resulting in transactions becoming visible in xlog order, as they do today. The scheme would allow for inventing a new xlog record/replication message communicating visibility ordering. However I don't see why inventing a separate CSN concept is a large problem. 
Quite the opposite, unless there is a good reason that I'm missing, it seems better to not unnecessarily conflate commit record durability and transaction visibility ordering. Not having them tied together allows for an external source to provide CSN values, allowing for interesting distributed transaction implementations. E.g. using a timestamp as the CSN à la Google Spanner and the TrueTime API.

Regards,
Ants Aasma
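The two-counter scheme from my previous message could be sketched as follows. The structures and names are invented for illustration; the real lastSyncCommitSeqNo/lastAsyncCommitSeqNo would live in ShmemVariableCache and the per-xact sync flag in the CSN log:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A snapshot records both commit counters at the time it is taken. */
typedef struct
{
    uint64_t sync_csn;   /* last synchronous commit seen */
    uint64_t async_csn;  /* last asynchronous commit seen */
} Snapshot;

/* For a committed xact we know its CSN and whether it committed with
 * synchronous_commit on (that flag would be embedded in the CSN log). */
typedef struct
{
    uint64_t csn;
    bool     was_sync;
} CommittedXact;

/* Visibility is checked against the counter matching the committing
 * transaction's own synchronous_commit setting. */
static bool
xact_visible(const Snapshot *snap, const CommittedXact *x)
{
    uint64_t bound = x->was_sync ? snap->sync_csn : snap->async_csn;

    return x->csn <= bound;
}
```

Because the async counter can run ahead of the sync one, an async commit can be visible to a snapshot while a sync commit with the same CSN is not, mirroring today's behavior.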
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra wrote: > 2) combining multiple statistics > > I think the ability to combine multivariate statistics (covering different > subsets of conditions) is important and useful, but I'm starting to think > that the current implementation may not be the correct one (which is why I > haven't written the SGML docs about this part of the patch series yet). While researching this topic a few years ago I came across a paper on this exact topic called "Consistently Estimating the Selectivity of Conjuncts of Predicates" [1]. While effective, it seems to be quite heavyweight, so it would probably need support for tiered optimization. [1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma
Re: [HACKERS] Proposal for CSN based snapshots
sary, and on the other side, calculate optimistic oldest snapshot, publish and then recheck, if necessary, republish. * While commits are not as performance critical it may still be beneficial to defer clog updates and perform them in larger batches to reduce clog lock traffic. I think I have achieved some clarity on my idea of snapshots that migrate between being CSN and XID based. The essential problem with old CSN based snapshots is that the CSN log for their xmin-xmax interval needs to be kept around in a quick-to-access data structure. The problem with long running transactions is twofold: first, they need a well-defined location in which to publish the visibility info for their xids, and second, they enlarge the xmin-xmax interval of all concurrent snapshots, requiring a potentially huge amount of CSN log to be kept around. One way to deal with it is to just accept it and use an SLRU as in this patch. My new insight is that a snapshot doesn't need to be either-or CSN or XIP (xid in progress) array based, but it can also be a hybrid. There would be a sorted array of in progress xids and a non-overlapping CSN xmin-xmax interval where the CSN log needs to be consulted. As snapshots age they scan the CSN log and incrementally build their XIP array, basically lazily constructing the same data structure used in snapshots today. Old in-progress xids need a slot for themselves to be kept around, but all new snapshots being taken would immediately classify them as old, store them in XIP and not include them in their CSN xmin. Under this scheme the amount of CSN log strictly needed is a reasonably sized ring buffer for recent xids (probably sized based on max_connections), a sparse map for long transactions (bounded by max_connections) and some slack for snapshots still being converted.
An SLRU or similar is still needed because there is no way to ensure timely conversion of all snapshots, but by properly timing the notification the probability of actually using it should be extremely low. Data structure maintenance operations occur naturally on xid assignment and are easily batched. Long running transactions still affect all snapshots, but the effect is small and not dependent on transaction age; old snapshots only affect themselves. Subtransactions are still a pain, and would need something similar to the current suboverflowed scheme. On the plus side, it seems like the number of subxids to overflow could be global, not per backend.

In short:

* A directly mapped ring buffer for recent xids could be accessed lock free and would take care of most of the traffic in most of the cases. Looks like it would be a good trade-off for complexity/performance.
* To keep workloads with wildly varying transaction lengths in a bounded amount of memory, a significantly more complex hybrid snapshot scheme is needed. It remains to be seen if these workloads are actually a significant issue. But if they are, the approach described might provide a way out.

Regards,
Ants Aasma
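To illustrate the hybrid snapshot idea, here is a rough sketch of what the visibility check might look like. Every structure and function name here is invented; the real thing would of course deal with xid wraparound, the actual CSN log lookup, and concurrency:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Sentinel meaning "no commit CSN recorded", i.e. aborted or still running. */
#define INVALID_CSN UINT64_MAX

/* Hybrid snapshot: a sorted array of old in-progress xids plus a CSN
 * interval [csn_xmin, csn_xmax) for which the CSN log is consulted. */
typedef struct
{
    uint32_t *xip;       /* sorted array of old in-progress xids */
    int       nxip;
    uint64_t  csn_xmin;  /* below this, commits predate the window */
    uint64_t  csn_xmax;  /* the snapshot's own CSN position */
} HybridSnapshot;

static int
cmp_xid(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a;
    uint32_t y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* csn_of_xid stands in for the result of a CSN log lookup. */
static bool
xid_visible(const HybridSnapshot *snap, uint32_t xid, uint64_t csn_of_xid)
{
    if (bsearch(&xid, snap->xip, (size_t) snap->nxip,
                sizeof(uint32_t), cmp_xid) != NULL)
        return false;               /* tracked old in-progress xid */
    if (csn_of_xid == INVALID_CSN)
        return false;               /* aborted or not yet committed */
    if (csn_of_xid < snap->csn_xmin)
        return true;                /* committed before the CSN window */
    return csn_of_xid < snap->csn_xmax; /* CSN log consulted inside window */
}
```

As the snapshot ages and converts, xids migrate from the CSN interval into the XIP array, shrinking the range of CSN log that must be retained.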
Re: [HACKERS] WIP: Data at rest encryption
On Mon, Jun 13, 2016 at 5:17 AM, Michael Paquier wrote: > On Sun, Jun 12, 2016 at 4:13 PM, Ants Aasma wrote: >>> I feel separate file is better to include the key data instead of pg_control >>> file. >> >> I guess that would be more flexible. However I think at least the fact >> that the database is encrypted should remain in the control file to >> provide useful error messages for faulty backup procedures. > > Another possibility could be always to do some encryption at data-type > level for text data. For example I recalled the following thing while > going through this thread: > https://github.com/nec-postgres/tdeforpg > Though I don't quite understand the use for encrypt.enable in this > code... This has the advantage to not patch upstream. While certainly possible, this does not cover the requirements I want to satisfy - user data never gets stored on disk unencrypted without making changes to the application or schema. This seems to be mostly about separating administrator roles, specifically that centralised storage and backup administrators should not have access to database contents. I see this as orthogonal to per column encryption, which in my opinion is better done in the application.

Regards,
Ants Aasma
Re: [HACKERS] WIP: Data at rest encryption
On Fri, Jun 10, 2016 at 5:23 AM, Haribabu Kommi wrote: > 1. Instead of doing the entire database files encryption, how about > providing user an option to protect only some particular tables that > wants the encryption at table/tablespace level. This not only provides > an option to the user, it reduces the performance impact on tables > that doesn't need any encryption. The problem with this approach > is that every xlog record needs to validate to handle the encryption > /decryption, instead of at page level. Is there a real need for this? The customers I have talked to want to encrypt the whole database and my goal is to make the feature fast enough to make that feasible for pretty much everyone. I guess switching encryption off per table would be feasible, but the key setup would still need to be done at server startup. Per record encryption would result in some additional information leakage though. Overall I thought it would not be worth it, but I'm willing to have my mind changed on this. > 2. Instead of depending on a contrib module for the encryption, how > about integrating pgcrypto contrib in to the core and add that as a > default encryption method. And also provide an option to the user > to use a different encryption methods if needs. Technically that would be simple enough, this is more of a policy decision. I think having builtin encryption provided by pgcrypto is completely fine. If a consensus emerges that it needs to be integrated, it would need to be a separate patch anyway. > 3. Currently entire xlog pages are encrypted and stored in the file. > can pg_xlogdump works with those files? Technically yes, with the patch as it stands, no. Added this to my todo list. > 4. For logical decoding, how about the adding a decoding behavior > based on the module to decide whether data to be encrypted/decrypted. The data to be encrypted does not depend on the module used, so I don't think it should be module controlled. 
The reorder buffer contains pretty much the same stuff as the xlog, so not encrypting it does not look like a valid choice. For logical heap rewrites it could be argued that nothing useful is leaked in most cases, but encrypting it is not hard. Just a small matter of programming. > 5. Instead of providing passphrase through environmental variable, > better to provide some options to pg_ctl etc. That looks like it would be worse from a security perspective. Integrating a passphrase prompt would be an option, but a way for scripts to provide passphrases would still be needed. > 6. I don't have any idea whether is it possible to integrate the checksum > and encryption in a single shot to avoid performance penalty. Currently no, the checksum gets stored in the page header and for any decent cipher mode the encryption of the rest of the page will depend on it. However, the performance difference should be negligible because both algorithms are compute bound for cached data. The data is very likely to be completely in L1 cache as the operations are done in quick succession. The non-cryptographic checksum algorithm could actually be an attack vector for an adversary that can trigger repeated encryption by tweaking a couple of bytes at the end of the page to see when the checksum matches and try to infer the data from that. Similarly to the CRIME attack. However the LSN stored at the beginning of the page header basically provides a nonce that makes this impossible. This also means that encryption needs to imply wal_log_hints. Will include this in the next version of the patch. >> I would also like to incorporate some database identifier as a salt in >> key setup. However, system identifier stored in control file doesn't >> fit this role well. It gets initialized somewhat too late in the >> bootstrap process, and more importantly, gets changed on pg_upgrade. >> This will make link mode upgrades impossible, which seems like a no >> go. 
I'm torn whether to add a new value for this purpose (perhaps >> stored outside the control file) or allow setting of system identifier >> via initdb. The first seems like a better idea, the file could double >> as a place to store additional encryption parameters, like key length >> or different cipher primitive. > > I feel separate file is better to include the key data instead of pg_control > file. I guess that would be more flexible. However I think at least the fact that the database is encrypted should remain in the control file to provide useful error messages for faulty backup procedures. Thanks for your input.

Regards,
Ants Aasma
[HACKERS] WIP: Data at rest encryption
Hi all,

I have been working on data-at-rest encryption support for PostgreSQL. In my experience this is a common request that customers make. The short of the feature is that all PostgreSQL data files are encrypted with a single master key and are decrypted when read from the OS. It does not provide column level encryption, which is an almost orthogonal feature, arguably better done client side. Similar things can be achieved with filesystem level encryption. However this is not always optimal for various reasons. One of the better reasons is the desire for HSM based encryption in a storage area network based setup. Attached to this mail is a work in progress patch that adds an extensible encryption mechanism. There are some loose ends left to tie up, but the general concept and architecture is at a point where it's ready for some feedback, fresh ideas and bikeshedding.

Usage
=====

Set up database like so:

    (read -sp "Postgres passphrase: " PGENCRYPTIONKEY; echo;
     export PGENCRYPTIONKEY
     initdb -k -K pgcrypto $PGDATA )

Start PostgreSQL:

    (read -sp "Postgres passphrase: " PGENCRYPTIONKEY; echo;
     export PGENCRYPTIONKEY
     postgres $PGDATA )

Design
======

The patch adds a new GUC called encryption_library, when specified the named library is loaded before shared_preload_libraries and is expected to register its encryption routines. For now the API is pretty narrow, one parameterless function that lets the extension do key setup on its own terms, and two functions for encrypting/decrypting an arbitrary sized block of data with a tweak. The tweak should alter the encryption function so that identical block contents are encrypted differently based on their location. The GUC needs to be set at bootstrap time, so it gets set by a new option for initdb. During bootstrap an encryption sample gets stored in the control file, enabling useful error messages. The library name is not stored in controldata. I'm not quite sure about this decision.
On one hand it would be very useful to tell the user what he needs to get at his data if the configuration somehow goes missing and it would get rid of the extra GUC. On the other hand I don't really want to bloat control data, and the same encryption algorithm could be provided by different implementations. For now the encryption is done for everything that goes through md, xlog and slru. Based on a review of read/write/fread/fwrite calls this list is missing: * BufFile - needs refactoring * Logical reorder buffer serialization - probably needs a stream mode cipher API addition. * logical_heap_rewrite - can be encrypted as one big block * 2PC state data - ditto * pg_stat_statements - query texts get appended so a stream mode cipher might be needed here too. copydir needed some changes too because tablespace and database oid are included in the tweak and so copying also needs to decrypt and encrypt with the new tweak value. For demonstration purposes I imported Brian Gladman's AES-128-XTS mode implementation into pgcrypto and used an environment variable for key setup. This part is not really in any reviewable state, the XTS code needs heavy cleanup to bring it up to PostgreSQL coding standards, keysetup needs something secure, like PBKDF2 or scrypt. Performance with current AES implementation is not great, but not horrible either, I'm seeing around 2x slowdown for larger than shared_buffers, smaller than free memory workloads. However the plan is to fix this - I have a prototype AES-NI implementation that does 3GB/s per core on my Haswell based laptop (1.25 B/cycle). Open questions == The main questions is what to do about BufFile? It currently provides both unaligned random access and a block based interface. I wonder if it would be a good idea to refactor it to be fully block based under the covers. I would also like to incorporate some database identifier as a salt in key setup. However, system identifier stored in control file doesn't fit this role well. 
It gets initialized somewhat too late in the bootstrap process, and more importantly, gets changed on pg_upgrade. This will make link mode upgrades impossible, which seems like a no go. I'm torn whether to add a new value for this purpose (perhaps stored outside the control file) or allow setting of system identifier via initdb. The first seems like a better idea, the file could double as a place to store additional encryption parameters, like key length or different cipher primitive.

Regards,
Ants Aasma

diff --git a/contrib/pgcrypto/Makefile b/contrib/pgcrypto/Makefile
index 18bad1a..04ce887 100644
--- a/contrib/pgcrypto/Makefile
+++ b/contrib/pgcrypto/Makefile
@@ -20,7 +20,7 @@ SRCS = pgcrypto.c px.c px-hmac.c px-crypt.c \
 	mbuf.c pgp.c pgp-armor.c pgp-cfb.c pgp-compress.c \
 	pgp-decrypt.c pgp-encrypt.c pgp-info.c pgp-mpi.c \
 	pgp-pubdec.c pgp-pubenc.c pgp-pubkey.c pgp-s2k.c \
-	pgp-pgsql.c
+	pgp-pgsql
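For reference, the narrow API described under Design above — one parameterless key setup hook plus tweaked encrypt/decrypt of a block — could be registered through a struct shaped roughly like this. All names are hypothetical, and a XOR stand-in replaces the real AES-XTS purely to keep the sketch self-contained:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical registration struct an encryption_library would fill in. */
typedef struct
{
    void (*setup_key)(void);
    void (*encrypt_block)(uint8_t *buf, size_t len, uint64_t tweak);
    void (*decrypt_block)(uint8_t *buf, size_t len, uint64_t tweak);
} EncryptionRoutines;

static void
demo_setup_key(void)
{
    /* A real library would e.g. read PGENCRYPTIONKEY and run a KDF here. */
}

/* Toy tweaked "cipher": XOR with tweak bytes. The tweak makes identical
 * plaintext blocks encrypt differently depending on their location, which
 * is the property the API requires; XTS provides it for real. */
static void
demo_xor(uint8_t *buf, size_t len, uint64_t tweak)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= (uint8_t) (tweak >> ((i % 8) * 8)) ^ 0x5a;
}

static EncryptionRoutines demo = {
    demo_setup_key,
    demo_xor,   /* XOR is its own inverse, so it serves as both */
    demo_xor
};
```

The shared library's init function would hand such a struct to the server before shared_preload_libraries processing, matching the load order described above.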
Re: [HACKERS] Does people favor to have matrix data type?
On Wed, May 25, 2016 at 10:38 AM, Simon Riggs wrote: > On 25 May 2016 at 03:52, Kouhei Kaigai wrote: >> >> In a few days, I'm working for a data type that represents matrix in >> mathematical area. Does people favor to have this data type in the core, >> not only my extension? > > > If we understood the use case, it might help understand whether to include > it or not. > > Multi-dimensionality of arrays isn't always useful, so this could be good. Many natural language and image processing methods extract feature vectors that then use some simple distance metric, like dot product to calculate vector similarity. For example we presented a latent semantic analysis prototype at pgconf.eu 2015 that used real[] to store the features and a dotproduct(real[], real[]) real function to do similarity matching. However using real[] instead of a hypothetical realvector or realmatrix did not prove to be a huge overhead, so overall I'm on the fence for the usefulness of a special type. Maybe a helper function or two to validate the additional restrictions in a check constraint would be enough.

Regards,
Ants Aasma
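For what it's worth, the C-level kernel behind a dotproduct(real[], real[]) function like the one in our prototype is tiny — something along these lines (a sketch, not the actual prototype code; the real SQL-callable version additionally has to unpack the array headers and check dimensions):

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two float4 feature vectors of equal length, the
 * similarity metric mentioned above. */
static float
dotproduct(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;

    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

The loop is simple enough that a compiler will typically auto-vectorize it, which is part of why the plain real[] representation was not a huge overhead in practice.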
Re: [HACKERS] Is the unfair lwlock behavior intended?
On Tue, May 24, 2016 at 9:03 AM, Tsunakawa, Takayuki wrote: > I encountered a strange behavior of lightweight lock in PostgreSQL 9.2. That > appears to apply to 9.6, too, as far as I examine the code. Could you tell > me if the behavior is intended or needs fix? > > Simply put, the unfair behavior is that waiters for exclusive mode are > overtaken by share-mode lockers who arrive later. 9.5 had significant LWLock scalability improvements. This might improve performance enough so that exclusive lockers don't get completely starved. It would be helpful if you could test if it's still possible to trigger starvation with the new code.

Regards,
Ants Aasma
Re: [HACKERS] Is the unfair lwlock behavior intended?
On Tue, May 24, 2016 at 1:29 PM, Alexander Korotkov wrote: > On Tue, May 24, 2016 at 9:03 AM, Tsunakawa, Takayuki > wrote: >> Is this intentional, or should we make the later share-lockers if someone >> is in the wait queue? > > I've already observed such behavior, see [1]. I think that now there is no > consensus on how to fix that. For instance, Andres express opinion that > this shouldn't be fixed from LWLock side [2]. > FYI, I'm planning to pickup work on CSN patch [3] for 10.0. CSN should fix > various scalability issues including high ProcArrayLock contention. Some amount of non-fairness is ok, but degrading to the point of complete denial of service is not very graceful. I don't think it's realistic to hope that all lwlock contention issues will be fixed any time soon. Some fallback mechanism would be extremely nice until then. It seems to me that we could fix complete starvation with no noticeable effect on common cases. The general idea would be to have a separate flag bit for starving exclusive lockers. When lock is taken in shared mode an exclusive locker puts itself on the wait queue like it does now, and then proceeds to wait on its semaphore. But differently from now, the semaphore wait has a (configurable?) timeout after which the exclusive locker wakes up and sets the new LW_FLAG_STARVING flag. When this flag is set new shared lockers will now consider the lock taken and will queue up to wait. The exclusive locker that marked itself as starving will get a chance to run once the existing shared lockers complete and wake it up. If there are other nearly starved exclusive lockers they are likely to be at the head of the wait queue and will also get a chance to run. If the wait queue is empty or has a shared locker at the head the starvation flag gets reset and all of the queued up shared lockers get their turn. 
With the proposed scheme non-starved cases will execute pretty much exactly what they do now, except for a slightly different check for shared lock availability. In starvation scenarios the exclusive lockers will get a chance to run once per starvation grace period. That might still not be ideal or "fair", but it is a lot better than the status quo of indefinitely blocking. PS: if/when you are picking up the CSN work, ping me to write up some of the insights and ideas I have had while thinking about this topic.

Regards,
Ants Aasma
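A single-threaded sketch of the proposed behavior, stripped of all the real atomics and wait-queue machinery (LW_FLAG_STARVING and every name here is a proposal, not existing code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an LWLock with the proposed starvation flag. */
typedef struct
{
    int  shared_holders;
    bool starving;      /* the proposed LW_FLAG_STARVING bit */
} LWLockSketch;

/* New shared lockers treat the lock as taken while the flag is set,
 * so they queue up behind the starving exclusive waiter instead of
 * piggybacking on the current shared holders. */
static bool
shared_lock_try(LWLockSketch *lock)
{
    if (lock->starving)
        return false;           /* caller goes to the wait queue */
    lock->shared_holders++;
    return true;
}

/* Called by an exclusive waiter whose semaphore wait timed out after
 * the (configurable?) starvation grace period. */
static void
mark_starving(LWLockSketch *lock)
{
    lock->starving = true;
}
```

Existing shared holders are unaffected by the flag; once they release, the exclusive waiter at the head of the queue runs, and the flag is reset when the queue head is a shared locker.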
Re: [HACKERS] asynchronous and vectorized execution
On Wed, May 11, 2016 at 3:52 AM, Andres Freund wrote: > On 2016-05-11 03:20:12 +0300, Ants Aasma wrote: >> On Tue, May 10, 2016 at 7:56 PM, Robert Haas wrote: >> > On Mon, May 9, 2016 at 8:34 PM, David Rowley >> > wrote: >> > I don't have any at the moment, but I'm not keen on hundreds of new >> > vector functions that can all have bugs or behavior differences versus >> > the unvectorized versions of the same code. That's a substantial tax >> > on future development. I think it's important to understand what >> > sorts of queries we are targeting here. KaiGai's GPU-acceleration >> > stuff does great on queries with complex WHERE clauses, but most >> > people don't care not only because it's out-of-core but because who >> > actually looks for the records where (a + b) % c > (d + e) * f / g? >> > This seems like it has the same issue. If we can speed up common >> > queries people are actually likely to run, OK, that's interesting. >> >> I have seen pretty complex expressions in the projection and >> aggregation. Couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f >> END) type of expressions. In critical places had to replace them with >> a C coded function that processed a row at a time to avoid the >> executor dispatch overhead. > > I've seen that as well, but Was it the actual fmgr indirection causing > the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al? I don't remember what the exact profile looked like, but IIRC it was mostly Exec* stuff with advance_aggregates also up there.

Regards,
Ants Aasma
Re: [HACKERS] asynchronous and vectorized execution
On Tue, May 10, 2016 at 7:56 PM, Robert Haas wrote: > On Mon, May 9, 2016 at 8:34 PM, David Rowley > wrote: > I don't have any at the moment, but I'm not keen on hundreds of new > vector functions that can all have bugs or behavior differences versus > the unvectorized versions of the same code. That's a substantial tax > on future development. I think it's important to understand what > sorts of queries we are targeting here. KaiGai's GPU-acceleration > stuff does great on queries with complex WHERE clauses, but most > people don't care not only because it's out-of-core but because who > actually looks for the records where (a + b) % c > (d + e) * f / g? > This seems like it has the same issue. If we can speed up common > queries people are actually likely to run, OK, that's interesting. I have seen pretty complex expressions in the projection and aggregation. Couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f END) type of expressions. In critical places had to replace them with a C coded function that processed a row at a time to avoid the executor dispatch overhead. > By the way, I think KaiGai's GPU-acceleration stuff points to another > pitfall here. There's other stuff somebody might legitimately want to > do that requires another copy of each function. For example, run-time > code generation likely needs that (a function to tell the code > generator what to generate for each of our functions), and > GPU-acceleration probably does, too. If fixing a bug in numeric_lt > requires changing not only the regular version and the vectorized > version but also the GPU-accelerated version and the codegen version, > Tom and Dean are going to kill us. And justifiably so! Granted, > nobody is proposing those other features in core right now, but > they're totally reasonable things to want to do. My thoughts in this area have been circling around getting LLVM to do the heavy lifting. LLVM/clang could compile existing C functions to IR and bundle those with the DB. 
At query planning time or maybe even during execution the functions can be inlined into the compiled query plan, LLVM can then be coaxed to copy propagate, constant fold and dead code eliminate the bejeezus out of the expression tree. This way duplication of the specialized code can be kept to a minimum while at least the common cases can completely avoid the fmgr overhead. This approach would also mesh together fine with batching. Given suitably regular data structures and simple functions, LLVM will be able to vectorize the code. If not it will still end up with a nice tight loop that is an order of magnitude or two faster than the current executor. The first cut could take care of ExecQual, ExecTargetList and friends. Later improvements could let execution nodes provide basic blocks that would then be threaded together into the main execution loop. If some node does not implement the basic block interface a default implementation is used that calls the current interface. It gets a bit handwavy at this point, but the main idea would be to enable data marshaling so that values can be routed directly to the code that needs them without being written to intermediate storage. > I suspect the number of queries that are being hurt by fmgr overhead > is really large, and I think it would be nice to attack that problem > more directly. It's a bit hard to discuss what's worthwhile in the > abstract, without performance numbers, but when you vectorize, how > much is the benefit from using SIMD instructions and how much is the > benefit from just not going through the fmgr every time? My feeling is the same, fmgr overhead and data marshaling, dynamic dispatch through the executor is the big issue. This is corroborated by what I have seen found by other VM implementations. Once you get the data into an uniform format where vectorized execution could be used, the CPU execution resources are no longer the bottleneck. 
Memory bandwidth gets in the way, unless each input value is used in multiple calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma
Re: [HACKERS] Reviewing freeze map code
On Mon, May 9, 2016 at 10:53 PM, Robert Haas wrote: > On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada > wrote: >> Attached draft patch adds SCANALL option to VACUUM in order to scan >> all pages forcibly while ignoring visibility map information. >> The option name is SCANALL for now but we could change it after got >> consensus. > > If we're going to go that way, I'd say it should be scan_all rather > than scanall. Makes it clearer, at least IMHO. Just to add some diversity to opinions, maybe there should be a separate command for performing integrity checks. Currently the best ways to actually verify database correctness do so as a side effect. The question that I get pretty much every time after I explain why we have data checksums, is "how do I check that they are correct" and we don't have a nice answer for that now. We could also use some ways to sniff out corrupted rows that don't involve crashing the server in a loop. Vacuuming pages that supposedly don't need vacuuming just to verify integrity seems very much in the same vein. I know right now isn't exactly the best time to hastily slap on such a feature, but I just wanted the thought to be out there for consideration.

Regards,
Ants Aasma
Re: [HACKERS] what to revert
On 5 May 2016 at 6:14 AM, "Andres Freund" wrote: > > On 2016-05-05 06:08:39 +0300, Ants Aasma wrote: > > On 5 May 2016 1:28 a.m., "Andres Freund" wrote: > > > On 2016-05-04 18:22:27 -0400, Robert Haas wrote: > > > > How would the semantics change? > > > > > > Right now the time for computing the snapshot is relevant, if > > > maintenance of xids is moved, it'll likely be tied to the time xids are > > > assigned. That seems perfectly alright, but it'll change behaviour. > > > > FWIW moving the maintenance to a clock tick process will not change user > > visible semantics in any significant way. The change could easily be made > > in the next release. > > I'm not convinced of that - right now the timeout is computed as a > offset to the time a snapshot with a certain xmin horizon is > taken. Moving the computation to GetNewTransactionId() or a clock tick > process will make it relative to the time an xid has been generated > (minus a fuzz factor). That'll behave differently in a number of cases, no? The timeout is calculated in TransactionIdLimitedForOldSnapshots() as GetSnapshotCurrentTimestamp() minus old_snapshot_threshold, rounded down to the previous minute. If the highest xmin seen in that minute is useful in raising the cleanup xmin, the threshold is bumped. Snapshots switch behavior when their whenTaken is past this threshold. Xid generation time has no effect on this timing; it's based entirely on the passage of time. The needed invariant is that no snapshot having whenTaken after the timeout timestamp can have an xmin less than the calculated bound. Moving the xmin calculation and storage to a clock tick actually makes the bound tighter. The ordering between xmin calculation and clock update/read needs to be correct to ensure the invariant.

Regards,
Ants Aasma
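The bound computation described above, as a standalone sketch: current timestamp minus the threshold, rounded down to the previous minute boundary. This is a simplification of what TransactionIdLimitedForOldSnapshots() actually does, with invented names and ignoring negative timestamps:

```c
#include <assert.h>
#include <stdint.h>

#define USECS_PER_MINUTE INT64_C(60000000)

/* Returns the whenTaken bound (in microseconds) past which snapshots
 * switch to limited-validity behavior: now minus the threshold in
 * minutes, truncated to a minute boundary. */
static int64_t
old_snapshot_bound(int64_t now_usec, int threshold_min)
{
    int64_t bound = now_usec - (int64_t) threshold_min * USECS_PER_MINUTE;

    return bound - (bound % USECS_PER_MINUTE);  /* round down to minute */
}
```

The minute granularity is what makes per-minute xmin tracking sufficient: all snapshots taken within the same minute share the same bound.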
Re: [HACKERS] what to revert
On 5 May 2016 1:28 a.m., "Andres Freund" wrote: > On 2016-05-04 18:22:27 -0400, Robert Haas wrote: > > How would the semantics change? > > Right now the time for computing the snapshot is relevant, if > maintenance of xids is moved, it'll likely be tied to the time xids are > assigned. That seems perfectly alright, but it'll change behaviour. FWIW moving the maintenance to a clock tick process will not change user visible semantics in any significant way. The change could easily be made in the next release. Regards, Ants Aasma
Re: [HACKERS] what to revert
On Tue, May 3, 2016 at 9:57 PM, Tomas Vondra wrote:
> If you tell me how to best test it, I do have a 4-socket server sitting idly
> in the corner (well, a corner reachable by SSH). I can get us some numbers,
> but I haven't been following the snapshot_too_old so I'll need some guidance
> on what to test.

I worry about two contention points with the current implementation. The main one is the locking within MaintainOldSnapshotTimeMapping() that gets called every time a snapshot is taken. AFAICS this should show up by setting old_snapshot_threshold to any positive value and then running a simple within-shared-buffers scale factor read-only pgbench at high concurrency (number of CPUs or a small multiple). On a single socket system this does not show up.

The second one is probably a bit harder to hit: GetOldSnapshotThresholdTimestamp() has a spinlock that gets hit every time a scan sees a page that has been modified after the snapshot was taken. A workload that would tickle this is something that uses a repeatable read snapshot, builds a non-temporary table and runs reporting on it. Something like this would work:

BEGIN ISOLATION LEVEL REPEATABLE READ;
DROP TABLE IF EXISTS test_:client_id;
CREATE TABLE test_:client_id (x int, filler text);
INSERT INTO test_:client_id SELECT x, repeat(' ', 1000) AS filler FROM generate_series(1,1000) x;
SELECT (SELECT COUNT(*) FROM test_:client_id WHERE x != y) FROM generate_series(1,1000) y;
COMMIT;

With this script running with -c4 on a 4 core workstation I'm seeing the following kind of contention and a >2x loss in throughput:

+ 14.77%  postgres  postgres  [.] GetOldSnapshotThresholdTimestamp
-  8.01%  postgres  postgres  [.] s_lock
   - s_lock
      + 88.15% GetOldSnapshotThresholdTimestamp
      + 10.47% TransactionIdLimitedForOldSnapshots
      +  0.71% TestForOldSnapshot_impl
      +  0.57% GetSnapshotCurrentTimestamp

Now this is kind of an extreme example, but I'm willing to bet that on multi socket hosts similar issues can crop up with common real world use cases.
Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold
On Mon, May 2, 2016 at 5:21 PM, Andres Freund wrote: > On 2016-05-02 09:03:19 -0400, Robert Haas wrote: >> On Fri, Apr 29, 2016 at 6:08 PM, Kevin Grittner wrote: >> > Now to continue with the performance benchmarks. I'm pretty sure >> > we've fixed the problems when the feature is disabled >> > (old_snapshot_threshold = -1), and there are several suggestions >> > for improving performance while it is on that need to be compared >> > and benchmarked. >> >> If anyone thinks that the issue with the feature disabled is NOT >> fixed, please speak up! I'm moving the corresponding open item to >> CLOSE_WAIT status, meaning that it will be closed if nobody shows up >> to say that there is still an issue. > > Well, I don't agree that the feature is in a releaseable state. The > datastructure is pretty much non-scalable, and maintained on the wrong > side (every read, instead of once in writing writing xacts). There's no > proposal actually addressing the scalability issues. Unless I'm missing something fundamental the feature only requires tracking an upper bound on xmin observed by snapshots between clock ticks. The simplest way to do this would be a periodic process that increments a clock counter (32bit counter would be plenty) and then calculates xmin for the preceding range. With this scheme GetSnapshotData would need two atomic fetches to get current LSN and the timestamp. Test for old snapshot can also run completely lock free with a single atomic fetch of threshold timestamp. The negative side is that we need to have a process running that runs the clock ticks and the ticks may sometimes be late. Giving something like autovacuum launcher this task doesn't seem too bad and the consequence of falling behind is just delayed timing out of old snapshots. As far as I can see this approach would get rid of any scalability issues, but it is a pretty significant change and requires 64bit atomic reads to get rid of contention on xlog insert lock. 
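A toy model of the clock-tick scheme sketched in the message above (all names invented; a real version would use shared memory and atomic reads rather than plain fields):

```python
# Illustrative sketch: a background "clock keeper" (e.g. the autovacuum
# launcher) advances a tick counter and publishes the xmin bound for the
# range that just ended. Snapshot takers only record an upper bound on
# xmin; the test for an old snapshot is a single lock-free read.

class ClockKeeper:
    def __init__(self):
        self.tick = 0            # 32-bit counter in the described scheme
        self.threshold_xmin = 0  # published bound, read without locking
        self._pending_xmin = 0   # highest xmin seen in the current tick

    def observe_xmin(self, xmin):
        # called when taking a snapshot; an atomic max in the real scheme
        self._pending_xmin = max(self._pending_xmin, xmin)

    def on_tick(self):
        # run periodically by the clock keeper; a late tick only delays
        # the timing out of old snapshots, it never breaks correctness
        self.threshold_xmin = self._pending_xmin
        self.tick += 1

    def snapshot_is_old(self, snapshot_xmin):
        # test for old snapshot: one read of the published threshold
        return snapshot_xmin < self.threshold_xmin
```

The point of the shape is that neither observe_xmin() nor snapshot_is_old() needs the exclusive locking that MaintainOldSnapshotTimeMapping() takes today.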
Regards, Ants Aasma
Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold
On Thu, Apr 21, 2016 at 5:16 PM, Kevin Grittner wrote:
> On Wed, Apr 20, 2016 at 8:08 PM, Ants Aasma wrote:
>
>> However, while checking out if my proof of concept patch actually
>> works I hit another issue. I couldn't get my test for the feature to
>> actually work. The test script I used is attached.
>
> Could you provide enough to make that a self-contained reproducible
> test case (i.e., that I don't need to infer or re-write any steps
> or guess how to call it)? In previous cases people have given me
> where they felt that the feature wasn't working there have
> been valid reasons for it to behave as it was (e.g., a transaction
> with a transaction ID and an xmin which prevented cleanup from
> advancing). I'll be happy to look at your case and see whether
> it's another such case or some bug, but it seems a waste to reverse
> engineer or rewrite parts of the test case to do so.

Just to be sure I didn't have anything screwy in my build environment I redid the test on a freshly installed Fedora 23 VM. Steps to reproduce:

1. Build postgresql from git. I used ./configure --enable-debug --enable-cassert --prefix=/home/ants/pg-master

2. Set up the database:

cat << EOF > test-settings.conf
old_snapshot_threshold = 1min
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql.log'
log_line_prefix = '[%m] '
log_autovacuum_min_duration = 0
EOF
pg-master/bin/initdb data/
cat test-settings.conf >> data/postgresql.conf
pg-master/bin/pg_ctl -D data/ start
pg-master/bin/createdb

3. Install python-psycopg2 and get the test script from my earlier e-mail [1]

4. Run the test: python test_oldsnapshot.py "host=/tmp"

5. Observe that the table keeps growing even after the old snapshot threshold is exceeded and autovacuum has run. The autovacuum log shows 0 tuples removed.
Only the write workload has an xid assigned; the other two backends only have a snapshot held:

[ants@localhost ~]$ pg-master/bin/psql -c "SELECT application_name, backend_xid, backend_xmin, NOW()-xact_start AS tx_age, state FROM pg_stat_activity"
   application_name   | backend_xid | backend_xmin |     tx_age      |        state
----------------------+-------------+--------------+-----------------+---------------------
 write-workload       |       95637 |              | 00:00:00.009314 | active
 long-unrelated-query |             |         1806 | 00:04:33.914048 | active
 interfering-query    |             |         2444 | 00:04:32.910742 | idle in transaction
 psql                 |             |        95637 | 00:00:00        | active

Output from the test tool attached. After killing the test tool and the long running query autovacuum cleans stuff up as expected.

I'm too tired right now to chase this down myself. The mental toll that two small kids can take is pretty staggering. But I might find the time to fire up a debugger sometime tomorrow.

Regards, Ants Aasma

[1] http://www.postgresql.org/message-id/attachment/43859/test_oldsnapshot.py

[ants@localhost ~]$ python test_oldsnapshot.py "host=/tmp"
[21:37:56] Starting 1800s long query
[21:37:56] old_snapshot_threshold = 1min
[21:37:56] High throughput table size @   0s. Size  176kB Last vacuum None ago
[21:37:57] Counted 1000 rows with max 1637 in high_throughput table
[21:38:06] High throughput table size @  10s. Size  952kB Last vacuum None ago
[21:38:07] Counted 1000 rows with max 1637 in high_throughput table
[21:38:16] High throughput table size @  20s. Size 1768kB Last vacuum None ago
[21:38:17] Counted 1000 rows with max 1637 in high_throughput table
[21:38:26] High throughput table size @  30s. Size 2640kB Last vacuum None ago
[21:38:27] Counted 1000 rows with max 1637 in high_throughput table
[21:38:36] High throughput table size @  40s. Size 3488kB Last vacuum None ago
[21:38:37] Counted 1000 rows with max 1637 in high_throughput table
[21:38:46] High throughput table size @  50s. Size 4328kB Last vacuum 0:00:02.339652 ago
[21:38:47] Counted 1000 rows with max 1637 in high_throughput table
[21:38:56] High throughput table size @  60s. Size 4832kB Last vacuum 0:00:12.342278 ago
[21:38:57] Counted 1000 rows with max 1637 in high_throughput table
[21:39:06] High throughput table size @  70s. Size 4920kB Last vacuum 0:00:22.425250 ago
[21:39:07] Counted 1000 rows with max 1637 in high_throughput table
[21:39:16] High throughput table size @  80s. Size 5016kB Last vacuum 0:00:32.431971 ago
[21:39:17] Counted 1000 rows with max 1637 in high_throughput table
[21:39:26] High throughput table size @  90s. Size 5112kB Last vacuum 0:00:42.431730 ago
[21:39:27] Counted 1000 rows with max 1637 in high_throughput table
[21:39:36] High throughput table size @ 100s. Size 5200kB Last vacuum 0:00:52.448369 ago
[21:39:37
Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold
On Tue, Apr 19, 2016 at 6:11 PM, Kevin Grittner wrote: > On Tue, Apr 19, 2016 at 9:57 AM, Amit Kapila wrote: >> On Sun, Apr 17, 2016 at 2:26 AM, Andres Freund wrote: >>> >>> On 2016-04-16 16:44:52 -0400, Noah Misch wrote: >>> > That is more controversial than the potential ~2% regression for >>> > old_snapshot_threshold=-1. Alvaro[2] and Robert[3] are okay releasing >>> > that way, and Andres[4] is not. >>> >>> FWIW, I could be kinda convinced that it's temporarily ok, if there'd be >>> a clear proposal on the table how to solve the scalability issue around >>> MaintainOldSnapshotTimeMapping(). >> >> It seems that for read-only workloads, MaintainOldSnapshotTimeMapping() >> takes EXCLUSIVE LWLock which seems to be a probable reason for a performance >> regression. Now, here the question is do we need to acquire that lock if >> xmin is not changed since the last time value of >> oldSnapshotControl->latest_xmin is updated or xmin is lesser than equal to >> oldSnapshotControl->latest_xmin? >> If we don't need it for above cases, I think it can address the performance >> regression to a good degree for read-only workloads when the feature is >> enabled. > > Thanks, Amit -- I think something along those lines is the right > solution to the scaling issues when the feature is enabled. For > now I'm focusing on the back-patching issues and the performance > regression when the feature is disabled, but I'll shift focus to > this once the "killer" issues are in hand. I had an idea I wanted to test out. The gist of it is to effectively have the last slot of timestamp to xid map stored in the latest_xmin field and only update the mapping when slot boundaries are crossed. See attached WIP patch for details. This way the exclusive lock only needs to be acquired once per minute. The common case is a spinlock that could be replaced with atomics later. 
And it seems to me that the mutex_threshold taken in TestForOldSnapshot() can also get pretty hot under some workloads, so that may also need some tweaking.

I think a better approach would be to base the whole mechanism on a periodically updated counter, instead of timestamps. Autovacuum launcher looks like a good candidate to play the clock keeper, without it the feature has little point anyway. AFAICS only the clock keeper needs to have the timestamp xid mapping, others can make do with a couple of periodically updated values. I haven't worked it out in detail, but it feels like the code would be simpler. But this was a larger change than I felt comfortable trying out, so I went with the simple change first.

However, while checking out if my proof of concept patch actually works I hit another issue. I couldn't get my test for the feature to actually work. The test script I used is attached. Basically I have a table with 1000 rows, one high throughput worker deleting old rows and inserting new ones, one long query that acquires a snapshot and sleeps for 30min, and one worker that has a repeatable read snapshot and periodically does count(*) on the table. Based on the documentation I would expect the following:

* The interfering query gets cancelled
* The long running query gets to run
* Old rows will start to be cleaned up after the threshold expires.

However, testing on commit 9c75e1a36b6b2f3ad9f76ae661f42586c92c6f7c, I'm seeing that the old rows do not get cleaned up, and that I'm only seeing the interfering query get cancelled when old_snapshot_threshold = 0. Larger values do not result in cancellation. Am I doing something wrong or is the feature just not working at all?
Regards, Ants Aasma

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8aa1f49..dc00d91 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -80,8 +80,11 @@ typedef struct OldSnapshotControlData
 	 */
 	slock_t		mutex_current;		/* protect current timestamp */
 	int64		current_timestamp;	/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;	/* protect latest snapshot xmin */
+	slock_t		mutex_latest_xmin;	/* protect latest snapshot xmin
+									 * and next_map_update
+									 */
 	TransactionId latest_xmin;		/* latest snapshot xmin */
+	int64		next_map_update;	/* latest snapshot valid up to */
 	slock_t		mutex_threshold;	/* protect threshold fields */
 	int64		threshold_timestamp;	/* earlier snapshot is old */
 	TransactionId threshold_xid;	/* earlier xid may be gone */
@@ -95,7 +98,9 @@ typedef struct OldSnapshotControlData
  * count_used value of old_snapshot_threshold means that the buffer is
  * full and the head must be advanced to add new entries. Use timestamps
  * aligned to minute boundaries, since that seems less surprising than
- * aligning based on
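The "update only when a slot boundary is crossed" idea from the message above can be modeled roughly like this (invented names; the actual WIP patch works on OldSnapshotControlData in C, with a spinlock on the fast path and the LWLock on the slow one):

```python
# Illustrative sketch: keep the newest map slot in a cheap-to-update
# latest_xmin field, and only append to the timestamp->xid map -- the
# part that needs the exclusive lock -- once per minute.

class TimeXidMap:
    def __init__(self):
        self.map = []              # [(minute, xmin)], needs exclusive lock
        self.latest_minute = None  # "last slot", cheap spinlock in reality
        self.latest_xmin = 0

    def maintain(self, now_minute, xmin):
        # fast path: same minute, just raise latest_xmin
        if self.latest_minute == now_minute:
            self.latest_xmin = max(self.latest_xmin, xmin)
            return
        # slow path, once per minute: flush the finished slot into the
        # map under the exclusive lock, then start a new slot
        if self.latest_minute is not None:
            self.map.append((self.latest_minute, self.latest_xmin))
        self.latest_minute = now_minute
        self.latest_xmin = xmin
```

With this shape the exclusive lock is taken once per minute regardless of how many snapshots are taken, which is the contention reduction claimed in the message.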
Re: [HACKERS] raw output from copy
On 8 Apr 2016 9:14 pm, "Pavel Stehule" wrote: > 2016-04-08 20:54 GMT+02:00 Andrew Dunstan : >> I should add that I've been thinking about this some more, and that I now >> agree that something should be done to support this at the SQL level, mainly >> so that clients can manage very large pieces of data in a stream-oriented >> fashion rather than having to marshall the data in memory to load/unload via >> INSERT/SELECT. Anything that is client-side only is likely to have this >> memory issue. >> >> At the same time I'm still not entirely convinced that COPY is a good >> vehicle for this. It's designed for bulk records, and already quite complex. >> Maybe we need something new that uses the COPY protocol but is more >> specifically tailored for loading or sending large singleton pieces of data. > > > Now it is little bit more time to think more about. But It is hard to design > some more simpler than is COPY syntax. What will support both directions. Sorry for arriving late and adding to the bikeshedding. Maybe the answer is to make COPY pluggable. It seems to me that it would be relatively straightforward to add an extension mechanism for copy output and input plugins that could support any format expressible as a binary stream. Raw output would then be an almost trivial plugin. Others could implement JSON, protocol buffers, Redis bulk load, BSON, ASN.1 or whatever else serialisation format du jour. It will still have the same backwards compatibility issues as adding the raw output, but the payoff is greater. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On Sat, Dec 12, 2015 at 12:41 AM, Greg Stark wrote:
> On Wed, Dec 9, 2015 at 2:44 AM, Peter Geoghegan wrote:
>> On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark wrote:
>>
>> I guess you mean insertion sort. What's the theoretical justification
>> for the change?
>
> Well my thinking was that hard coding a series of comparisons would be
> faster than a loop doing a O(n^2) algorithm even for small constants.
> And sort networks are perfect for hard coded sorts because they do the
> same comparisons regardless of the results of previous comparisons so
> there are no branches. And even better the comparisons are as much as
> possible independent of each other -- sort networks are typically
> measured by the depth which assumes any comparisons between disjoint
> pairs can be done in parallel. Even if it's implemented in serial the
> processor is probably parallelizing some of the work.
>
> So I implemented a quick benchmark outside Postgres based on sorting
> actual SortTuples with datum1 defined to be random 64-bit integers (no
> nulls). Indeed the sort networks perform faster on average despite
> doing more comparisons. That makes me think the cpu is indeed doing
> some of the work in parallel.

The open coded version you shared bloats the code by 37kB; I'm not sure it is pulling its weight, especially given our relatively heavy comparators. A quick index creation test on int4's profiled with perf shows about 3% of CPU being spent in the code being replaced. Any improvement on that is going to be too small to easily quantify.

The open coding doesn't help with eliminating control flow dependencies, so my idea is to encode the sort network comparison order in an array and use that to drive a simple loop. The code size would be pretty similar to insertion sort and the loop overhead should mostly be hidden by the CPU OoO machinery. Probably won't help much, but it would be interesting and simple enough to try out. Can you share your code for the benchmark so I can try it out?
Regards, Ants Aasma
Re: [HACKERS] W-TinyLfu for cache eviction
On Wed, Dec 9, 2015 at 11:31 AM, Konstantin Knizhnik wrote: > I expect synchronization issues with implementation of this algorithm. It > seems to be hard to avoid some global critical section which can cause > significant performance degradation at MPP systems (see topic "Move > PinBuffer and UnpinBuffer to atomics"). The sketch updates don't really need to be globally consistent, anything that is faster than cycling of buffers through the LRU tier would be enough. In principle atomic increments could be used, but funneling all buffer hits onto a small amount of cachelines and then doing 4x the amount of writes would result in extremely nasty cache line ping-ponging. Caffeine implementation appears to solve this by batching the frequency sketch updates using a striped set of circular buffers. Re-implementing this is not exactly trivial and some prior experience with lock free programming seems well advised. Based on the paper it looks like this algorithm can work with a low quality primary eviction algorithm. Swapping our GCLOCK out for something simpler like plain CLOCK could simplify an atomics based PinBuffer approach. Another interesting avenue for research would be to use ideas in TinyLFU to implement a tier of "nailed" buffers that have backend local or striped pin-unpin accounting. But for checking if the replacement policy implemented by W-TinyLFU is good it isn't necessary to have a performant locking solution. I think a global lock would be good enough for a proof of concept that only evaluates cache hit ratios. Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
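For reference, the frequency sketch at the heart of TinyLFU is essentially a count-min sketch. A minimal model (omitting the 4-bit counters, periodic aging, and the striped update buffers Caffeine uses to avoid the cache line ping-ponging discussed above) could look like:

```python
# Toy count-min frequency sketch: 4 rows of counters, record() bumps one
# counter per row, estimate() takes the minimum over the rows -- so the
# estimate can only overestimate the true frequency, never underestimate.
import hashlib

class FreqSketch:
    ROWS = 4

    def __init__(self, width):
        self.width = width
        self.table = [[0] * width for _ in range(self.ROWS)]

    def _slots(self, key):
        # derive one counter index per row from a single hash of the key
        h = hashlib.blake2b(repr(key).encode()).digest()
        return [int.from_bytes(h[i * 4:(i + 1) * 4], "little") % self.width
                for i in range(self.ROWS)]

    def record(self, key):
        for row, slot in enumerate(self._slots(key)):
            self.table[row][slot] += 1

    def estimate(self, key):
        return min(self.table[row][slot]
                   for row, slot in enumerate(self._slots(key)))
```

A cache-hit-ratio-only proof of concept, as suggested above, could update such a sketch on every buffer access under a single global lock and use the estimates to drive admission decisions against the eviction candidate.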
Re: [HACKERS] Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)
On Tue, Nov 24, 2015 at 2:25 PM, Greg Stark wrote: > On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan wrote: >> * How do other people feel about this? Personally, I've seen enough >> problems of this kind in the field that "slippery slope" arguments >> against this don't seem very compelling. +1 for changing jumbling logic to consider any number of constants as identical. > I have also seen code where I would have needed *not* to have this > jumbling though. I think this is a general problem with jumbling that > there needs to be some kind of intelligence that deduces when it's > important to break out the statistics by constant. In my case it was > an IN query where specific values were very common but others very > rare. Partial indexes ended up being the solution and we had to > identify which partial indexes were needed. This is an issue with jumbling in general and very vaguely related to merging the number of constants in IN. I think a better solution for this type of issue would be a way to sample the parameter values and possibly EXPLAIN ANALYZE results to logs. Having runtime intelligence in PostgreSQL that would be capable of distinguishing interesting variations from irrelevant doesn't seem like a feasible plan. In my view the best we could do is to aim to have entries roughly correspond to application query invocation points and leave the more complex statistical analysis use cases to more specialized tools. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelism and sorting
On Tue, Nov 24, 2015 at 5:29 AM, Robert Haas wrote: > On Mon, Nov 23, 2015 at 8:45 PM, Peter Geoghegan wrote: >> Beyond that, CREATE INDEX and CLUSTER utility >> cases will also need to be parallelized without all this executor >> infrastructure. > > Or, alternatively, CREATE INDEX and CLUSTER could be refactored to use > the executor. This is might sound crazy, but maybe it's not. Perhaps > we could have the executor tree output correctly-formed index tuples > that get funneled into a new kind of DestReceiver that puts them into > the index. I don't know if that's a GOOD idea, but it's an idea. Having CREATE INDEX use the executor seems like a useful idea for reasons unrelated to parallelism. The use case I have in mind is a table containing multiple years worth of (approximately) time series data, where overwhelming majority of queries are explicitly interested in recent data. Having a partial index with WHERE tstamp > $some_recent_tstamp cutting out 90+% of tuples was extremely helpful for performance for both index size reasons and having to process less tuples. This index needs to be periodically rebuilt with a newer timestamp constant, and the rebuild would be a lot faster if it could use the existing index to perform an index only scan of 10% of data instead of scanning and sorting the full table. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data
On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.

+1. I have seen a lot of interest for something along these lines.

>> I'm thinking the client should get some kind of a token back from the
>> commit, and it could use the token on the standby, to wait for that commit
>> to be applied. The token could be just the XID, or the LSN of the commit
>> record. Or the application could generate the token and pass it to the
>> server in the commit, similar to how 2PC works. So the interaction would be
>> something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...

To avoid read anomalies (backwards time travel) it should also be possible to receive a token from read-only transactions based on the latest snapshot used.

> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration. (Note also that it's not a requirement to tax every write;
> with this proposal you can set causal_reads to off for those transactions
> where you know there is no possibility of a causally dependent read).
>
> As for (1), my thinking was that most application developers would probably
> prefer not to have to deal with that type of interface. For users who do
> want to do that, it would be comparatively simple to make that possible, and
> would not conflict with this proposal. This proposal could be used by
> people retrofitting load balancing to existing applications with relative
> ease, or simply not wanting to have to deal with LSNs and complexity. (I
> have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> separately, which could be called on a standby with the lsn obtained from
> pg_current_xlog_location() on the primary any time after a COMMIT completes,
> but I was thinking of that as a different feature addressing a different
> user base: people prepared to do more work to squeeze out some extra
> performance.)

Although I still think that (1) is the correct long term solution, I must say that I agree with the reasoning presented. I think we should review the API in the light that in the future we might have a mix of clients: some clients that are able to keep track of causality tokens and either want to wait when a read request arrives or pick a host to use based on the token, and then there are "dumb" clients that want to use write side waits.

Also, it should be possible to configure which standbys are considered for waiting on. Otherwise a reporting slave will occasionally catch up enough to be considered "available" and then cause a latency peak when a long query blocks apply again.

Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
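The token-wait interaction discussed in this thread can be modeled in miniature (all names invented; a real implementation would live in the recovery and client-facing code, with the token being an LSN):

```python
# Illustrative sketch: the client obtains an LSN token from a commit on
# the primary and, before reading on a standby, waits until the standby's
# replay position has reached that token -- the wait-on-read variant.
import threading

class Standby:
    def __init__(self):
        self.replay_lsn = 0
        self.cond = threading.Condition()

    def apply_wal(self, lsn):
        # recovery advances the replay position and wakes any waiters
        with self.cond:
            self.replay_lsn = lsn
            self.cond.notify_all()

    def wait_for_token(self, token_lsn, timeout=None):
        # "WAIT FOR COMMIT <token>": block until replay catches up,
        # returning False on timeout instead of risking a stale read
        with self.cond:
            return self.cond.wait_for(
                lambda: self.replay_lsn >= token_lsn, timeout)
```

The write-side alternative (causal_reads) moves the equivalent wait into the committing transaction on the primary, which is the trade-off weighed in the quoted message.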
Re: [HACKERS] Allow "snapshot too old" error, to prevent bloat
On Feb 19, 2015 10:31 PM, "Kevin Grittner" wrote: > > What about having the long running snapshots declare their working > > set, and then only take them into account for global xmin for > > relations that are in the working set? Like a SET TRANSACTION WORKING > > SET command. This way the error is deterministic, vacuum on the high > > churn tables doesn't have to wait for the old transaction delay to > > expire and we avoid a hard to tune GUC (what do I need to set > > old_snapshot_threshold to, to reduce bloat while not having "normal" > > transactions abort). > > Let me make sure I understand your suggestion. You are suggesting > that within a transaction you can declare a list of tables which > should get the "snapshot too old" error (or something like it) if a > page is read which was modified after the snapshot was taken? > Snapshots within that transaction would not constrain the effective > global xmin for purposes of vacuuming those particular tables? Sorry, I should have been clearer. I'm proposing that a transaction can declare what tables it will access. After that the transaction will constrain xmin for only those tables. Accessing any other table would give an error immediately. > If I'm understanding your proposal, that would help some of the > cases the "snapshot too old" case addresses, but it would not > handle the accidental "idle in transaction" case (e.g., start a > repeatable read transaction, run one query, then sit idle > indefinitely). I don't think it would work quite as well for some > of the other cases I've seen, but perhaps it could be made to fit > if we could figure out the accidental "idle in transaction" case. Accidental idle in transaction seems better handled by just terminating transactions older than some timeout. This is already doable externally, but an integrated solution would be nice if we could figure out how the permissions for setting such a timeout work. 
> I also think it might be hard to develop a correct global xmin for
> vacuuming a particular table with that solution. How do you see
> that working?

I'm imagining that the long running transaction would register its xmin in a shared map keyed by relation for each referenced relation, and remove its xmin from the procarray. Vacuum would access this map by relation, determine the minimum, and use that if it's earlier than the global xmin.

I'm being deliberately vague here about the data structure in shared memory as I don't have a great idea what to use there. It's somewhat similar to the lock table in that in theory the size is unbounded, but in practice it's expected to be relatively tiny.

Regards, Ants Aasma
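A toy model of the shared map being described (invented names, ignoring all shared-memory and locking concerns):

```python
# Illustrative sketch: long running transactions register a per-relation
# xmin instead of holding back the global xmin; vacuum consults the map
# for the relation it is cleaning and only honors the older xmin there.

class RelationXminMap:
    def __init__(self):
        self.entries = {}  # relation -> {backend_id: xmin}

    def register(self, rel, backend, xmin):
        self.entries.setdefault(rel, {})[backend] = xmin

    def unregister(self, backend):
        # transaction end: drop this backend's entries everywhere
        for xmins in self.entries.values():
            xmins.pop(backend, None)

    def vacuum_xmin(self, rel, global_xmin):
        # vacuum on rel is constrained only by registrations for rel
        xmins = self.entries.get(rel, {})
        return min([global_xmin, *xmins.values()])
```

The payoff, per the message above, is that vacuum on a high churn table not in any long transaction's working set can use the unconstrained global xmin immediately.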
Re: [HACKERS] Allow "snapshot too old" error, to prevent bloat
On Thu, Feb 19, 2015 at 6:01 PM, Kevin Grittner wrote: >> I can see how they would be, provided we can be confident that we're >> going to actually throw an error when the snapshot is out of date and >> not end up returning incorrect results. We need to be darn sure of >> that, both now and in a few years from now when many of us may have >> forgotten about this knob.. ;) > > I get that this is not zero-cost to maintain, and that it might not > be of interest to the community largely for that reason. If you > have any ideas about how to improve that, I'm all ears. But do > consider the actual scope of what was needed for the btree coverage > (above). (Note that the index-only scan issue doesn't look like > anything AM-specific; if something is needed for that it would be > in the pruning/vacuum area.) If I understood the issue correctly, you have long running snapshots accessing one part of the data, while you have high churn on a disjoint part of data. We would need to enable vacuum on the high churn data while still being able to run those long queries. The "snapshot too old" solution limits the maximum age of global xmin at the cost of having to abort transactions that are older than this and happen to touch a page that was modified after they were started. What about having the long running snapshots declare their working set, and then only take them into account for global xmin for relations that are in the working set? Like a SET TRANSACTION WORKING SET command. This way the error is deterministic, vacuum on the high churn tables doesn't have to wait for the old transaction delay to expire and we avoid a hard to tune GUC (what do I need to set old_snapshot_threshold to, to reduce bloat while not having "normal" transactions abort). 
Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
[HACKERS] Using RTLD_DEEPBIND to handle symbol conflicts in loaded libraries
I had to make oracle_fdw work with PostgreSQL compiled using --with-ldap. The issue there is that Oracle's client library has the delightful property of linking against an ldap library they bundle that has symbol conflicts with OpenLDAP. At PostgreSQL startup libldap is loaded, so when libclntsh.so (the Oracle client) is loaded it gets bound to OpenLDAP symbols, and unsurprisingly crashes with a segfault when those functions get used.

glibc 2.3.4+ has a flag called RTLD_DEEPBIND for dlopen that prefers symbols loaded by the library to those provided by the caller. Using this flag fixes my issue: PostgreSQL gets the ldap functions from libldap, the Oracle client gets them from whatever it links to, and both work fine. Attached is a patch that enables this flag on Linux when available.

This specific case could also be fixed by rewriting oracle_fdw to use dlopen for libclntsh.so and pass this flag, but I think it would be better to enable it for all PostgreSQL loaded extension modules. I can't think of a sane use case where it would be correct to prefer PostgreSQL loaded symbols to those the library was actually linked against. Does anybody know of a case where this flag wouldn't be a good idea? Are there any similar options for other platforms? Alternatively, does anyone know of linker flags that would give a similar effect?

Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de

diff --git a/src/backend/port/dynloader/linux.h b/src/backend/port/dynloader/linux.h
index db2ac66..34fd35c 100644
--- a/src/backend/port/dynloader/linux.h
+++ b/src/backend/port/dynloader/linux.h
@@ -25,8 +25,12 @@
 /*
  * In some older systems, the RTLD_NOW flag isn't defined and the mode
  * argument to dlopen must always be 1. The RTLD_GLOBAL flag is wanted
- * if available, but it doesn't exist everywhere.
- * If it doesn't exist, set it to 0 so it has no effect.
+ * if available, but it doesn't exist everywhere. The RTLD_DEEPBIND flag
+ * may also be missing, but if it is available we want to enable it so
+ * extensions we load prefer symbols from libraries they were linked
+ * against to conflicting symbols from unrelated libraries loaded by
+ * PostgreSQL.
+ * If the optional flags don't exist, set them to 0 so they have no effect.
  */
 #ifndef RTLD_NOW
 #define RTLD_NOW 1
@@ -34,8 +38,11 @@
 #ifndef RTLD_GLOBAL
 #define RTLD_GLOBAL 0
 #endif
+#ifndef RTLD_DEEPBIND
+#define RTLD_DEEPBIND 0
+#endif
 
-#define pg_dlopen(f) dlopen((f), RTLD_NOW | RTLD_GLOBAL)
+#define pg_dlopen(f) dlopen((f), RTLD_NOW | RTLD_GLOBAL | RTLD_DEEPBIND)
 #define pg_dlsym dlsym
 #define pg_dlclose dlclose
 #define pg_dlerror dlerror
Re: [HACKERS] What exactly is our CRC algorithm?
On Fri, Nov 21, 2014 at 12:11 PM, Abhijit Menon-Sen wrote:
> If anyone has other suggestions, I'm all ears.

Do you have a WIP patch I could take a look at and tweak? Maybe there's something about the compiler's code generation that could be improved.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Failback to old master
On Thu, Nov 13, 2014 at 10:05 AM, Heikki Linnakangas wrote:
>> In this case the old master will request recovery from a point after
>> the timeline switch and the new master will reply with an error. So it
>> is safe to try re-adding a crashed master as a slave, but this might fail.
>
> Are you sure it's guaranteed to fail, when the master had some WAL that was
> not streamed before the crash? I'm not 100% sure about that. I thought the
> old master might continue streaming and replaying, if there just happens to
> be a record start at the same point in WAL on both timelines. I think you'll
> get an error at the next checkpoint record, because its timeline ID isn't
> what the old master expects, but it would've started up already.

It seems to me like it's guaranteed. Given slave promotion at x1 and end of xlog on the old master at x2, with x1 < x2, the old master will request streaming at tli1.x2. The WAL sender calls tliSwitchPoint(tli1) to look up x1, finds that x1 < x2 and raises the error "requested starting point %X/%X on timeline %u is not in this server's history". The alignment of x2 on tli2 does not play a role here.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
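The check described above can be sketched in miniature as follows. This is a simplification with hypothetical names (`start_point_in_history`, a bare integer LSN type); the real logic lives in the walsender's start-replication path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Simplified model of the walsender-side sanity check: a request to start
 * streaming on timeline tli1 must not lie beyond tli1's switch point (x1).
 * An old master whose WAL ends at x2 > x1 requests tli1.x2 and is refused.
 */
static bool
start_point_in_history(XLogRecPtr request_start, XLogRecPtr switchpoint)
{
    return request_start <= switchpoint;
}
```

In the scenario from the message, request_start is x2 and switchpoint is x1, so with x1 < x2 the request is always refused regardless of how x2 aligns with records on tli2.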
Re: [HACKERS] Failback to old master
On Tue, Nov 11, 2014 at 11:52 PM, Maeldron T. wrote:
> As far as I remember (I can’t test it right now but I am 99% sure) promoting
> the slave makes it impossible to connect the old master to the new one
> without making a base_backup. The reason is the timeline change. It complains.

A safely shut down master (-m fast is safe) can be safely restarted as a slave to the newly promoted master. Fast shutdown closes all normal connections, performs a shutdown checkpoint and then waits for this checkpoint to be replicated to all active streaming clients. Promoting a slave to master creates a timeline switch, which prior to version 9.3 could only be replicated using the archive mechanism. As of version 9.3 you don't need to configure archiving to follow timeline switches: just add a recovery.conf to the old master to start it up as a slave and it will fetch everything it needs from the new master.

In case of an unsafe shutdown (a crash) it is possible that you have WAL lying around that was not streamed out to the slave. In this case the old master will request recovery from a point after the timeline switch and the new master will reply with an error. So it is safe to try re-adding a crashed master as a slave, but this might fail. Success is more likely when the whole operating system went down, as then it's somewhat likely that any WAL got streamed out before it made it to disk.

In general my suggestion is to avoid slave promotion by removal of recovery.conf; it's too easy to get confused and end up with hard-to-diagnose data corruption. In your example, if B happens to disconnect at WAL position x1 and remains disconnected while the shutdown on A occurs at WAL position x2, B will be missing the WAL interval x1..x2. Now B is restarted as master from position x1 and generates some new WAL past x2; then A is restarted as slave and starts streaming at x2, as to the best of its knowledge that was where things left off. At this point the slave A is corrupted: it has the x1..x2 changes that are not on the master and is also missing some changes that are on the master. Wrong data and/or crashes ensue.

Always use the promotion mechanism, because then you are likely to get errors when something is screwy. Unfortunately it's still possible to end up in a corrupted state with no errors, as timeline identifiers are sequential integers, not GUIDs, but at least it's significantly harder.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Oct 15, 2014 7:32 PM, "Ants Aasma" wrote:
> I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
> associativity). This allows us to fit the bucket onto 2 regular sized
> cache lines and have 8 bytes left over. Buckets would be protected by
> seqlocks stored in the extra space. On the read side we would only
> need 2 read barriers (basically free on x86), and we are guaranteed to
> have an answer by fetching two pairs of cache lines. We can trade
> memory bandwidth for latency by issuing prefetches (once we add
> primitives for this). Alternatively we can trade insert speed for
> lookup speed by using asymmetrically sized tables.

I was thinking about this. It came to me that we might not even need BufferTags to be present in the hash table. In the hash table itself we store just a combination of a 4-byte buffer tag hash and a 4-byte buffer id. This way we can store 8 pairs in one cache line. Lookup must account for the fact that due to hash collisions we may find multiple matches, only one of which may be the buffer we are looking for. We can make this condition exceedingly unlikely by taking advantage of the fact that we have two hashes anyway: in each table we use the lookup hash of the other table for tagging (requiring a 32-bit hash collision within 8 items).

By having a reasonably high load factor (75% should work) and 8 bytes per key we even have a pretty good chance of fitting the hotter parts of the buffer lookup table in CPU caches. We use a separate array of locks (spinlocks should be ok here) for fine-grained locking; this may be 1:1 with the buckets or we can map many buckets to a single lock. Otherwise the scheme should work mostly the same. Using this scheme we can get by on the lookup side using 64-bit atomic reads with no barriers; we only need atomic ops for pinning the found buffer.

I haven't had the chance to check out how second-chance hashing works and whether this scheme would be applicable to it.
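The proposed entry layout can be sketched like this. All names here (`LookupEnt`, `LookupBucket`, `bucket_lookup`) are hypothetical, not actual PostgreSQL structs, and the sketch glosses over how empty slots are marked:

```c
#include <stdint.h>

/* One (buffer tag hash, buffer id) pair; 8 of these fill a 64-byte line. */
typedef struct
{
    uint32_t tag_hash;   /* lookup hash of the *other* table, used as the tag */
    uint32_t buffer_id;  /* index into the buffer descriptor array */
} LookupEnt;

typedef struct
{
    LookupEnt ents[8];   /* exactly one 64-byte cache line */
} LookupBucket;

/*
 * Collect candidate buffer ids whose stored hash matches. Because of
 * 32-bit hash collisions there may be more than one candidate; the caller
 * must verify the full BufferTag against the buffer header before use.
 */
static int
bucket_lookup(const LookupBucket *b, uint32_t hash, uint32_t candidates[8])
{
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (b->ents[i].tag_hash == hash)
            candidates[n++] = b->ents[i].buffer_id;
    return n;
}
```

Note how the multiple-match case falls out naturally: the lookup returns every colliding pair in the bucket, and correctness relies on the final tag check against the buffer header.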
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas wrote: >> With regard for using a hash table for the buffer mapping lock I'm >> doubtful that any form of separate chaining is the right one. We >> currently have a quite noticeable problem with the number of cache >> misses in the buffer mapping hash (and also the heavyweight lock mgr) - >> if we stay with hashes that's only going to be fundamentally lower than >> dynahash if we change the type of hashing. I've had good, *preliminary*, >> results using an open addressing + linear probing approach. > > I'm very skeptical of open addressing + linear probing. Such hash > tables tend to degrade over time, because you have to leave tombstones > behind to ensure that future scans advance far enough. There's really > no way to recover without rebuilding the whole hash table, and > eventually it degrades to linear search. If we're spending too much > time walking hash chains, I think the solution is to increase the > number of buckets so that the chains get shorter. What about cuckoo hashing? There was a recent paper on how to do fine grained locking with cuckoo hashes. [1] I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way associativity). This allows us to fit the bucket onto 2 regular sized cache lines and have 8 bytes left over. Buckets would be protected by seqlocks stored in the extra space. On the read side we would only need 2 read barriers (basically free on x86), and we are guaranteed to have an answer by fetching two pairs of cache lines. We can trade memory bandwidth for latency by issuing prefetches (once we add primitives for this). Alternatively we can trade insert speed for lookup speed by using asymmetrically sized tables. 
[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
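The seqlock-protected read side described above ("2 read barriers, basically free on x86") can be sketched as follows. This uses GCC `__atomic` builtins as stand-ins for project barrier primitives, and the bucket layout is illustrative only:

```c
#include <stdint.h>
#include <string.h>

#define BUCKET_SLOTS 5

typedef struct
{
    uint32_t seq;                 /* seqlock: odd while a writer is active */
    uint32_t keys[BUCKET_SLOTS];
    uint32_t vals[BUCKET_SLOTS];
} Bucket;

/*
 * Read a bucket consistently using two read barriers: retry if a writer
 * was active when we started (odd seq) or the sequence number changed
 * while we were copying.
 */
static void
bucket_read(const Bucket *b, Bucket *out)
{
    uint32_t v1, v2;

    do
    {
        v1 = __atomic_load_n(&b->seq, __ATOMIC_ACQUIRE);
        memcpy(out, b, sizeof(Bucket));
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        v2 = __atomic_load_n(&b->seq, __ATOMIC_RELAXED);
    } while ((v1 & 1) || v1 != v2);
}
```

On x86 both barriers compile to plain loads (only compiler reordering is inhibited), which is what makes the read side essentially free there.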
Re: [HACKERS] "Value locking" Wiki page
On Wed, Oct 1, 2014 at 2:42 PM, Simon Riggs wrote:
>> GiST supports exclusion constraints. That is one of the main reasons I want
>> to do promise tuples, instead of locking within the indexam: to support this
>> feature with exclusion constraints.
>
> That does sound interesting, but I am concerned the semantics may cause
> issues.
>
> If I go to insert a row for 'UK' and find an existing row for
> 'Europe', do we really want to update the population of Europe to be
> the population of the UK, simply because the UK and Europe have an
> exclusion conflict?
>
> Please give some concrete examples of a business request that might be
> satisfied by such a feature.

The ON CONFLICT UPDATE semantics don't seem particularly useful for exclusion constraints. However, a reasonable business request for exclusion constraints would be to have a "boss mode" for the canonical room reservation example - an INSERT that is guaranteed not to fail, by either deleting conflicting rows or updating them so that the exclusion constraints no longer overlap (e.g. truncate the time intervals) or the rows fail the index predicate (e.g. soft delete). AFAICS this is currently not possible to implement correctly without a retry loop. The hypothetical ON CONFLICT REPLACE and ON CONFLICT UPDATE-AND-THEN-INSERT modes would also make sense in the unique index case.

Not saying that I view this as necessary for the first cut of the feature, just providing an example where it could be useful.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple win would be to have a buffer-tag-specific comparison function. I mean something like the attached patch. This way we avoid the loop counter overhead, can check the blocknum first and possibly have better branch prediction.

Do you have a workload where I could test if this helps alleviate the comparison overhead?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum
+			&& ((uint64*)a)[0] == ((uint64*)b)[0]
+			&& ((uint64*)a)[1] == ((uint64*)b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  * size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Sep 23, 2014 at 8:15 PM, Florian Weimer wrote:
> * Ants Aasma:
>
>> CRC has exactly one hardware implementation in general purpose CPU's
>
> I'm pretty sure that's not true. Many general purpose CPUs have CRC
> circuitry, and there must be some which also expose them as
> instructions.

I must eat my words here: indeed, AMD processors starting from Bulldozer do implement the CRC32 instruction. However, according to Agner Fog, AMD's implementation has a 6 cycle latency and, more importantly, a throughput of one instruction per 6 cycles, while Intel's implementation on all CPUs except the new Atom has 3 cycle latency and 1 instruction/cycle throughput. This means that there is still a significant handicap for AMD platforms, not to mention Power or SPARC with no hardware support. Some ARMs implement CRC32, but I haven't researched what their performance is.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva wrote:
> That's not entirely true. CRC-32C beats pretty much everything with the same
> length quality-wise and has both hardware implementations and highly
> optimized software versions.

For better or for worse, CRC is biased towards detecting all single-bit errors; in exchange, its detection capability for larger errors is slightly diminished. The quality of the other algorithms I mentioned is also very good, while producing uniformly varying output.

CRC has exactly one hardware implementation in general purpose CPUs, and Intel has a patent on the techniques they used to implement it. The fact that AMD hasn't yet implemented this instruction shows that this patent is non-trivial to work around. The hardware CRC is about as fast as xxhash; the highly optimized software CRCs are an order of magnitude slower and require large cache-thrashing lookup tables. If we choose to stay with CRC we must accept that we can only solve the performance issues for Intel CPUs and provide slight alleviation for others.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas wrote:
> I don't mean that we should abandon this patch - compression makes the WAL
> smaller which has all kinds of other benefits, even if it makes the raw TPS
> throughput of the system worse. But I'm just saying that these TPS
> comparisons should be taken with a grain of salt. We probably should
> consider switching to a faster CRC algorithm again, regardless of what we do
> with compression.

CRC is an awfully slow algorithm for checksums; we should consider switching it out for something more modern. CityHash, MurmurHash3 and xxhash look like pretty good candidates, being around an order of magnitude faster than CRC. I'm hoping to investigate substituting the WAL checksum algorithm in 9.5.

Given the room for improvement in this area, I think it would make sense to just short-circuit the CRC calculations when testing this patch, to see whether the performance improvement is due to less data being checksummed.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund wrote:
>> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
>> > locking here? The atomic op will transfer the cacheline exclusively to
>> > the reclaimer's CPU. Even though it will very shortly afterwards be
>> > touched by the pinning backend.
>>
>> Meh. I'm not in favor of adding more funny games with locking unless
>> we can prove they're necessary for performance.
>
> Well, this in theory increases the number of processes touching buffer
> headers regularly. Currently, if you have one read IO intensive backend,
> there's pretty much only one process touching the cachelines. This will
> make it two. I don't think it's unreasonable to try to reduce the
> cacheline pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to be in modified state in the cache of the CPU of the pinning backend. Even taking a look at the refcount will trigger a writeback and demotion to shared state. When the time comes to unpin the buffer, the cacheline must again be promoted to exclusive state, introducing coherency traffic. Not locking the buffer only saves transferring the cacheline back to the pinning backend - not a huge amount of savings.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] postgresql latency & bgwriter not doing its job
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund wrote:
> It's imo quite clearly better to keep it allocated. For one, after
> postmaster started the checkpointer successfully you don't need to be
> worried about later failures to allocate memory if you allocate it once
> (unless the checkpointer FATALs out, which should be exceedingly rare -
> we're catching ERRORs). It's much much more likely to succeed
> initially. Secondly, it's not like there's really that much time where no
> checkpointer is running.

In principle you could do the sort with the full-sized array and then compress it to a list of buffer IDs that need to be written out. This way most of the time you only need a small array, and the large array is only needed for a fraction of a second.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
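The sort-then-compact idea above can be sketched as follows. The names and the flattened sort key are hypothetical simplifications, not the actual checkpointer code:

```c
#include <stdlib.h>

typedef struct
{
    int  buf_id;    /* index of the buffer */
    int  dirty;     /* does it need a checkpoint write? */
    long sort_key;  /* e.g. (tablespace, relation, block) flattened */
} CkptEntry;

static int
cmp_entry(const void *a, const void *b)
{
    long ka = ((const CkptEntry *) a)->sort_key;
    long kb = ((const CkptEntry *) b)->sort_key;
    return (ka > kb) - (ka < kb);
}

/*
 * Sort the full-sized scratch array, then compact it into a small list of
 * buffer ids to write in order; the big array is only needed briefly.
 * Returns the number of ids written into 'out'.
 */
static int
compress_ckpt_list(CkptEntry *all, int n, int *out)
{
    int m = 0;

    qsort(all, n, sizeof(CkptEntry), cmp_entry);
    for (int i = 0; i < n; i++)
        if (all[i].dirty)
            out[m++] = all[i].buf_id;
    return m;
}
```

After the compaction only the small id list must stay allocated for the duration of the (spread) checkpoint.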
Re: [HACKERS] postgresql latency & bgwriter not doing its job
On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane wrote:
> Andres Freund writes:
>> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>>> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
>>> them out in order
>>> (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
>>> The performance impact of that was inconclusive, but one thing that it
>>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>>> for one file, then fsync it, then next file and so on.
>
>> ...
>> So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

The same objection came up last time I tried to push for sorted checkpoints. I did not find any reference to where it caused a loss, nor was I able to come up with a case where writing out in arbitrary order would be better than writing out in file-sequential order. In fact, if we ask for low latency, the OS must keep the backlog small, eliminating any chance of write-combining writes that arrive out of order.

I have a use case where the system continuously loads data into time-partitioned indexed tables; at every checkpoint all of the indexes of the latest partition need to be written out. The only way I could get the write-out to happen with sequential I/O was to set checkpoint_completion_target to zero and ensure the OS cache allows for enough dirty pages to absorb the whole checkpoint. The fsync that followed did obviously nasty things to latency. Whereas sorted checkpoints were able to do sequential I/O with checkpoint spreading and low-latency-tuned OS virtual memory settings.

I can create a benchmark that shows this behavior if you need additional data points beyond pgbench's OLTP workload to convince you that sorting checkpoint writes is a good idea.
I did just come up with a case where plain sorting might cause an issue: if the writes go to different I/O devices, naive sorting will first use one device and then the other, whereas arbitrary writing will load balance between the devices. Assuming that separate tablespaces are used for separate I/O devices, it should be enough to just interleave the writes of each tablespace, weighted by the amount of writes per tablespace.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
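The weighted interleaving above can be sketched as follows: always pick the tablespace whose write progress (written/total) lags the most, so each device receives writes proportionally to its share of the checkpoint. The helper name is hypothetical:

```c
/*
 * Pick the tablespace whose write progress lags the most, so writes
 * interleave across tablespaces in proportion to per-tablespace volume.
 * Returns -1 when every tablespace is fully written.
 */
static int
next_tablespace(const int written[], const int total[], int n)
{
    int    best = -1;
    double best_progress = 2.0;     /* progress is always <= 1.0 */

    for (int i = 0; i < n; i++)
    {
        double p;

        if (written[i] >= total[i])
            continue;               /* this tablespace is done */
        p = (double) written[i] / total[i];
        if (p < best_progress)
        {
            best_progress = p;
            best = i;
        }
    }
    return best;
}
```

Within each tablespace the writes would still go out in sorted (file-sequential) order; only the choice of which tablespace to write next alternates.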
Re: [HACKERS] jsonb format is pessimal for toast compression
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan wrote:
>> I took a quick look and saw that this wouldn't be that easy to get around.
>> As I'd suspected upthread, there are places that do random access into a
>> JEntry array, such as the binary search in findJsonbValueFromContainer().
>> If we have to add up all the preceding lengths to locate the corresponding
>> value part, we lose the performance advantages of binary search. AFAICS
>> that's applied directly to the on-disk representation. I'd thought
>> perhaps there was always a transformation step to build a pointer list,
>> but nope.
>
> It would be interesting to know what the performance hit would be if we
> calculated the offsets/pointers on the fly, especially if we could cache it
> somehow. The main benefit of binary search is in saving on comparisons,
> especially of strings, ISTM, and that could still be available - this would
> just be a bit of extra arithmetic.

I don't think binary search is the main problem here. Objects are usually reasonably sized, while arrays are more likely to be huge. To make matters worse, jsonb -> int (array element access by index) goes from O(1) to O(n).

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
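The cost shift described above in miniature (illustrative only, not the jsonb on-disk format): with stored offsets, finding element i is a single array fetch; with stored lengths, it means summing the i preceding lengths.

```c
/*
 * With per-element lengths stored, locating element i's start requires
 * summing the i preceding lengths: O(i) work instead of the O(1) fetch
 * that stored offsets give.
 */
static int
element_offset(const int lengths[], int i)
{
    int off = 0;

    for (int j = 0; j < i; j++)
        off += lengths[j];
    return off;
}
```

For a huge array, fetching the last element this way touches every preceding length word, which is the jsonb -> int regression mentioned above.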
Re: [HACKERS] Reporting the commit LSN at commit time
On Thu, Aug 7, 2014 at 4:15 AM, Craig Ringer wrote:
> I'm thinking about adding a new message type in the protocol that gets
> sent immediately before CommandComplete, containing the LSN of the
> commit. Clients would need to enable the sending of this message with a
> GUC that they set when they connect, so it doesn't confuse clients that
> aren't expecting it or aware of it.
>
> Is this something you can see being useful for other non-BDR purposes?

I have been thinking about something similar. For regular streaming replication you could keep track of the commit LSN on a per-client basis and automatically redirect read queries to a standby once the standby's apply location has passed the commit LSN of that particular client. This can be done mostly transparently to the application, while avoiding the issue of recent modifications disappearing for a while after commit.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
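The routing rule described above is a one-line comparison; a load balancer would track the last commit LSN per client and the last reported replay LSN per standby. The helper here is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * A standby is safe for this client's reads once it has replayed at
 * least up to the client's last observed commit LSN; otherwise the
 * query must go to the master (or wait).
 */
static bool
standby_is_current(XLogRecPtr standby_replay_lsn, XLogRecPtr client_commit_lsn)
{
    return standby_replay_lsn >= client_commit_lsn;
}
```

The per-client commit LSN would come from the proposed pre-CommandComplete protocol message, and the standby replay LSN from its feedback reporting.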
Re: [HACKERS] better atomics - v0.5
On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund wrote:
> Imagine the situation for the buffer header spinlock which is one of the
> bigger performance issues atm. We could aim to replace all usages of
> that with clever and complicated logic, but it's hard.
>
> The IMO reasonable (and prototyped) way to do it is to make the common
> paths lockless, but fall back to the spinlock for the more complicated
> situations. For the buffer header that means that pin/unpin and buffer
> lookup are lockless, but IO and changing the identity of a buffer still
> require the spinlock. My attempts to avoid the latter basically required
> a buffer header specific reimplementation of spinlocks.

There is a 2010 paper [1] that demonstrates a fully non-blocking approach to buffer management using the same generalized clock algorithm that PostgreSQL has. The site also has an implementation for Apache Derby. You may find some interesting ideas in there.

[1] http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] gettimeofday is at the end of its usefulness?
On Wed, May 14, 2014 at 6:34 AM, Greg Stark wrote:
> I always assumed the kernel used rdtsc to implement some of the high
> performance timers. It can save the current time in a mapped page when
> it schedules a process and then in the vdso syscall (ie in user-space)
> it can use rdtsc to calculate the offset needed to adjust that
> timestamp to the current time. This seems consistent with your
> calculations that showed the 40ns overhead with +/- 10ns precision.

Both gettimeofday and clock_gettime do exactly that. [1] clock_gettime(CLOCK_MONOTONIC) is the mode of operation we would want to use here.

> I actually think it would be more interesting if we could measure the
> overhead and adjust for it. I don't think people are really concerned
> with how long EXPLAIN ANALYZE takes to run if they could get accurate
> numbers out of it.

Measuring would also be a good idea so we can automatically turn on performance counters like IO timing when we know it's not obscenely expensive. However, subtracting the overhead will still skew the numbers somewhat by giving more breathing time for memory and IO prefetching mechanisms. Another option to consider would be to add a sampling based mechanism for low overhead time attribution. It would be even better if we could distinguish between time spent waiting on locks vs. waiting on IO vs. waiting to be scheduled vs. actually executing.

[1] https://github.com/torvalds/linux/blob/master/arch/x86/vdso/vclock_gettime.c#L223

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Proposal for CSN based snapshots
On Mon, May 12, 2014 at 7:10 PM, Greg Stark wrote:
> Would it be useful to store the current WAL insertion point along with
> the "about to commit" flag so it's effectively a promise that this
> transaction will commit no earlier than XXX. That should allow most
> transactions to decide if those records are visible or not unless
> they're very recent transactions which started in that short window
> while the committing transaction was in the process of committing.

I don't believe this is worth the complexity. The contention window is extremely short here.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Proposal for CSN based snapshots
On Mon, May 12, 2014 at 6:09 PM, Robert Haas wrote:
> However, I wonder what
> happens if you write the commit record and then the attempt to update
> pg_clog fails. I think you'll have to PANIC, which kind of sucks.

A CLOG IO error while committing is already a PANIC: SimpleLruReadPage() does SlruReportIOError(), which in turn does ereport(ERROR) while inside a critical section initiated in RecordTransactionCommit(), and an ERROR inside a critical section is promoted to PANIC.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Proposal for CSN based snapshots
On Mon, May 12, 2014 at 4:56 PM, Heikki Linnakangas wrote:
> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
>>
>> We are also planning to implement CSN based snapshot.
>> So I am curious to know whether any further development is happening on
>> this.
>
> I started looking into this, and plan to work on this for 9.5. It's a big
> project, so any help is welcome. The design I have in mind is to use the LSN
> of the commit record as the CSN (as Greg Stark suggested).

I did do some coding work on this, but the free time I used to work on it basically disappeared with a child in the family. I guess what I have has bitrotted beyond recognition. However, I may still have some insight that may be of use.

From your comments I presume that you are going with the original, simpler approach proposed by Robert: simply keep the XID-CSN map around forever and probe it for all visibility lookups that lie outside of the xmin-xmax range? As opposed to the more complex hybrid approach I proposed, which keeps a short-term XID-CSN map and lazily builds conventional list-of-concurrent-XIDs snapshots for long-lived snapshots. I think that would be prudent, as the simpler approach needs mostly the same groundwork, and if it turns out to work well enough, simpler is always better.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas wrote: > I think that the basic problem here is that usage counts increase when > buffers are referenced, but they decrease when buffers are evicted, > and those two things are not in any necessary way connected to each > other. In particular, if no eviction is happening, reference counts > will converge to the maximum value. I've read a few papers about > algorithms that attempt to segregate the list of buffers into "hot" > and "cold" lists, and an important property of such algorithms is that > they mustn't be allowed to make everything hot. It's easy to be too > simplistic, here: an algorithm that requires that no more than half > the list be hot will fall over badly on a workload where the working > set exceeds the available cache and the really hot portion of the > working set is 60% of the available cache. So you need a more > sophisticated algorithm than that. But that core property that not > all buffers can be hot must somehow be preserved, and our algorithm > doesn't. FWIW in CLOCK-Pro segregating buffers between hot and cold is tied to eviction and the clock sweep, the ratio between hot and cold is dynamically adapted based on prior experience. The main downside is that it seems to require an indirection somewhere either in the clock sweep or buffer lookup. Maybe it's possible to avoid that with some clever engineering if we think hard enough. CLOCK-Pro may also have too little memory of hotness, making it too easy to blow the whole cache away with a burst of activity. It may be useful to have a (possibly tunable) notion of fairness where one query/backend can't take over the cache even though it may be an overall win in terms of total number of I/Os performed. Maybe we need to invent Generalized CLOCK-Pro with a larger number of levels, ranging from cold, hot and scalding to infernal. 
:) Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan wrote: > PostgreSQL implements a clock sweep algorithm, which gets us something > approaching an LRU for the buffer manager in trade-off for less > contention on core structures. Buffers have a usage_count/"popularity" > that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK > algorithm only has one bit for what approximates our "usage_count" (so > it's either 0 or 1). I think that at its core CLOCK is an algorithm > that has some very desirable properties that I am sure must be > preserved. Actually, I think it's more accurate to say we use a > variant of clock pro, a refinement of the original CLOCK. PostgreSQL's replacement algorithm is more similar to Generalized CLOCK or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm that approximates LIRS [3]. LIRS is what MySQL implements [4]; CLOCK-Pro is implemented by NetBSD [5], and there has been some work on trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping metadata entries for double the cache size, which lets them detect pages that have been referenced recently. Basically they provide an adaptive tradeoff between LRU and LFU. > In the past, various hackers have noted problems they've observed with > this scheme. A common pathology is to see frantic searching for a > victim buffer only to find all buffer usage_count values at 5. It may > take multiple revolutions of the clock hand before a victim buffer is > found, as usage_count is decremented for each and every buffer. Also, > BufFreelistLock contention is considered a serious bottleneck [1], > which is related. There's a paper on a non-blocking GCLOCK algorithm that does lock-free clock sweep and buffer pinning [7].
If we decide to stay with GCLOCK it may be interesting, although I still believe that some variant of buffer nailing [8] is a better idea: my experience shows that most of the locking overhead is cache line bouncing, ignoring the extreme cases where our naive spinlock implementation blows up. > Let's leave aside inner/root pages though, because they're so > dramatically useful when in a primary index on a tpb-b table that > they'll always be cached by any non-terrible algorithm. It beggars > belief that the still relatively dense (and consequently *popular*) > B+Tree leaf pages get so little credit for being of such long-term > utility (in the view of our clock sweep algorithm as implemented). The > algorithm has what could be loosely described as an excessively > short-term perspective. There is clearly a better balance to be had > here. I don't think the answer is that we have the B-Tree code give > its pages some special treatment that makes them harder to evict, > although I will admit that I briefly entertained the idea. There has been some research that indicates that for TPC-A workloads giving index pages higher weights increases hit rates [1]. I think the hardest hurdle for any changes in this area will be showing that we don't have any nasty regressions. I think the best way to do that would be to study separately the performance overhead of the replacement algorithm and the optimality of the replacement choices. If we capture a bunch of buffer reference traces by instrumenting PinBuffer, we can pretty accurately simulate the behavior of different algorithms and tuning choices with different shared buffer sizes. Obviously full-scale tests are still needed due to interactions with OS, controller and disk caches and other miscellaneous influences. But even so, simulation would get us much better coverage of various workloads and at least some confidence that it's a good change overall.
It will be very hard and time-consuming to gather equivalent evidence with full-scale tests. [1] http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf [2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf [3] http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf [4] http://lists.mysql.com/commits/28601 [5] http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD [6] http://lwn.net/Articles/147879/ [7] http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf [8] http://www.postgresql.org/message-id/ca+tgmozypeyhwauejvyy9a5andoulcf33wtnprfr9sycw30...@mail.gmail.com Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
[HACKERS] Missing pfree in logical_heap_rewrite_flush_mappings()
It seems to me that when flushing logical mappings to disk, each mapping file leaks the buffer used to pass the mappings to XLogInsert. Also, it seems consistent to allocate that buffer in the RewriteState memory context. Patch attached. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 4cf07ea..ae439e8 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -897,7 +897,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)

 		/* write all mappings consecutively */
 		len = src->num_mappings * sizeof(LogicalRewriteMappingData);
-		waldata = palloc(len);
+		waldata = MemoryContextAlloc(state->rs_cxt, len);
 		waldata_start = waldata;

 		/*
@@ -943,6 +943,7 @@

 		/* write xlog record */
 		XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_REWRITE, rdata);
+		pfree(waldata);
 	}
 	Assert(state->rs_num_rewrite_mappings == 0);
 }
Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik wrote: > Even if reordering was not done by compiler, it still can happen while > execution. > There is no warranty that two subsequent assignments will be observed by all > CPU cores in the same order. The x86 memory model (total store order) provides that guarantee in this specific case. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Dec 15, 2013 6:44 PM, "Tom Lane" wrote: > David Rowley writes: > > I've attached an updated patch which includes some documentation. > > I've also added support for negfunc in CREATE AGGREGATE. Hopefully that's > > an ok name for the option, but if anyone has any better ideas please let > > them be known. > > I'd be a bit inclined to build the terminology around "reverse" instead of > "negative" --- the latter seems a bit too arithmetic-centric. But that's > just MHO. To contribute to the bike shedding, inverse is often used in similar contexts. -- Ants Aasma
Re: [HACKERS] better atomics
On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund wrote: > FWIW, I find the requirement for atomic_init() really, really > annoying. Not that that will change anything ;) Me too, but I guess this allows them to emulate atomics on platforms that don't have native support. Whether that is a good idea is a separate question. >> However I'm mildly supportive on having a separate type for variables >> accessed by atomics. It can result in some unnecessary code churn, but >> in general if an atomic access to a variable is added, then all other >> accesses to it need to be audited for memory barriers, even if they >> were previously correctly synchronized by a lock. > > Ok, that's what I am writing right now. > >> I guess the best approach for deciding would be to try to convert a >> couple of the existing unlocked accesses to the API and see what the >> patch looks like. > > I don't think there really exist any interesting ones? I am using my > lwlock reimplementation as a testbed so far.

BufferDesc management in src/backend/storage/buffer/bufmgr.c.

The seqlock-like thing used to provide consistent reads from PgBackendStatus in src/backend/postmaster/pgstat.c (this code looks like it's broken on weakly ordered machines).

The unlocked reads that are done from various procarray variables.

The XLogCtl accesses in xlog.c.

IMO the volatile keyword should be confined to the code actually implementing atomics and locking. The "use volatile pointer to prevent code rearrangement" comment is evil. If there are data races in the code, they need comments pointing this out and probably memory barriers too. If there are no data races, then the volatile declaration is useless and a pessimization. Currently it's more of a catch-all "here be dragons" declaration for data structures.
Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] better atomics
On Tue, Nov 5, 2013 at 5:51 PM, Andres Freund wrote: > So what I've previously proposed did use plain types for the > to-be-manipulated atomic integers. I am not sure about that anymore > though, C11 includes support for atomic operations, but it assumes that > the to-be-manipulated variable is of a special "atomic" type. While I am > not propose that we try to emulate C11's API completely (basically > impossible as it uses type agnostic functions, it also exposes too > much), it seems to make sense to allow PG's atomics to be implemented by > C11's. > > The API is described at: http://en.cppreference.com/w/c/atomic > > The fundamental difference to what I've suggested so far is that the > atomic variable isn't a plain integer/type but a distinct datatype that > can *only* be manipulated via the atomic operators. That has the nice > property that it cannot be accidentally manipulated via plain operators. > > E.g. it wouldn't be > uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_); > but > uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_); > > where, for now, atomic_uint32 is something like: > typedef struct pg_atomic_uint32 > { > volatile uint32 value; > } pg_atomic_uint32; > a future C11 implementation would just typedef C11's respective atomic > type to pg_atomic_uint32. > > Comments? C11 atomics need to be initialized through atomic_init(), so a simple typedef will not be correct here. Also, C11 atomics are designed to have the compiler take care of memory barriers - so they might not always be a perfect match to our API's, or any API implementable without compiler support. However I'm mildly supportive on having a separate type for variables accessed by atomics. It can result in some unnecessary code churn, but in general if an atomic access to a variable is added, then all other accesses to it need to be audited for memory barriers, even if they were previously correctly synchronized by a lock. 
I guess the best approach for deciding would be to try to convert a couple of the existing unlocked accesses to the API and see what the patch looks like. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Tue, Oct 22, 2013 at 4:00 AM, Gavin Flower wrote: >> I have a proof of concept patch somewhere that does exactly this. I >> used logarithmic bin widths. With 8 log10 bins you can tell the >> fraction of queries running at each order of magnitude from less than >> 1ms to more than 1000s. Or with 31 bins you can cover factor of 2 >> increments from 100us to over 27h. And the code is almost trivial, >> just take a log of the duration and calculate the bin number from that >> and increment the value in the corresponding bin. > > I suppose this has to be decided at compile time to keep the code both > simple and efficient - if so, I like the binary approach. For efficiency's sake it can easily be done at run time; one extra logarithm calculation per query will not be noticeable. Having a proper user interface to make it configurable and changeable is where the complexity is. We might just decide to go with something good enough, as even the 31-bin solution would bloat the pg_stat_statements data structure by only about 10%. > Curious, why start at 100us? I suppose this might be of interest if > everything of note is in RAM and/or stuff is on SSD's. Selecting a single row takes about 20us on my computer; I picked 100us as a reasonable limit below which the exact speed doesn't matter anymore. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
On Tue, Oct 22, 2013 at 1:09 AM, Alvaro Herrera wrote: > Gavin Flower wrote: > >> One way it could be done, but even this would consume far too much >> storage and processing power (hence totally impractical), would be >> to 'simply' store a counter for each value found and increment it >> for each occurence... > > An histogram? Sounds like a huge lot of code complexity to me. Not > sure the gain is enough. I have a proof of concept patch somewhere that does exactly this. I used logarithmic bin widths. With 8 log10 bins you can tell the fraction of queries running at each order of magnitude from less than 1ms to more than 1000s. Or with 31 bins you can cover factor of 2 increments from 100us to over 27h. And the code is almost trivial, just take a log of the duration and calculate the bin number from that and increment the value in the corresponding bin. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] removing old ports and architectures
On Fri, Oct 18, 2013 at 8:04 PM, Peter Geoghegan wrote: > On Fri, Oct 18, 2013 at 9:55 AM, Ants Aasma wrote: >> FWIW, I think that if we approach coding lock free algorithms >> correctly - i.e. "which memory barriers can we avoid while being >> safe", instead of "which memory barriers we need to add to become >> safe" - then supporting Alpha isn't a huge amount of extra work. > > Alpha is completely irrelevant, so I would not like to expend the > tiniest effort on supporting it. If there is someone using a very much > legacy architecture like this, I doubt that even they will appreciate > the ability to upgrade to the latest major version. It's mostly irrelevant and I wouldn't shed a tear for Alpha support, but I'd like to point out that it's a whole lot less irrelevant than some of the architectures being discussed here. The latest Alpha machines were sold only 6 years ago and supported up to 512GB of memory with 64 1.3 GHz cores, something that can run a very reasonable database load even today. Regards, Ants Aasma
Re: [HACKERS] removing old ports and architectures
On Thu, Oct 17, 2013 at 3:10 PM, Robert Haas wrote: > On Thu, Oct 17, 2013 at 12:22 AM, Noah Misch wrote: >> Removing support for alpha is a different animal compared to removing support >> for non-gcc MIPS and most of the others in your list. A hacker wishing to >> restore support for another MIPS compiler would fill in the assembly code >> blanks, probably using code right out of an architecture manual. A hacker >> wishing to restore support for alpha would find himself auditing every >> lock-impoverished algorithm in the backend. > > I had much the same thought last night. So I reverse my vote on > Alpha: let's drop it. I had thought that perhaps there'd be some > value in keeping it to force ourselves to consider what will happen > under the weakest generally-understood memory model, but in fact > that's probably a doomed effort without having the hardware available > to test the code. As you say, any future atomics support for such a > platform will be a major undertaking. FWIW, I think that if we approach coding lock free algorithms correctly - i.e. "which memory barriers can we avoid while being safe", instead of "which memory barriers we need to add to become safe" - then supporting Alpha isn't a huge amount of extra work. We only need a couple of extra barriers after atomic reads where I think we should have a comment anyway explaining that we don't need a read barrier because a data dependency provides ordering. In general I agree that we are unlikely to provide a bug free result without a build farm animal, so I'm ±0 on removing support. We can try to support, but we are unlikely to succeed. I also find it unlikely that anyone will create a new architecture with a similarly loose memory model. The experience with Alpha and other microprocessors shows that the extra hardware needed for fast and strong memory ordering guarantees more than pays for itself in performance. 
Regards, Ants Aasma
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure wrote: > On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma wrote: >> I haven't reviewed the code in as much detail to say if there is an >> actual race here, I tend to think there's probably not, but the >> specific pattern that I had in mind is that with the following actual >> code: > > hm. I think there *is* a race. 2+ threads could race to the line: > > LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress; > > and simultaneously set the value of LocalRecoveryInProgress to false, > and both engage InitXLOGAccess, which is destructive. The move > operation is atomic, but I don't think there's any guarantee the reads > to xlogctl->SharedRecoveryInProgress are ordered between threads > without a lock. > > I don't think the memory barrier will fix this. Do you agree? If so, > my earlier patch (recovery4) is another take on this problem, and > arguable safer; the unlocked read is in a separate path that does not > engage InitXLOGAccess() InitXLOGAccess only writes backend-local variables, so it can't be destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock acquisition cycle, which should be a full memory barrier. A read barrier after accessing SharedRecoveryInProgress is good enough, and it seems to be necessary to avoid a race condition for XLogCtl->ThisTimeLineID. Admittedly the race is so improbable that in practice it probably doesn't matter. I just wanted to spell out the correct way to do unlocked accesses, as it can get quite confusing. I found Herb Sutter's "atomic<> weapons" talk very useful; thinking about the problem in terms of acquire and release makes it much clearer to me. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Wed, Oct 2, 2013 at 4:39 PM, Merlin Moncure wrote: > On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma wrote: >> So we need a read barrier somewhere *after* reading the flag in >> RecoveryInProgress() and reading the shared memory structures, and in >> theory a full barrier if we are going to be writing data. > > wow -- thanks for your review and provided detail. Considering there > are no examples of barrier instructions to date, I think some of your > commentary should be included in the in-source documentation. > > In this particular case, a read barrier should be sufficient? By > 'writing data', do you mean to the xlog control structure? This > routine only sets a backend local flag so that should be safe? I haven't reviewed the code in enough detail to say whether there is an actual race here, I tend to think there's probably not, but the specific pattern that I had in mind is that with the following actual code:

Process A:
    globalVar = 1;
    write_barrier();
    recoveryInProgress = false;

Process B:
    if (!recoveryInProgress) {
        globalVar = 2;
        doWork();
    }

If process B speculatively executes line 2 before reading the flag for line 1, then it's possible that the store in process B is executed before the store in process A, making globalVar move backwards. The barriers as they are defined don't make this scenario impossible. That said, I don't know of any hardware that would make speculatively executed stores visible to non-speculative state; as I said, that would be completely insane. However, currently compilers consider it completely legal to rewrite the code into the following form:

    tmp = globalVar;
    globalVar = 2;
    if (!recoveryInProgress) {
        doWork();
    } else {
        globalVar = tmp;
    }

That still exhibits the same problem. An abstract read barrier would not be enough here, as this requires a LoadStore barrier. However, the control dependency is enough for the hardware, and PostgreSQL's pg_read_barrier() is a full compiler barrier, so in practice a simple pg_read_barrier() is enough.
Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Tue, Oct 1, 2013 at 2:08 PM, Andres Freund wrote: >> As for the specific patch being discussed here. The read barrier is in >> the wrong place and with the wrong comment, and the write side is >> assuming that SpinLockAcquire() is a write barrier, which it isn't >> documented to do at the moment. > > Lots of things we do pretty much already assume it is one - and I have a > hard time imagining any spinlock implementation that isn't a write > barrier. In fact it's the only reason we use spinlocks in quite some > places. > What we are missing is that it's not guaranteed to be a compiler barrier > on all platforms :(. But that's imo a bug. Agree on both counts. >> The correct way to think of this is >> that StartupXLOG() does a bunch of state modifications and then >> advertises the fact that it's done by setting >> xlogctl->SharedRecoveryInProgress = false; The state modifications >> should better be visible to anyone seeing that last write, so you need >> one write barrier between the state modifications and setting the >> flag. > > SpinLockAcquire() should provide that. Yes. It's notable that in this case it's a matter of correctness that the global state modifications do *not* share the critical section with the flag update. Otherwise the flag update may become visible before the state updates. >> So we need a read barrier somewhere *after* reading the flag in >> RecoveryInProgress() and reading the shared memory structures, and in >> theory a full barrier if we are going to be writing data. In practice >> x86 is covered thanks to it's memory model, Power is covered thanks to >> the control dependency and ARM would need a read barrier, but I don't >> think any real ARM CPU does speculative stores as that would be >> insane. > > Does there necessarily have to be a "visible" control dependency? Unfortunately no, the compiler is quite free to hoist loads and even stores out from the conditional block losing the control dependency. 
:( It's quite unlikely to do so as it would be a very large code motion and it probably has no reason to do it, but I don't see anything that would disallow it. I wonder if we should just emit a full fence there for platforms with weak memory models. Trying to enforce the control dependency seems quite fragile. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Freezing without write I/O
On Tue, Oct 1, 2013 at 2:13 PM, Andres Freund wrote: > Agreed. The "wait free LW_SHARED" thing[1] I posted recently had a simple > > #define pg_atomic_read(atomic) (*(volatile uint32 *)&(atomic)) > > That should be sufficient and easily greppable, right? Looks good enough for me. I would consider using a naming scheme that accounts for possible future uint64 atomics. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Freezing without write I/O
Just found this in my drafts folder. Sorry for the late response. On Fri, Sep 20, 2013 at 3:32 PM, Robert Haas wrote: > I am entirely unconvinced that we need this. Some people use acquire > + release fences, some people use read + write fences, and there are > all combinations (read-acquire, read-release, read-acquire-release, > write-acquire, write-release, write-acquire-release, both-acquire, > both-release, both-acquire-release). I think we're going to have > enough trouble deciding between the primitives we already have, let > alone with a whole mess of new ones. Mistakes will only manifest > themselves on certain platforms and in many cases the races are so > tight that actual failures are very unlikely to be reserved in > regression testing. I have to retract my proposal to try to emulate C11 atomics in C89. I guess this goes to show why one shouldn't try to conjure up atomic APIs at 2AM. I forgot the fact that while acquire-release semantics are enough to ensure sequentially consistent behavior, the actual barriers need to be paired with specific atomic accesses to achieve that. It's not possible to use freestanding acquire/release barriers to implement this, nor would it be possible to include barriers in the atomic accesses themselves without causing significant pessimization. Sequential consistency (along with the causal transitivity and total store ordering that come with it) should be regarded as a goal. I'm not able to reason about concurrent programs without those guarantees, and I suspect no one else is either. Sequential consistency is guaranteed if atomic variables (including locks) are accessed with appropriate acquire and release semantics. We just need to use a hodge-podge of read/write/full barriers and volatile memory accesses to actually implement the semantics until some distant future date when we can start relying on compilers getting it right. I still think we should have a macro for the volatile memory accesses.
As a rule, each one of those needs a memory barrier, and if we consolidate them, or optimize them out, the reasons why this is safe should be explained in a comment. Race-prone memory accesses should stick out like a sore thumb. > Personally, I think the biggest change that would help here is to > mandate that spinlock operations serve as compiler fences. That would > eliminate the need for scads of volatile references all over the > place. +1. The commits that you showed fixing issues in this area show quite well why this is a good idea. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.
On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund wrote: > What confuses me is that pg_read_barrier() is just a compiler barrier on > x86[-64] in barrier.h. According to my knowledge it needs to be an > lfence or the full barrier? > The linked papers from Paul McKenney - which are a great read - seem to > agree? x86 provides a pretty strong memory model: the only (relevant) allowed reordering is that stores may be delayed beyond subsequent loads. A simple and functionally accurate model of this is to ignore all caches and think of the system as a bunch of CPUs accessing the main memory through a shared bus, but each CPU has a small buffer that stores writes done by that CPU. Intel and AMD have performed feats of black magic in their memory pipelines, so this isn't a good performance model, but for correctness it is valid. The exceptions are a few unordered memory instructions that you have to specifically ask for, and non-writeback memory that concerns the kernel (see Section 8.2 in [1] for details). The note by McKenney stating that lfence is required for a read barrier is for those unordered instructions. I don't think it's necessary in PostgreSQL. As for the specific patch being discussed here: the read barrier is in the wrong place and with the wrong comment, and the write side is assuming that SpinLockAcquire() is a write barrier, which it isn't documented to do at the moment. The correct way to think of this is that StartupXLOG() does a bunch of state modifications and then advertises the fact that it's done by setting xlogctl->SharedRecoveryInProgress = false; The state modifications had better be visible to anyone seeing that last write, so you need one write barrier between the state modifications and setting the flag. On the other side, after reading the flag as false in RecoveryInProgress() the state modified by StartupXLOG() may be investigated; those loads had better not happen before we read the flag.
So we need a read barrier somewhere between reading the flag in RecoveryInProgress() and reading the shared memory structures - and in theory a full barrier if we are going to write data. In practice x86 is covered thanks to its memory model, Power is covered thanks to the control dependency, and ARM would need a read barrier; but I don't think any real ARM CPU does speculative stores, as that would be insane.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
On Thu, Sep 19, 2013 at 2:42 PM, Heikki Linnakangas wrote:
>> * switchFinishXmin and nextSwitchXid should probably be either volatile
>> or have a compiler barrier between accessing shared memory and
>> checking them. The compiler very well could optimize them away and
>> access shmem all the time which could lead to weird results.
>
> Hmm, that would be a weird "optimization". Is that really legal for the
> compiler to do? Better safe than sorry, I guess.

C doesn't really have a multi-threaded memory model before C11, so the compiler makers have just made one up that suits them best. For unlocked memory reads the compiler is free to assume that no one else is accessing the variable, so yes, that "optimization" would be legal according to their rules.

I'm tackling similar issues in my patch. What I'm thinking is that we should change our existing supposedly atomic accesses to be more explicit and make the API compatible with C11 atomics [1]. For now I think the changes should be something like this:

* Have separate typedefs for atomic variants of datatypes (I don't think we have a whole lot of them). With C11 atomics support, this would change to

      typedef _Atomic TransactionId AtomicTransactionId;

* Require that pg_atomic_init(type, var, val) be used to initialize atomic variables. Initially it would just pass through to assignment, as all supported platforms can do 32-bit atomic ops. C11-compatible compilers would delegate to atomic_init().

* Create atomic read and write macros that look something like

      #define pg_atomic_load(type, var) (*((volatile type*) &(var)))

  and

      #define pg_atomic_store(type, var, val) do { \
          *((volatile type*) &(var)) = (val); \
      } while(0)

  C11 would pass through to the compiler's implementation with relaxed memory ordering.

* Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes a convincing argument for why loosening up the barrier semantics is the sane choice here.
[2] C11 support can then pass through to atomic_thread_fence().

This way we have a relatively future-proof baseline for lock-free programming, with options to expand with other primitives. We could also do only the volatile access macros part; at least it would make the intention clearer than the current approach of sprinkling volatile pointers around.

Regards,
Ants Aasma

[1] http://en.cppreference.com/w/c/atomic
[2] (long video about atomics) http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
Re: [HACKERS] Questions about checksum feature in 9.3
On Sun, Sep 15, 2013 at 8:13 AM, Kevin wrote:
> My attempts to compile it vectorized on OS X seem to have failed, since I
> don't find a vector instruction in the .o file even though the options
> -msse4.1 -funroll-loops -ftree-vectorize should be supported according to the
> man page for Apple's llvm-gcc.

I'm not sure what version of LLVM Apple is using for llvm-gcc. I know that clang+llvm 3.3 can successfully vectorize the checksum algorithm when -O3 is used.

> So, has anyone compiled checksum vectorized on OS X? Are there any
> performance data that would indicate whether or not I should worry about this
> in the first place?

Even without vectorization the worst-case performance hit is about 20%, and that is for a workload fully bottlenecked on swapping pages between shared buffers and the OS cache. In real-world cases it's hard to imagine it having any measurable effect. A single core can checksum several gigabytes per second of I/O without vectorization, and about 30GB/s with vectorization.

Regards,
Ants Aasma
Re: [HACKERS] WAL CPU overhead/optimization (was Master-slave visibility order)
On Fri, Aug 30, 2013 at 3:02 AM, Andres Freund wrote:
> I am not sure "hot cache large buffer performance" is really the
> interesting case. Most of the XLogInsert()s are pretty small in the
> common workloads. I vaguely recall trying 8 and getting worse
> performance on many workloads, but that might have been a problem of my
> implementation.

Slice-by-8 doesn't have any overhead for small buffers besides the lookup tables, so it was most likely the cache misses that were the issue. Murmur3, CityHash and SpookyHash don't have any lookup tables and are excellent with small keys - CityHash especially, where a 64-byte hash is quoted at 9 ns.

> The reason I'd like to go for a faster CRC32 implementation as a first
> step is that it's easy. Easy to verify, easy to analyze, easy to back
> out. I personally don't have enough interest/time in the 9.4 cycle
> to pursue conversion to a different algorithm (I find the idea of using
> different ones on 32/64bit pretty bad), but I obviously won't stop
> somebody else ;)

I might give it a shot later this cycle, as I have familiarized myself with the problem domain anyway. I understand the appeal of staying with what we have, but this would cap the speedup at 4x and comes with large caveats from the extra lookup tables. A 28x speedup might be worth the extra effort.

Regards,
Ants Aasma