Re: [HACKERS] PL/pgSQL 2
On 09/01/2014 04:04 AM, Joel Jacobson wrote: + Make UPDATE/INSERT/DELETE throw error if they didnt' modify exactly 1 row, as that's the most common use-case, and provide alternative syntax to modify multiple or zero rows. What? No. The whole point of SQL is that it's set-based and can modify multiple rows at once. Hobbling it specifically for functions seems fundamentally flawed. Especially for what we purport to be a core PostgreSQL language. + Change all warnings into errors I... what? I could see coming up with a better exception handling mechanism for escalating messages. But you're talking about taking a core element of PostgreSQL (warnings) and simply ripping them out so plpgsql2 loses even that small functionality. I'm sure you've put a lot of thought into this, but you're not the only person using plpgsql or any, however ambitious, potential replacement. -- Shaun Thomas OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PL/pgSQL 1.2
On 09/03/2014 04:19 PM, Hannu Krosing wrote: 1. Conditions for number of rows returned by SELECT or touched by UPDATE or DELETE Now that I think upon this... don't we already have it? SELECT ... LIMIT 1 That already solves the purported problem of multiple results in SELECT INTO as well. Could we possibly extend that to UPDATE and DELETE syntax too? -- Shaun Thomas OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
On 08/05/2014 12:56 AM, Mark Kirkwood wrote: The moral of the story for this case is that mapping Oracle to Postgres datatypes can require some careful thought. Using 'native' types (like integer, float8 etc) will generally give vastly quicker performance. We've seen a lot of this ourselves. Oracle's NUMERIC is a native type, whereas ours is emulated. From the performance, it would appear that REAL is another calculated type. At least you used INT though. I've seen too many Oracle shops using NUMERIC in PostgreSQL because it's there, and suffering major performance hits because of it. That said, the documentation here says FLOAT4 is an alias for REAL, so it's somewhat nonintuitive for FLOAT4 to be so much slower than FLOAT8, which is an alias for DOUBLE PRECISION. http://www.postgresql.org/docs/9.3/static/datatype.html Not sure why that would be. -- Shaun Thomas OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allowing parallel pg_restore from pipe
On 05/16/2013 12:16 PM, Timothy Garnett wrote: I think you'd have to be real careful around foreign key constraints for that to work. Not especially. All you need to do is bootstrap the database with a bunch of empty table targets (no constraints, keys, etc), then restore with the xargs command. Then you can apply the constraints, keys, and indexes later by doing a schema-only parallel pg_restore. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] high io BUT huge amount of free memory
On 05/01/2013 06:37 PM, Bruce Momjian wrote: Sorry to be dense here, but what is the problem with that output? That there is a lot of memory marked as free? Why would it mark any memory free? That's kind of my point. :) That 14GB isn't allocated to cache, buffers, any process, or anything else. It's just... free. In the middle of the day, where 800 PG threads are pulling 7000TPS on average. Based on that scenario, I'd like to think it would cache pretty aggressively, but instead, it's just leaving 14GB around to do absolutely nothing. It makes me sad. :( -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] high io BUT huge amount of free memory
On 05/02/2013 12:04 PM, Josh Berkus wrote: There is a good, but sad, reason for this: IBM and Oracle and their partners are the largest employers of people hacking on core Linux memory/IO functionality, and both of those companies use DirectIO extensively in their products. I never thought of that. Somehow I figured all the Redhat engineers would somehow counterbalance that kind of influence. But that brings up an interesting question. How hard / feasible would it be to add DIO functionality to PG itself? I've already heard chatter (Robert Haas?) about converting the shared memory allocation to an anonymous block, so could we simultaneously open up a DMA relationship? -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allowing parallel pg_restore from pipe
On 04/25/2013 12:56 PM, Timothy Garnett wrote: As the OP, I'll just note that my organization would definitely find use for a parallel migrator tool as long as it supported doing a selection of tables (i.e. -t / -T) in addition to the whole database and it supported or we were able to patch in an option to cluster as part of the migration (the equivalent of something like https://github.com/tgarnett/postgres/commit/cc320a71 ). If you need something like this short term, we actually found a way to do it ourselves for a migration we performed back in October. The secret is xargs with the -P option: xargs -I{} -P 8 -a table-list.txt \ bash -c pg_dump -Fc -t {} my_db | pg_restore -h remote -d my_db Fill table-list.txt with as many, or as few tables as you want. The above example would give you 8 parallel threads. Well equipped systems may be able to increase this. Admittedly it's a gross hack, but it works. :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] high io BUT huge amount of free memory
On 04/24/2013 08:24 AM, Robert Haas wrote: Are you referring to the fact that vm.zone_reclaim_mode = 1 is an idiotic default? Well... it is. But even on systems where it's not the default or is explicitly disabled, there's just something hideously wrong with NUMA in general. Take a look at our numa distribution on a heavily loaded system: available: 2 nodes (0-1) node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 node 0 size: 36853 MB node 0 free: 14315 MB node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 node 1 size: 36863 MB node 1 free: 300 MB node distances: node 0 1 0: 10 20 1: 20 10 What the hell? Seriously? Using numactl and starting in interleave didn't fix this, either. It just... arbitrarily ignores a huge chunk of memory for no discernible reason. The memory pressure code in Linux is extremely fucked up. I can't find it right now, but the memory management algorithm makes some pretty ridiculous assumptions once you pass half memory usage, regarding what is in active and inactive cache. I hate to rant, but it gets clearer to me every day that Linux is optimized for desktop systems, and generally only kinda works for servers. Once you start throwing vast amounts of memory, CPU, and processes at it though, things start to get unpredictable. That all goes back to my earlier threads that disabling process autogrouping via the kernel.sched_autogroup_enabled setting, magically gave us 20-30% better performance. The optimal setting for a server is clearly to disable process autogrouping, and yet it's enabled by default, and strongly advocated by Linus himself as a vast improvement. I get it. It's better for desktop systems. But the LAMP stack alone has probably a couple orders of magnitude more use cases than Joe Blow's Pentium 4 in his basement. Yet it's the latter case that's optimized for. Servers are getting shafted in a lot of cases, and it's actually starting to make me angry. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] high io BUT huge amount of free memory
On 04/24/2013 08:49 AM, Andres Freund wrote: Uh. Ranting can be rather healthy thing every now and then and it good for the soul and such. But. Did you actually try reporting those issues? That's actually part of the problem. How do you report: Throwing a lot of processes at a high-memory system seems to break the mm code in horrible ways. I'm asking seriously here, because I have no clue how to isolate this behavior. It's clearly happening often enough that random people are starting to notice now that bigger servers are becoming the norm. I'm also not a kernel dev in any sense of the word. My C is so rusty, I can barely even read the patches going through the ML. I feel comfortable posting to PG lists because that's my bread and butter. Kernel lists seem way more imposing, and I'm probably not the only one who feels that way. I guess I don't mean to imply that kernel devs don't care. Maybe the right way to put it is that there don't seem to be enough kernel devs being provided with more capable testing hardware. Which is odd, considering Red Hat's involvement and activity on the kernel. I apologize, though. These last few months have been really frustrating thanks to this and other odd kernel-related issues. We've reached an equilibrium where the occasional waste of 20GB of RAM doesn't completely cripple the system, but this thread kinda struck a sore point. :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] high io BUT huge amount of free memory
On 04/22/2013 05:12 PM, Merlin Moncure wrote: free -g total used free sharedbuffers cached Mem: 378250128 0 0229 -/+ buffers/cache: 20357 This is most likely a NUMA issue. There really seems to be some kind of horrible flaw in the Linux kernel when it comes to properly handling NUMA on large memory systems. What does this say: numactl --hardware -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fast promotion, loose ends
On 04/22/2013 02:58 AM, Simon Riggs wrote: So, to initiate promotion, you can create a file called $DATADIR/fast_promote or $DATADIR/promote Pardon my naiveté, but could it also be an option to read the method from the promotion file? echo slow /my/promotion/path That would work without any default naming scheme, and only incurs a read on the file-handle. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 04/10/2013 09:10 AM, Tom Lane wrote: IOW, I wouldn't consider skipping the rsync even if I had a feature like this. Totally. Out in the field, we consider the old database corrupt the moment we fail over. There is literally no way to verify the safety of any data along the broken chain, given race conditions and multiple potential failure points. The only potential use case for this that I can see, would be for system maintenance and a controlled failover. I agree: that's a major PITA when doing DR testing, but I personally don't think this is the way to fix that particular edge case. Maybe checksums will fix this in the long run... I don't know. DRBD has a handy block-level verify function for things like this, and it can re-sync master/slave data by comparing the commit log across the servers if you tell it one node should be considered incorrect. The thing is... we have clogs, and we have WAL. If we can assume bidirectional communication and verification (checksum comparison?) of both of those components, the database *should* be able to re-sync itself. Even if that were possible given the internals, I can't see anyone jumping on this before 9.4 or 9.5 unless someone sponsors the feature. Automatic re-sync would (within available WALs) be an awesome feature, though... -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 04/10/2013 11:40 AM, Fujii Masao wrote: Strange. If this is really true, shared disk failover solution is fundamentally broken because the standby needs to start up with the shared corrupted database at the failover. How so? Shared disk doesn't use replication. The point I was trying to make is that replication requires synchronization between two disparate servers, and verifying they have exactly the same data is a non-trivial exercise. Even a single transaction after a failover (effectively) negates the old server because there's no easy catch up mechanism yet. Even if this isn't necessarily true, it's the safest approach IMO. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 04/08/2013 05:34 AM, Samrat Revagade wrote: One solution to avoid this situation is have the master send WAL records to standby and wait for ACK from standby committing WAL files to disk and only after that commit data page related to this transaction on master. Isn't this basically what synchronous replication does in PG 9.1+? -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] count(*) of zero rows returns 1
On 01/15/2013 01:18 PM, Bruce Momjian wrote: AFAICR, 'dual' table has one column named 'DUMMY' and one row with value, single character X. How elegant. :-( Let's see what EnterpriseDB produces: test= select * from dual; dummy --- X (1 row) Yep, elegant gets my vote. ;) But then again, Oracle also created VARCHAR2 and told everyone to start using that, just in case they ever modified VARCHAR to be SQL compliant. Thankfully we have you guys, so PG won't go down a similar route. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous commit not... synchronous?
On 11/02/2012 12:31 PM, Simon Riggs wrote: If people want full two phase commit, that option exists also. I was about to say... isn't that what savepoints are for? -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/12/2012 12:31 AM, Daniel Farina wrote: But RAID-1 as nominally seen is a fundamentally different problem, with much tinier differences in latency, bandwidth, and connectivity. Perhaps useful for study, but to suggest the problem is *that* similar I think is wrong. Well, yes and no. One of the reasons I brought up DRBD was because it's basically RAID-1 over a network interface. It's not without overhead, but a few basic pgbench tests show it's still 10-15% faster than a synchronous PG setup for two servers in the same rack. Greg Smith's tests show that beyond a certain point, a synchronous PG setup effectively becomes untenable simply due to network latency in the protocol implementation. In reality, it probably wouldn't be usable beyond two servers in different datacenters in the same city. RAID-1 was the model for DRBD, but I brought it up only because it's pretty much the definition of a synchronous commit that degrades gracefully. I'd even suggest it's more important in a network context than for RAID-1, because you're far more likely to get sync interruptions due to network issues than you are for a disk to fail. But, putting that aside, why not write a piece of middleware that does precisely this, or whatever you want? It can live on the same machine as Postgres and ack synchronous commit when nobody is home, and notify (e.g. page) you in the most precise way you want if nobody is home for a while. You're right that there are lots of ways to kinda get this ability, they're just not mature enough or capable enough to really matter. Tailing the log to watch for secondary disconnect is too slow. Monit or Nagios style checks are too slow and unreliable. A custom-built middle-layer (a master-slave plugin for Pacemaker, for example) is too slow. All of these would rely on some kind of check interval. Set that too high, and we get 10,000xn missed transactions for n seconds. Too low, and we'd increase the likelihood of false positives and unnecessary detachments. If it's possible through a PG 9.x extension, that'd probably be the way to *safely* handle it as a bolt-on solution. If the original author of the patch can convert it to such a beast, we'd install it approximately five seconds after it finished compiling. So far as transaction durability is concerned... we have a continuous background rsync over dark fiber for archived transaction logs, DRBD for block-level sync, filesystem snapshots for our backups, a redundant async DR cluster, an offsite backup location, and a tape archival service stretching back for seven years. And none of that will cause the master to stop processing transactions unless the master itself dies and triggers a failover. Using PG sync in its current incarnation would introduce an extra failure scenario that wasn't there before. I'm pretty sure we're not the only ones avoiding it for exactly that reason. Our queue discards messages it can't fulfil within ten seconds and then throws an error for each one. We need to decouple the secondary as quickly as possible if it becomes unresponsive, and there's really no way to do that without something in the database, one way or another. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/12/2012 12:02 PM, Bruce Momjian wrote: Well, the problem also exists if add it as an internal database feature --- how long do we wait to consider the standby dead, how do we inform administrators, etc. True. Though if there is no secondary connected, either because it's not there yet, or because it disconnected, that's an easy check. It's the network lag/stall detection that's tricky. I don't think anyone says the feature is useless, but is isn't going to be a simple boolean either. Oh $Deity no. I'd never suggest that. I just tend to be overly verbose, and sometimes my intent gets lost in the rambling as I try to explain my perspective. I apologize if it somehow came across that anyone could just flip a switch and have it work. My C is way too rusty, or I'd be writing an extension right now to do this, or be looking over that patch I linked to originally to make suitable adaptations. I know I talk about how relatively handy DRBD is, but it's also a gigantic PITA since it has to exist underneath the actual filesystem. :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/10/2012 06:02 PM, Daniel Farina wrote: For example, what if DRBD can only complete one page per second for some reason? Does it it simply have the primary wait at this glacial pace, or drop synchronous replication and go degraded? Or does it do something more clever than just a timeout? That's a good question, and way beyond what I know about the internals. :) In practice though, there are configurable thresholds, and if exceeded, it will invalidate the secondary. When using Pacemaker, we've actually had instances where the 10G link we had between the servers died, so each node thought the other was down. That lead to the secondary node self-promoting and trying to steal the VIP from the primary. Throw in a gratuitous arp, and you get a huge mess. That lead to what DRBD calls split-brain, because both nodes were running and writing to the block device. Thankfully, you can actually tell one node to discard its changes and re-subscribe. Doing that will replay the transactions from the good node on the bad one. And even then, it's a good idea to run an online verify to do a block-by-block checksum and correct any differences. Of course, all of that's only possible because it's a block-level replication. I can't even imagine PG doing anything like that. It would have to know the last good transaction from the primary and do an implied PIT recovery to reach that state, then re-attach for sync commits. Regardless of what DRBD does, I think the problem with the async/sync duality as-is is there is no nice way to manage exposure to transaction loss under various situations and requirements. Which would be handy. With synchronous commits, it's given that the protocol is bi-directional. Then again, PG can detect when clients disconnect the instant they do so, and having such an event implicitly disable synchronous_standby_names until reconnect would be an easy fix. The database already keeps transaction logs, so replaying would still happen on re-attach. It could easily throw a warning for every sync-required commit so long as it's in degraded mode. Those alone are very small changes that don't really harm the intent of sync commit. That's basically what a RAID-1 does, and people have been fine with that for decades. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/10/2012 01:11 AM, Daniel Farina wrote: So if I get this straight, what you are saying is be asynchronous replication unless someone is around, in which case be synchronous is the mode you want. Er, no. I think I see where you might have gotten that, but no. This is a pretty tricky definition: consider if you bring a standby on-line from archive replay and it shows up in streaming with pretty high lag, and stops all commit traffic while it reaches the bounded window of what acceptable lag is. That sounds pretty terrible, too. How does DBRD handle this? It seems like the catchup phase might be interesting prior art. Well, DRBD actually has a very definitive sync mode, and no attenuation is involved at all. Here's what a fully working cluster looks like, according to /proc/drbd: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate Here's what happens when I disconnect the secondary: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown So there's a few things here: 1. Primary is waiting for the secondary to reconnect. 2. It knows its own data is still up to date. 3. It's waiting to assess the secondary when it re-appears 4. It's still capable of writing to the device. This is more akin to degraded RAID-1. Writes are synchronous as long as two devices exist, but if one vanishes, you can still use the disk at your own risk. Checking the status of DRBD will show this readily. I also want to point out it is *fully* synchronous when both nodes are available. I.e., you can't even call a filesystem sync without the sync succeeding on both nodes. When you re-connect a secondary device, it catches up as fast as possible by replaying waiting transactions, and then re-attaching to the cluster. Until it's fully caught-up, it doesn't exist. DRBD acknowledges the secondary is there and attempting to catch up, but does not leave degraded mode until the secondary reaches UpToDate status. This is a much more graceful failure scenario than is currently possible with PostgreSQL. With DRBD, you'd still need a tool to notice the master node is in an invalid state and perform a failover, but the secondary going belly-up will not suddenly halt the master. But I'm not even hoping for *that* level of functionality. I just want to be able to tell PostgreSQL to notice when the secondary becomes unavailable *on its own*, and then perform in degraded non-sync mode because it's much faster than any monitor I can possibly attach to perform the same function. I plan on using DRBD until either PG can do that, or a better alternative presents itself. Async is simply too slow for our OLTP system except for the disaster recovery node, which isn't expected to carry on within seconds of the primary's failure. I briefly considered sync mode when it appeared as a feature, but I see it's still too early in its development cycle, because there are no degraded operation modes. That's fine, I'm willing to wait. I just don't understand the push-back, I guess. RAID-1 is the poster child for synchronous writes for fault tolerance. It will whine constantly to anyone who will listen when operating only on one device, but at least it still works. I'm pretty sure nobody would use RAID-1 if its failure mode was: block writes until someone installs a replacement disk. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/09/2012 05:15 PM, Josh Berkus wrote: Total-consistency replication is what I think you want, that is, to guarantee that at any given time a read query on the master will return the same results as a read query on the standby. Heck, *most* people would like to have that. You would also be advancing database science in general if you could come up with a way to implement it. Doesn't having consistent transactional state across the systems imply that? So I'm unclear on why sync rep would be faster than async rep given that they use exactly the same mechanism. Explain? Too many mental gymnastics. I get that async is faster than sync, but the inconsistent transactional state makes it *look* slower. If a customer makes an order, but just happens to check that order state on the secondary before it can catch up, that's a net loss. Like I said, that's fine for our DR system, or a reporting mirror, or any one of several use-case scenarios, but it's not good enough for a failover when better alternatives exist. In this case, better alternatives are anything that can guarantee transaction durability: DRBD / PG sync. PG sync mode does what I want in that regard, it just has no graceful failure state without relatively invasive intervention. Theoretically we could write a Pacemaker agent, or some other simple harness, that just monitors both servers and performs an LSB HUP after modifying the primary node to disable synchronous_standby_names if the secondary dies, or promotes the secondary if the primary dies. But after being spoiled by DRBD knowing the instant the secondary disconnects, but still being available until the secondary is restored, we can't justifiably switch to something that will have the primary hang for ten seconds between monitor checks and service reloads. I'm just saying I considered it briefly during testing the last few days, but there's no way I can make a business case for it. PG sync rep is a great step forward, but it's not for us. Yet. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous Standalone Master Redoux
On 07/10/2012 09:40 AM, Heikki Linnakangas wrote: You are mistaken. It only guarantees that it's been sync'd to disk in the standby, but if there are open snapshots or the system is simply busy, it might takes minutes or more until the effects of that transaction become visible. Well, crap. It's subtle distinctions like this I wish I'd noticed before. Doesn't really affect our plans, it just makes sync rep even less viable for our use case. Thanks for the correction! :) -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Synchronous Standalone Master Redoux
Hey everyone, Upon doing some usability tests with PostgreSQL 9.1 recently, I ran across this discussion: http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php And after reading the entire thing, I found it odd that the overriding pushback was because nobody could think of a use case. The argument was: if you don't care if the slave dies, why not just use asynchronous replication? I'd like to introduce all of you to DRBD. DRBD is, for those who aren't familiar, distributed (network) block-level replication. Right now, this is what we're using, and will use in the future, to ensure a stable synchronous PostgreSQL copy on our backup node. I was excited to read about synchronous replication, because with it, came the possibility we could have two readable nodes with the servers we already have. You can't do that with DRBD; secondary nodes can't even mount the device. So here's your use case: 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy. 2. For whatever reason, slave crashes or becomes unavailable. 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears. 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed. 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because *intent* of synchronous replication is fulfilled. PostgreSQL's implementation means the master will block until someone/something notices and tells it to stop waiting, or the slave comes back. For pretty much any high-availability environment, this is not viable. Based on that alone, I can't imagine a scenario where synchronous replication would be considered beneficial. The current setup doubles unplanned system outage scenarios in such a way I'd never use it in a production environment. Right now, we only care if the master server dies. With sync rep, we'd have to watch both servers like a hawk and be ready to tell the master to disable sync rep, lest our 10k TPS system come to an absolute halt because the slave died. With DRBD, when a slave node goes offline, the master operates in standalone until the secondary re-appears, after which it re-synchronizes missing data, and then operates in sync mode afterwards. Just because the data is temporarily out of sync does *not* mean we want asynchronous replication. I think you'd be hard pressed to find many users taking advantage of DRBD's async mode. Just because data is temporarily catching up, doesn't mean it will remain in that state. I would *love* to have the functionality discussed in the patch. If I can make a case for it, I might even be able to convince my company to sponsor its addition, provided someone has time to integrate it. Right now, we're using DRBD so we can have a very short outage window while the offline node gets promoted, and it works, but that means a basically idle server at all times. I'd gladly accept a 10-20% performance hit for sync rep if it meant that other server could reliably act as a read slave. That's currently impossible because async replication is too slow, and sync is too fragile for reasons stated above. Am I totally off-base, here? I was shocked when I actually read the documentation on how sync rep worked, and saw that no servers would function properly until at least two were online. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-444-8534 stho...@optionshouse.com __ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Attempting upgrade path; is this possible?
I'm in charge of a very large database, and we're using a highly decrepit version of Postgresql currently. After searching through the archives, Google, and trying out several replication engines, I have a question. I had originally considered Slony-I, as it doesn't seem to require version compatibility between nodes like pgCluster, so upgrading from 7.4.2 to 8.1.3 would be a possible, if slow process. But after looking into the level of micro-management necessary, such as defining sets of every table on a per-database level, then having it add artificial primary-keys to applicable tables, it just doesn't seem like a good choice. Not a fault of Slony-I, but several multi-gig databases hosting hundreds of tables would be a nightmare to use with Slony-I. Then I thought about the backup/recovery system and the WAL files. Would this scenario be possible: 1. Do a pg_dumpall on the existing database running 7.4.2. 2. Do a psql -f foo template1 on the new database running 8.1.3. 3. Wait a very long time while the new database loads. 4. Shut down old database. 5. Start the new database in restore mode, and point it to the WAL files from the old database. 6. Wait for restore to finish. 7. Restart the new database. I wondered about this, as the pg_dumpall/restore would take a very long time for a 50GB database cluster, but theoretically the WAL files would continue to accumulate on the old db while this loading was taking place. If the WAL formats were compatible, the total upgrade time would only be restricted to how long it took to replay the WAL files in the new database. Does the current format of the WAL files make this possible? If not, is such an option for the future? Thanks in advance. -- Shaun Thomas Database Administrator Leapfrog Online 847-440-8253 CONFIDENTIALITY NOTE The document(s) accompanying this e-mail transmission, if any, and the e-mail transmittal message contain information from Leapfrog Online Customer Acquisition, LLC is confidential or privileged. The information is intended to be for the use of the individual(s) or entity(ies) named on this e-mail transmission message. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of the contents of this e-mail is prohibited. If you have received this e-mail in error, please immediately delete this e-mail and notify us by telephone of the error. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org