Re: [HACKERS] PL/pgSQL 2

2014-09-04 Thread Shaun Thomas

On 09/01/2014 04:04 AM, Joel Jacobson wrote:


+ Make UPDATE/INSERT/DELETE throw an error if they didn't modify exactly 1
row, as that's the most common use case, and provide alternative syntax
to modify multiple or zero rows.


What? No. The whole point of SQL is that it's set-based and can modify 
multiple rows at once. Hobbling it specifically for functions seems 
fundamentally flawed. Especially for what we purport to be a core 
PostgreSQL language.



+ Change all warnings into errors


I... what? I could see coming up with a better exception handling 
mechanism for escalating messages. But you're talking about taking a 
core element of PostgreSQL (warnings) and simply ripping them out so 
plpgsql2 loses even that small functionality.
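
For reference, a hedged aside: 9.4-era plpgsql already has an opt-in
escalation knob along these lines, without ripping warnings out globally
(setting names as in the 9.4 docs):

SET plpgsql.extra_warnings = 'shadowed_variables';  -- warn only
SET plpgsql.extra_errors   = 'shadowed_variables';  -- promote to errors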


I'm sure you've put a lot of thought into this, but you're not the only 
person using plpgsql, or any potential replacement, however ambitious.


--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] PL/pgSQL 1.2

2014-09-04 Thread Shaun Thomas

On 09/03/2014 04:19 PM, Hannu Krosing wrote:


1. Conditions for number of rows returned by SELECT or touched by
UPDATE or DELETE


Now that I think upon this... don't we already have it?

SELECT ... LIMIT 1

That already solves the purported problem of multiple results in SELECT 
INTO as well. Could we possibly extend that to UPDATE and DELETE syntax too?
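
For reference, a hedged sketch of what plpgsql can already express along 
these lines: STRICT makes SELECT INTO raise an error unless exactly one 
row comes back, and GET DIAGNOSTICS exposes the row count after an 
UPDATE or DELETE. (Illustrative only; the table and column names are 
made up.)

-- Assumes a hypothetical table accounts(id int, balance numeric).
CREATE OR REPLACE FUNCTION debit(acct int, amt numeric) RETURNS void AS $$
DECLARE
    bal numeric;
    n   bigint;
BEGIN
    -- STRICT: errors unless the SELECT returns exactly one row.
    SELECT balance INTO STRICT bal FROM accounts WHERE id = acct;

    UPDATE accounts SET balance = balance - amt WHERE id = acct;
    GET DIAGNOSTICS n = ROW_COUNT;  -- rows touched by the UPDATE
    IF n <> 1 THEN
        RAISE EXCEPTION 'expected to update 1 row, got %', n;
    END IF;
END;
$$ LANGUAGE plpgsql;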


--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-05 Thread Shaun Thomas

On 08/05/2014 12:56 AM, Mark Kirkwood wrote:


The moral of the story for this case is that mapping Oracle to Postgres
datatypes can require some careful thought. Using 'native' types (like
integer, float8 etc) will generally give vastly quicker performance.


We've seen a lot of this ourselves. Oracle's NUMERIC is a native type, 
whereas ours is emulated. Judging by the performance, REAL would appear 
to be another emulated type.


At least you used INT though. I've seen too many Oracle shops using 
NUMERIC in PostgreSQL because it's there, and suffering major 
performance hits because of it.


That said, the documentation here says FLOAT4 is an alias for REAL, so 
it's somewhat nonintuitive for FLOAT4 to be so much slower than FLOAT8, 
which is an alias for DOUBLE PRECISION.


http://www.postgresql.org/docs/9.3/static/datatype.html

Not sure why that would be.
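
A hedged micro-benchmark along the lines of the comparison above 
(timings will vary by hardware; run in psql):

\timing on
SELECT sum(sqrt(i::float8))  FROM generate_series(1, 1000000) AS i;  -- native
SELECT sum(sqrt(i::numeric)) FROM generate_series(1, 1000000) AS i;  -- emulated, typically several times slower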

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Allowing parallel pg_restore from pipe

2013-05-16 Thread Shaun Thomas

On 05/16/2013 12:16 PM, Timothy Garnett wrote:


I think you'd have to be real careful around foreign key constraints
for that to work.


Not especially. All you need to do is bootstrap the database with a 
bunch of empty table targets (no constraints, keys, etc), then restore 
with the xargs command. Then you can apply the constraints, keys, and 
indexes later by doing a schema-only parallel pg_restore.
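
A hedged sketch of that sequence, assuming a custom-format dump in 
db.dump and a pg_restore new enough to support --section:

pg_restore --section=pre-data -d my_db db.dump        # bare tables, no indexes or FKs
# ... load the data in parallel here, e.g. via the xargs recipe ...
pg_restore --section=post-data -j 8 -d my_db db.dump  # constraints, keys, indexes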


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] high io BUT huge amount of free memory

2013-05-02 Thread Shaun Thomas

On 05/01/2013 06:37 PM, Bruce Momjian wrote:


Sorry to be dense here, but what is the problem with that output?
That there is a lot of memory marked as free?  Why would it mark
any memory free?


That's kind of my point. :) That 14GB isn't allocated to cache, buffers, 
any process, or anything else. It's just... free. In the middle of the 
day, when 800 PG connections are pulling 7,000 TPS on average. In that 
scenario, I'd expect the kernel to cache pretty aggressively, but 
instead, it's just leaving 14GB around doing absolutely nothing.


It makes me sad. :(

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] high io BUT huge amount of free memory

2013-05-02 Thread Shaun Thomas

On 05/02/2013 12:04 PM, Josh Berkus wrote:


There is a good, but sad, reason for this: IBM and Oracle and their
partners are the largest employers of people hacking on core Linux
memory/IO functionality, and both of those companies use DirectIO
extensively in their products.


I never thought of that. I'd have figured all the Red Hat engineers 
would somehow counterbalance that kind of influence.


But that brings up an interesting question. How hard or feasible would 
it be to add direct I/O support to PG itself? I've already heard chatter 
(Robert Haas?) about converting the shared memory allocation to an 
anonymous mapping, so could we simultaneously open up a direct I/O path?



--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Allowing parallel pg_restore from pipe

2013-04-25 Thread Shaun Thomas

On 04/25/2013 12:56 PM, Timothy Garnett wrote:


As the OP, I'll just note that my organization would definitely find use
for a parallel migrator tool as long as it supported doing a selection
of tables (i.e. -t / -T) in addition to the whole database and it
supported or we were able to patch in an option to cluster as part of
the migration (the equivalent of something like
https://github.com/tgarnett/postgres/commit/cc320a71 ).


If you need something like this short term, we actually found a way to 
do it ourselves for a migration we performed back in October. The secret 
is xargs with the -P option:


xargs -I{} -P 8 -a table-list.txt \
bash -c "pg_dump -Fc -t {} my_db | pg_restore -h remote -d my_db"

Fill table-list.txt with as many or as few tables as you want. The 
above example would give you eight parallel processes. Well-equipped 
systems may be able to increase this.


Admittedly it's a gross hack, but it works. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] high io BUT huge amount of free memory

2013-04-24 Thread Shaun Thomas

On 04/24/2013 08:24 AM, Robert Haas wrote:


Are you referring to the fact that vm.zone_reclaim_mode = 1 is an
idiotic default?


Well... it is. But even on systems where it's not the default or is 
explicitly disabled, there's just something hideously wrong with NUMA in 
general. Take a look at our NUMA distribution on a heavily loaded system:


available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 36853 MB
node 0 free: 14315 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 36863 MB
node 1 free: 300 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

What the hell? Seriously? Using numactl and starting in interleave mode 
didn't fix this, either. It just... arbitrarily ignores a huge chunk of 
memory for no discernible reason.


The memory pressure code in Linux is extremely fucked up. I can't find 
the reference right now, but the memory management algorithm makes some 
pretty ridiculous assumptions once you pass half memory usage about what 
belongs in active versus inactive cache.


I hate to rant, but it gets clearer to me every day that Linux is 
optimized for desktop systems, and generally only kinda works for 
servers. Once you start throwing vast amounts of memory, CPU, and 
processes at it though, things start to get unpredictable.


That all goes back to my earlier threads about how disabling process 
autogrouping via the kernel.sched_autogroup_enabled setting magically 
gave us 20-30% better performance. The optimal setting for a server is 
clearly to disable process autogrouping, and yet it's enabled by 
default, and strongly advocated by Linus himself as a vast improvement.
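
For reference, the two knobs in question, in a hedged sketch; both are 
stock Linux sysctls, shown with the server-friendly values argued for 
here:

sysctl -w vm.zone_reclaim_mode=0            # don't prefer local reclaim over the remote node
sysctl -w kernel.sched_autogroup_enabled=0  # disable desktop-oriented autogrouping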


I get it. It's better for desktop systems. But the LAMP stack alone has 
probably a couple orders of magnitude more use cases than Joe Blow's 
Pentium 4 in his basement. Yet it's the latter case that's optimized for.


Servers are getting shafted in a lot of cases, and it's actually 
starting to make me angry.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] high io BUT huge amount of free memory

2013-04-24 Thread Shaun Thomas

On 04/24/2013 08:49 AM, Andres Freund wrote:


Uh. Ranting can be a rather healthy thing every now and then; it's good
for the soul and such. But. Did you actually try reporting those issues?


That's actually part of the problem. How do you report: "throwing a lot 
of processes at a high-memory system seems to break the mm code in 
horrible ways"?


I'm asking seriously here, because I have no clue how to isolate this 
behavior. It's clearly happening often enough that random people are 
starting to notice now that bigger servers are becoming the norm.


I'm also not a kernel dev in any sense of the word. My C is so rusty, I 
can barely even read the patches going through the ML. I feel 
comfortable posting to PG lists because that's my bread and butter. 
Kernel lists seem way more imposing, and I'm probably not the only one 
who feels that way.


I guess I don't mean to imply that kernel devs don't care. Maybe the 
right way to put it is that there don't seem to be enough kernel devs 
being provided with more capable testing hardware. Which is odd, 
considering Red Hat's involvement and activity on the kernel.


I apologize, though. These last few months have been really frustrating 
thanks to this and other odd kernel-related issues. We've reached an 
equilibrium where the occasional waste of 20GB of RAM doesn't completely 
cripple the system, but this thread hit a bit of a sore spot. :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] high io BUT huge amount of free memory

2013-04-23 Thread Shaun Thomas

On 04/22/2013 05:12 PM, Merlin Moncure wrote:


free -g
             total       used       free     shared    buffers     cached
Mem:           378        250        128          0          0        229
-/+ buffers/cache:          20        357


This is most likely a NUMA issue. There really seems to be some kind of 
horrible flaw in the Linux kernel when it comes to properly handling 
NUMA on large memory systems.


What does this say:

numactl --hardware

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Fast promotion, loose ends

2013-04-22 Thread Shaun Thomas

On 04/22/2013 02:58 AM, Simon Riggs wrote:


So, to initiate promotion, you can create a file called
$DATADIR/fast_promote or $DATADIR/promote


Pardon my naiveté, but could it also be an option to read the method
from the promotion file?

echo slow > /my/promotion/path

That would work without any default naming scheme, and only incurs a
read on the file-handle.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 09:10 AM, Tom Lane wrote:


IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.


Totally. Out in the field, we consider the old database corrupt the 
moment we fail over. There is literally no way to verify the safety of 
any data along the broken chain, given race conditions and multiple 
potential failure points.


The only potential use case for this that I can see, would be for system 
maintenance and a controlled failover. I agree: that's a major PITA when 
doing DR testing, but I personally don't think this is the way to fix 
that particular edge case.


Maybe checksums will fix this in the long run... I don't know. DRBD has 
a handy block-level verify function for things like this, and it can 
re-sync master/slave data by comparing the commit log across the servers 
if you tell it one node should be considered incorrect.
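
(The DRBD facility referred to above, in a hedged sketch; the resource 
name r0 is an assumption:)

drbdadm verify r0      # online block-by-block checksum comparison
drbdadm invalidate r0  # declare this node's data bad, forcing a re-sync from the peer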


The thing is... we have clogs, and we have WAL. If we can assume 
bidirectional communication and verification (checksum comparison?) of 
both of those components, the database *should* be able to re-sync itself.


Even if that were possible given the internals, I can't see anyone 
jumping on this before 9.4 or 9.5 unless someone sponsors the feature.


Automatic re-sync would (within available WALs) be an awesome feature, 
though...


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 11:40 AM, Fujii Masao wrote:


Strange. If this is really true, shared disk failover solution is
fundamentally broken because the standby needs to start up with the
shared corrupted database at the failover.


How so? Shared disk doesn't use replication. The point I was trying to 
make is that replication requires synchronization between two disparate 
servers, and verifying they have exactly the same data is a non-trivial 
exercise. Even a single transaction after a failover (effectively) 
negates the old server because there's no easy catch-up mechanism yet.


Even if this isn't necessarily true, it's the safest approach IMO.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-08 Thread Shaun Thomas

On 04/08/2013 05:34 AM, Samrat Revagade wrote:


One solution to avoid this situation is to have the master send WAL
records to the standby and wait for an ACK that the standby has
committed the WAL to disk, and only after that commit the data page
related to this transaction on the master.

Isn't this basically what synchronous replication does in PG 9.1+?
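
For reference, a hedged sketch of the 9.1-style settings in question; 
the standby name is an assumption:

# postgresql.conf on the master (PostgreSQL 9.1+)
synchronous_standby_names = 'standby1'  # matches the standby's application_name
synchronous_commit = on                 # commits wait for the standby's flush ACK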

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] count(*) of zero rows returns 1

2013-01-15 Thread Shaun Thomas

On 01/15/2013 01:18 PM, Bruce Momjian wrote:


AFAICR, the 'dual' table has one column named 'DUMMY' and one row whose
value is the single character X.


How elegant.  :-(


Let's see what EnterpriseDB produces:

test=# select * from dual;

 dummy
---
 X
(1 row)

Yep, elegant gets my vote. ;) But then again, Oracle also created 
VARCHAR2 and told everyone to start using that, just in case they ever 
modified VARCHAR to be SQL compliant. Thankfully we have you guys, so PG 
won't go down a similar route.
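
(A hedged aside: stock PostgreSQL doesn't need dual at all, since FROM 
is optional, though one is trivial to fake for ported code:)

SELECT 1 + 1;  -- no FROM clause required

CREATE VIEW dual AS SELECT 'X'::varchar AS dummy;
SELECT * FROM dual;  -- one row, as in Oracle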


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [HACKERS] Synchronous commit not... synchronous?

2012-11-02 Thread Shaun Thomas

On 11/02/2012 12:31 PM, Simon Riggs wrote:


If people want full two phase commit, that option exists also.


I was about to say...  isn't that what savepoints are for?


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com



Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-12 Thread Shaun Thomas

On 07/12/2012 12:31 AM, Daniel Farina wrote:


But RAID-1 as nominally seen is a fundamentally different problem,
with much tinier differences in latency, bandwidth, and connectivity.
Perhaps useful for study, but to suggest the problem is *that* similar
I think is wrong.


Well, yes and no. One of the reasons I brought up DRBD was because it's 
basically RAID-1 over a network interface. It's not without overhead, 
but a few basic pgbench tests show it's still 10-15% faster than a 
synchronous PG setup for two servers in the same rack. Greg Smith's 
tests show that beyond a certain point, a synchronous PG setup 
effectively becomes untenable simply due to network latency in the 
protocol implementation. In reality, it probably wouldn't be usable 
beyond two servers in different datacenters in the same city.


RAID-1 was the model for DRBD, but I brought it up only because it's 
pretty much the definition of a synchronous commit that degrades 
gracefully. I'd even suggest it's more important in a network context 
than for RAID-1, because you're far more likely to get sync 
interruptions due to network issues than you are for a disk to fail.



But, putting that aside, why not write a piece of middleware that
does precisely this, or whatever you want? It can live on the same
machine as Postgres and ack synchronous commit when nobody is home,
and notify (e.g. page) you in the most precise way you want if nobody
is home for a while.


You're right that there are lots of ways to kinda get this ability, 
they're just not mature enough or capable enough to really matter. 
Tailing the log to watch for secondary disconnect is too slow. Monit- or 
Nagios-style checks are too slow and unreliable. A custom-built 
middle layer (a master-slave plugin for Pacemaker, for example) is too 
slow. All of these would rely on some kind of check interval. Set that 
too high, and we miss 10,000×n transactions for n seconds of outage. Too 
low, and we'd increase the likelihood of false positives and unnecessary 
detachments.


If it's possible through a PG 9.x extension, that'd probably be the way 
to *safely* handle it as a bolt-on solution. If the original author of 
the patch can convert it to such a beast, we'd install it approximately 
five seconds after it finished compiling.


So far as transaction durability is concerned... we have a continuous 
background rsync over dark fiber for archived transaction logs, DRBD for 
block-level sync, filesystem snapshots for our backups, a redundant 
async DR cluster, an offsite backup location, and a tape archival 
service stretching back for seven years. And none of that will cause the 
master to stop processing transactions unless the master itself dies and 
triggers a failover.


Using PG sync in its current incarnation would introduce an extra 
failure scenario that wasn't there before. I'm pretty sure we're not the 
only ones avoiding it for exactly that reason. Our queue discards 
messages it can't fulfil within ten seconds and then throws an error for 
each one. We need to decouple the secondary as quickly as possible if it 
becomes unresponsive, and there's really no way to do that without 
something in the database, one way or another.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-12 Thread Shaun Thomas

On 07/12/2012 12:02 PM, Bruce Momjian wrote:


Well, the problem also exists if we add it as an internal database
feature --- how long do we wait to consider the standby dead, how do
we inform administrators, etc.


True. Though if there is no secondary connected, either because it's not 
there yet, or because it disconnected, that's an easy check. It's the 
network lag/stall detection that's tricky.



I don't think anyone says the feature is useless, but it isn't going
to be a simple boolean either.


Oh $Deity no. I'd never suggest that. I just tend to be overly verbose, 
and sometimes my intent gets lost in the rambling as I try to explain my 
perspective. I apologize if it somehow came across that anyone could 
just flip a switch and have it work.


My C is way too rusty, or I'd be writing an extension right now to do 
this, or be looking over that patch I linked to originally to make 
suitable adaptations. I know I talk about how relatively handy DRBD is, 
but it's also a gigantic PITA since it has to exist underneath the 
actual filesystem. :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-11 Thread Shaun Thomas

On 07/10/2012 06:02 PM, Daniel Farina wrote:


For example, what if DRBD can only complete one page per second for
some reason?  Does it simply have the primary wait at this glacial
pace, or drop synchronous replication and go degraded?  Or does it do
something more clever than just a timeout?


That's a good question, and way beyond what I know about the internals. 
:) In practice though, there are configurable thresholds which, if 
exceeded, will invalidate the secondary. When using Pacemaker, we've 
actually had instances where the 10G link between the servers died, so 
each node thought the other was down. That led to the secondary node 
self-promoting and trying to steal the VIP from the primary. Throw in a 
gratuitous ARP, and you get a huge mess.


That led to what DRBD calls split-brain, because both nodes were 
running and writing to the block device. Thankfully, you can actually 
tell one node to discard its changes and re-subscribe. Doing that will 
replay the transactions from the good node on the bad one. And even 
then, it's a good idea to run an online verify to do a block-by-block 
checksum and correct any differences.
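
(A hedged sketch of that recovery using standard drbdadm commands; the 
resource name r0 is an assumption:)

# On the node whose changes should be discarded:
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# On the survivor, if it also dropped the connection:
drbdadm connect r0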


Of course, all of that's only possible because it's a block-level 
replication. I can't even imagine PG doing anything like that. It would 
have to know the last good transaction from the primary and do an 
implied PIT recovery to reach that state, then re-attach for sync commits.



Regardless of what DRBD does, I think the problem with the
async/sync duality as-is is there is no nice way to manage exposure
to transaction loss under various situations and requirements.


Which would be handy. With synchronous commits, it's a given that the 
protocol is bi-directional. Then again, PG can detect when clients 
disconnect the instant they do so, and having such an event implicitly 
disable synchronous_standby_names until reconnect would be an easy fix. 
The database already keeps transaction logs, so replaying would still 
happen on re-attach. It could easily throw a warning for every 
sync-required commit so long as it's in degraded mode. Those alone are 
very small changes that don't really harm the intent of sync commit.
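
For comparison, the manual escape hatch that exists today, in a hedged 
sketch; this is the opt-out the above would automate:

synchronous_standby_names = ''   # in postgresql.conf, then pg_ctl reload
SET synchronous_commit = local;  -- or per session, keep only local durability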


That's basically what a RAID-1 does, and people have been fine with that 
for decades.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-10 Thread Shaun Thomas

On 07/10/2012 01:11 AM, Daniel Farina wrote:


So if I get this straight, what you are saying is "be asynchronous
replication unless someone is around, in which case be synchronous" is
the mode you want.


Er, no. I think I see where you might have gotten that, but no.


This is a pretty tricky definition: consider if you bring a standby
on-line from archive replay and it shows up in streaming with pretty
high lag, and stops all commit traffic while it reaches the bounded
window of what acceptable lag is. That sounds pretty terrible, too.
How does DRBD handle this? It seems like the catchup phase might be
interesting prior art.


Well, DRBD actually has a very definitive sync mode, and no 
attenuation is involved at all. Here's what a fully working cluster 
looks like, according to /proc/drbd:


cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate

Here's what happens when I disconnect the secondary:

cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown

So there's a few things here:

1. Primary is waiting for the secondary to reconnect.
2. It knows its own data is still up to date.
3. It's waiting to assess the secondary when it re-appears
4. It's still capable of writing to the device.

This is more akin to degraded RAID-1. Writes are synchronous as long as 
two devices exist, but if one vanishes, you can still use the disk at 
your own risk. Checking the status of DRBD will show this readily. I 
also want to point out it is *fully* synchronous when both nodes are 
available. I.e., you can't even call a filesystem sync without the sync 
succeeding on both nodes.


When you re-connect a secondary device, it catches up as fast as 
possible by replaying waiting transactions, and then re-attaching to the 
cluster. Until it's fully caught-up, it doesn't exist. DRBD acknowledges 
the secondary is there and attempting to catch up, but does not leave 
degraded mode until the secondary reaches UpToDate status.


This is a much more graceful failure scenario than is currently possible 
with PostgreSQL. With DRBD, you'd still need a tool to notice the master 
node is in an invalid state and perform a failover, but the secondary 
going belly-up will not suddenly halt the master.


But I'm not even hoping for *that* level of functionality. I just want 
to be able to tell PostgreSQL to notice when the secondary becomes 
unavailable *on its own*, and then perform in degraded non-sync mode 
because it's much faster than any monitor I can possibly attach to 
perform the same function. I plan on using DRBD until either PG can do 
that, or a better alternative presents itself.


Async is simply too slow for our OLTP system except for the disaster 
recovery node, which isn't expected to carry on within seconds of the 
primary's failure. I briefly considered sync mode when it appeared as a 
feature, but I see it's still too early in its development cycle, 
because there are no degraded operation modes. That's fine, I'm willing 
to wait.


I just don't understand the push-back, I guess. RAID-1 is the poster 
child for synchronous writes for fault tolerance. It will whine 
constantly to anyone who will listen when operating only on one device, 
but at least it still works. I'm pretty sure nobody would use RAID-1 if 
its failure mode was: block writes until someone installs a replacement 
disk.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-10 Thread Shaun Thomas

On 07/09/2012 05:15 PM, Josh Berkus wrote:



Total-consistency replication is what I think you want, that is, to
guarantee that at any given time a read query on the master will return
the same results as a read query on the standby.  Heck, *most* people
would like to have that.  You would also be advancing database science
in general if you could come up with a way to implement it.


Doesn't having consistent transactional state across the systems imply that?


So I'm unclear on why sync rep would be faster than async rep given
that they use exactly the same mechanism.  Explain?


Too many mental gymnastics. I get that async is faster than sync, but 
the inconsistent transactional state makes it *look* slower. If a 
customer places an order, but happens to check that order's state on the 
secondary before it can catch up, that's a net loss. Like I said, that's 
fine for our DR system, or a reporting mirror, or any one of several 
use-case scenarios, but it's not good enough for a failover when better 
alternatives exist. In this case, better alternatives are anything that 
can guarantee transaction durability: DRBD / PG sync.


PG sync mode does what I want in that regard; it just has no graceful 
failure state without relatively invasive intervention. Theoretically we 
could write a Pacemaker agent, or some other simple harness, that just 
monitors both servers and performs an LSB HUP after modifying the 
primary node to disable synchronous_standby_names if the secondary dies, 
or promotes the secondary if the primary dies. But after being spoiled 
by DRBD, which notices the instant the secondary disconnects yet remains 
available until it's restored, we can't justifiably switch to something 
that would have the primary hang for ten seconds between monitor checks 
and service reloads.


I'm just saying I considered it briefly during testing the last few 
days, but there's no way I can make a business case for it. PG sync rep 
is a great step forward, but it's not for us. Yet.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [HACKERS] Synchronous Standalone Master Redoux

2012-07-10 Thread Shaun Thomas

On 07/10/2012 09:40 AM, Heikki Linnakangas wrote:


You are mistaken. It only guarantees that it's been sync'd to disk in
the standby, but if there are open snapshots or the system is simply
busy, it might take minutes or more until the effects of that
transaction become visible.


Well, crap. It's subtle distinctions like this I wish I'd noticed 
before. Doesn't really affect our plans, it just makes sync rep even 
less viable for our use case. Thanks for the correction! :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





[HACKERS] Synchronous Standalone Master Redoux

2012-07-09 Thread Shaun Thomas

Hey everyone,

Upon doing some usability tests with PostgreSQL 9.1 recently, I ran 
across this discussion:


http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php

And after reading the entire thing, I found it odd that the overriding 
pushback was because nobody could think of a use case. The argument was: 
if you don't care if the slave dies, why not just use asynchronous 
replication?


I'd like to introduce all of you to DRBD. DRBD is, for those who aren't 
familiar, distributed (network) block-level replication. Right now, this 
is what we're using, and will use in the future, to ensure a stable 
synchronous PostgreSQL copy on our backup node. I was excited to read 
about synchronous replication, because with it, came the possibility we 
could have two readable nodes with the servers we already have. You 
can't do that with DRBD; secondary nodes can't even mount the device.


So here's your use case:

1. Slave wants to be synchronous with master. Master wants replication 
on at least one slave. They have this, and are happy.

2. For whatever reason, slave crashes or becomes unavailable.
3. Master notices no more slaves are available, and operates in 
standalone mode, accumulating WAL files until a suitable slave appears.
4. Slave finishes rebooting/rebuilding/upgrading/whatever, and 
re-subscribes to the feed.
5. Slave stays in degraded sync (asynchronous) mode until it is caught 
up, and then switches to synchronous. This makes both master and slave 
happy, because *intent* of synchronous replication is fulfilled.


PostgreSQL's implementation means the master will block until 
someone/something notices and tells it to stop waiting, or the slave 
comes back. For pretty much any high-availability environment, this is 
not viable. Based on that alone, I can't imagine a scenario where 
synchronous replication would be considered beneficial.


The current setup doubles unplanned system outage scenarios in such a 
way that I'd never use it in a production environment. Right now, we only 
care if the master server dies. With sync rep, we'd have to watch both 
servers like a hawk and be ready to tell the master to disable sync rep, 
lest our 10k TPS system come to an absolute halt because the slave died.


With DRBD, when a slave node goes offline, the master operates in 
standalone mode until the secondary re-appears, after which it 
re-synchronizes the missing data and resumes sync mode. 
Just because the data is temporarily out of sync does *not* mean we want 
asynchronous replication. I think you'd be hard pressed to find many 
users taking advantage of DRBD's async mode. Just because data is 
temporarily catching up doesn't mean it will remain in that state.


I would *love* to have the functionality discussed in the patch. If I 
can make a case for it, I might even be able to convince my company to 
sponsor its addition, provided someone has time to integrate it. Right 
now, we're using DRBD so we can have a very short outage window while 
the offline node gets promoted, and it works, but that means a basically 
idle server at all times. I'd gladly accept a 10-20% performance hit for 
sync rep if it meant that other server could reliably act as a read 
slave. That's currently impossible because async replication is too 
slow, and sync is too fragile for reasons stated above.


Am I totally off-base, here? I was shocked when I actually read the 
documentation on how sync rep worked, and saw that no servers would 
function properly until at least two were online.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com




[HACKERS] Attempting upgrade path; is this possible?

2006-02-22 Thread Shaun Thomas
I'm in charge of a very large database, and we're currently running a highly 
decrepit version of PostgreSQL.  After searching through the archives and 
Google, and trying out several replication engines, I have a question.

I had originally considered Slony-I, as it doesn't seem to require version 
compatibility between nodes the way pgCluster does, so upgrading from 7.4.2 to 
8.1.3 would be a possible, if slow, process.  But after looking into the level 
of micro-management necessary, such as defining sets of every table on a 
per-database level, then having it add artificial primary keys to applicable 
tables, it just doesn't seem like a good choice.  That's not a fault of 
Slony-I, but several multi-gig databases hosting hundreds of tables would be a 
nightmare to manage with it.

Then I thought about the backup/recovery system and the WAL files.  Would this 
scenario be possible:

1. Do a pg_dumpall on the existing database running 7.4.2.
2. Do a psql -f foo template1 on the new database running 8.1.3.
3. Wait a very long time while the new database loads.
4. Shut down old database.
5. Start the new database in restore mode, and point it to the WAL
   files from the old database.
6. Wait for restore to finish.
7. Restart the new database.
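
A hedged sketch of steps 1 and 2 above (ports and paths are made up; the 
8.1.3 pg_dumpall does the dumping against the old cluster):

/opt/pgsql-8.1.3/bin/pg_dumpall -p 5432 > foo        # old 7.4.2 cluster on 5432
/opt/pgsql-8.1.3/bin/psql -p 5433 -f foo template1   # new 8.1.3 cluster on 5433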

I wondered about this because the pg_dumpall/restore would take a very long 
time for a 50GB database cluster, but theoretically the WAL files would 
continue to accumulate on the old db while the loading was taking place.  If 
the WAL formats were compatible, the total upgrade time would be limited to 
how long it took to replay the WAL files in the new database.  Does the 
current format of the WAL files make this possible?  If not, is such an 
option planned for the future?

Thanks in advance.
-- 

Shaun Thomas
Database Administrator
Leapfrog Online
847-440-8253
