Re: [HACKERS] Partitioning performance: cache stringToNode() of pg_constraint.ccbin

2013-06-06 Thread Amit Kapila
On Friday, June 07, 2013 2:10 AM Noah Misch wrote:
> On Thu, Jun 06, 2013 at 07:02:27PM +0530, Amit Kapila wrote:
> > On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote:
> 
> > This patch can give a good performance gain in the scenario described
> > by you.
> > In fact I had taken the readings with the patch; it shows a similar gain.
> 
> Thanks for testing.
> 
> > This patch would increase the relcache size, as for each constraint on
> > table it would increase 4 bytes irrespective of whether that can give
> > performance benefit or not.
> 
> Yep, sizeof(Node *) bytes.

So the memory increase numbers would look like this:

Example for a 64-bit machine:
In a database with 100 tables, each having 2 constraints, and 30 live
connections:

Size increase = no. of tables * sizeof(Node *) * number of constraints *
                no. of live sessions
  = 100 * 8 * 2 * 30
  = 48000 bytes (~46.9K)

So if such a memory increase seems reasonable, then I think the patch would
be really beneficial for loading data into inherited tables.

If no one has objections on this part, then I think this is a really useful
patch for PostgreSQL.


With Regards,
Amit Kapila.





Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Daniel Farina
On Thu, Jun 6, 2013 at 9:30 PM, Jeff Janes  wrote:
> I would oppose that as the solution, either an unconditional one, or
> configurable with it as the default.  Those segments are not unneeded.  I
> need them.  That is why I set up archiving in the first place.  If you need
> to shut down the database rather than violate my established retention
> policy, then shut down the database.

Same boat.  My archives are the real storage.  The disks are
write-back caching.  That's because the storage of my archives is
probably three to five orders of magnitude more reliable.




Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Joshua D. Drake


On 06/06/2013 09:30 PM, Jeff Janes wrote:


Archiving
-

In some ways, this is the simplest case.  Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage.  Once we stop attempting to archive, we can
clean up the unneeded log segments.


I would oppose that as the solution, either an unconditional one, or
configurable with it as the default.  Those segments are not
unneeded.  I need them.  That is why I set up archiving in the first
place.  If you need to shut down the database rather than violate my
established retention policy, then shut down the database.


Agreed, and I would oppose it even as a configurable option. We set up the 
archiving for a reason. I do think it might be useful to be able to 
store archiving logs as well as wal_keep_segments logs in a different 
location than pg_xlog.




What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind.  Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.


The archive command can be made a shell script (or for that matter a
compiled program) which can do anything it wants upon failure, including
emailing people.



Yep, that is what PITRTools does. You can make it do whatever you want.


JD





Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Jeff Janes
On Thursday, June 6, 2013, Josh Berkus wrote:

> Let's talk failure cases.
>
> There's actually three potential failure cases here:
>
> - One Volume: WAL is on the same volume as PGDATA, and that volume is
> completely out of space.
>
> - XLog Partition: WAL is on its own partition/volume, and fills it up.
>
> - Archiving: archiving is failing or too slow, causing the disk to fill
> up with waiting log segments.
>
> I'll argue that these three cases need to be dealt with in three
> different ways, and no single solution is going to work for all three.
>
> Archiving
> -
>
> In some ways, this is the simplest case.  Really, we just need a way to
> know when the available WAL space has become 90% full, and abort
> archiving at that stage.  Once we stop attempting to archive, we can
> clean up the unneeded log segments.
>

I would oppose that as the solution, either an unconditional one, or
configurable with it as the default.  Those segments are not unneeded.
 I need them.  That is why I set up archiving in the first place.  If you
need to shut down the database rather than violate my established retention
policy, then shut down the database.



> What we need is a better way for the DBA to find out that archiving is
> falling behind when it first starts to fall behind.  Tailing the log and
> examining the rather cryptic error messages we give out isn't very
> effective.
>

The archive command can be made a shell script (or for that matter a compiled
program) which can do anything it wants upon failure, including emailing
people.  Of course maybe whatever causes the archive to fail will also
cause the delivery of the message to fail, but I don't see a real solution
to this that doesn't start down an infinite regress.  If it is not failing
outright, but merely falling behind, then I don't really know how to go
about detecting that, either in archive_command, or through tailing the
PostgreSQL log.  I guess archive_command, each time it is invoked, could
count the files in the pg_xlog directory and warn if it thinks the number
is unreasonable.
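
To make the "compiled program" variant concrete, here is a rough sketch of
just that counting-and-warning half; the copy-to-archive step, the threshold
of 100 segments, and the alerting mechanism (just stderr here) are
placeholder assumptions, not a recommendation:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* WAL segment names are 24 hex characters (timeline + log + seg). */
static int
looks_like_wal_segment(const char *name)
{
    int         i;

    if (strlen(name) != 24)
        return 0;
    for (i = 0; i < 24; i++)
        if (!isxdigit((unsigned char) name[i]))
            return 0;
    return 1;
}

int
main(void)
{
    DIR        *dir = opendir("pg_xlog");   /* archive_command runs with cwd = $PGDATA */
    struct dirent *de;
    int         nsegments = 0;

    if (dir == NULL)
        return 1;
    while ((de = readdir(dir)) != NULL)
        if (looks_like_wal_segment(de->d_name))
            nsegments++;
    closedir(dir);

    if (nsegments > 100)                    /* arbitrary "unreasonable" threshold */
        fprintf(stderr, "WARNING: %d WAL segments waiting in pg_xlog\n",
                nsegments);
    return 0;                               /* the real copy step would go here */
}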



>
> XLog Partition
> --
>
> As Heikki pointed out, a full dedicated WAL drive is hard to fix once
> it gets full, since there's nothing you can safely delete to clear
> space, even enough for a checkpoint record.
>

Although the DBA probably wouldn't know it from reading the manual, it is
almost always safe to delete the oldest WAL file (after copying it to a
different partition just in case something goes wrong -- copying it should be
possible, since if WAL is on its own partition it is hard to
imagine you can't scrounge up 16MB on a different one), as PostgreSQL keeps
two complete checkpoints' worth of WAL around.  I think the only reason you
would not be able to recover after removing the oldest file is if the
control file is damaged such that the most recent checkpoint record
cannot be found and so it has to fall back to the previous one.  Or at
least, this is my understanding.


>
> On the other hand, it should be easy to prevent full status; we could
> simply force a non-spread checkpoint whenever the available WAL space
> gets 90% full.  We'd also probably want to be prepared to switch to a
> read-only mode if we get full enough that there's only room for the
> checkpoint records.
>

I think that that last sentence could also be applied without modification
to the "one volume" case as well.

So what would that look like?  Before accepting a (non-checkpoint) WAL
insert that fills up the current segment to a high enough level that a
checkpoint record will no longer fit, the backend must first verify that a
recycled file exists, or if not it must successfully init a new file.

If that init fails, then it must do what?  Signal for a checkpoint, release
its locks, and then ERROR out?  That would be better than a PANIC, but can
it do better?  Enter a retry loop so that once the checkpoint has finished,
and assuming it has freed up enough WAL files for recycling/removal, it
can try the original WAL insert again?
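
In pseudocode, the retry variant might look something like this (none of
these helper names exist in xlog.c; they only stand in for the steps
described above):

for (;;)
{
    /* a recycled or freshly-initialized segment, with checkpoint headroom */
    if (EnsureWalInsertHeadroom(record_size))
        break;                          /* safe to perform the insert */

    ReleaseHeldBufferLocks();           /* don't sit on locks while waiting */
    RequestCheckpoint();
    WaitForCheckpointCompletion();      /* may free segments for recycling */
}
PerformXLogInsert(record);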


Cheers,

Jeff


Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-06-06 Thread Amit Kapila
On Thursday, June 06, 2013 10:22 PM Robert Haas wrote:
> On Wed, Jun 5, 2013 at 7:24 AM, Amit Kapila 
> wrote:
> > On Monday, May 27, 2013 4:17 PM Amit Kapila wrote:
> >> On Wednesday, April 03, 2013 11:55 AM Amit Kapila wrote:
> >> > On Tuesday, April 02, 2013 9:49 PM Peter Eisentraut wrote:
> >>
> >
> > There are 2 options to proceed with this patch for 9.4:
> >
> > 1. Upload the SET PERSISTENT syntax patch for the coming CF, fixing the
> >    existing review comments
> > 2. Implement the new ALTER SYSTEM syntax as proposed in the mail below
> >
> > Could you suggest what would be the best way to proceed with this
> > patch?
> 
> I'm still in favor of some syntax involving ALTER, because it's still
> true that this behaves more like the existing GUC-setting commands
> that use ALTER (which change configuration for future sessions) rather
> than the ones that use SET (which change the current settings for some
> period of time).


I will change the patch to use the syntax below, if there are no objections:

 ALTER SYSTEM SET configuration_parameter {TO | =} {value | 'value'};


With Regards,
Amit Kapila.





Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Greg Smith

On 6/6/13 4:41 AM, Heikki Linnakangas wrote:


I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:


Combine what your submitted patch does and this idea, and you'll have 
something I prototyped a few years ago.  I took the logic and tested it 
out in user space by parsing the output from log_checkpoints to see how 
many segments were being used.  That approach coughed out a value for 
checkpoint_segments about as good as the one I picked by hand.
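
For reference, the bgwriter-style smoothing Heikki quotes above could look
roughly like this; the variable names and the 0.90 factor are mine, not
taken from his patch:

static double wal_per_cycle_estimate = 0;

static void
update_wal_estimate(double wal_used_this_cycle)
{
    if (wal_used_this_cycle > wal_per_cycle_estimate)
        wal_per_cycle_estimate = wal_used_this_cycle;        /* jump up immediately */
    else
        wal_per_cycle_estimate = 0.90 * wal_per_cycle_estimate +
                                 0.10 * wal_used_this_cycle; /* decay slowly */
}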


The main problem was that it liked to over-tune the segments based on small 
bursts of activity, leaving a value higher than you might want to use 
the rest of the time.  The background writer didn't worry about this 
very much because the cost of making a mistake for one 200ms cycle was 
pretty low.  Setting checkpoint_segments high is a more expensive issue. 
 When I set these by hand, I'll aim more to cover a 99th percentile of 
the maximum segments number rather than every worst case seen.


I don't think that improvement is worth spending very much effort on 
though.  The moving average approach is more than good enough in most 
cases.  I've wanted checkpoint_segments to go away in exactly this 
fashion for a while.


The general complaint the last time I suggested a change in this area, 
to make checkpoint_segments larger for the average user, was that some 
people had seen workloads where that was counterproductive.  Pretty sure 
Kevin Grittner said he'd seen that happen.  That's how I remember this 
general idea dying the last time, and I still don't have enough data to 
show that it doesn't happen.


As far as the UI, if it's a soft limit I'd suggest wal_size_target for 
the name.  What I would like to see is a single number here in memory 
units that replaces both checkpoint_segments and wal_keep_segments.  If 
you're willing to use a large chunk of disk space to handle either one 
of activity spikes or the class of replication issues wal_keep_segments 
targets, I don't see why you'd want to ban using that space for the 
other one too.


To put some perspective on how far we've been able to push this in the 
field with minimal gripes, the repmgr tool requires wal_keep_segments be 
>=5000, which works out to 78GB.  I still see some people use 73GB SAS 
drives in production servers for their WAL files, but that's the only 
time I've seen that number become scary when deploying repmgr. 
Meanwhile, the highest value for checkpoint_segments I've set based on 
real activity levels was 1024, on a server where checkpoint_timeout is 
15 minutes (and can be no shorter without checkpoint spikes).  At no 
point during that fairly difficult bit of tuning work did 
checkpoint_segments do anything but get in the way.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Greg Smith

On 6/6/13 4:42 AM, Joshua D. Drake wrote:


On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:


(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will
overwhelm the I/O subsystem, and slow down queries. That's what we saw
before spread checkpoints: when a checkpoint happens, the response
times of queries jumped up.


That isn't quite right. Previously we had lock issues as well and
checkpoints would take considerable time to complete. What I am talking
about is that the background writer (and wal writer where applicable)
have done all the work before a checkpoint is even called.


That is not possible, and if you look deeper at a lot of workloads 
you'll eventually see why.  I'd recommend grabbing snapshots of 
pg_buffercache output from a lot of different types of servers and see 
what the usage count distribution looks like.  That's what I did in order 
to create all of the behaviors the current background writer code caters 
to.  Attached is a small spreadsheet that shows the main two extremes 
here, from one of my old talks.  "Effective buffer cache system" is full 
of usage count 5 pages, while the "Minimally effective buffer cache" one 
is all usage count 1 or 0.  We don't have redundant systems here; we 
have two that aim at distinctly different workloads.  That's one reason 
why splitting them apart ended up being necessary to move forward, they 
really don't overlap very much on some servers.


Sampling a few servers that way was where the controversial idea of 
scanning the whole buffer pool every few minutes even without activity 
came from too.  I found a bursty real world workload where that was 
necessary to keep buffers clean usefully, and that heuristic helped them 
a lot.  I too would like to revisit the exact logic used, but I could cook 
up a test case where it's useful again if people really doubt it has any 
value.  There's one in the 2007 archives somewhere.


The reason the checkpointer code has to do this work, and it has to 
spread the writes out, is that on some systems the hot data set hits a 
high usage count.  If shared_buffers is 8GB and at any moment 6GB of it 
has a usage count of 5, which absolutely happens on many busy servers, 
the background writer will do almost nothing useful.  It won't and 
shouldn't touch buffers unless their usage count is low.  Those heavily 
referenced blocks will only be written to disk once per checkpoint cycle.


Without the spreading, in this example you will drop 6GB into "Dirty 
Memory" on a Linux server, call fdatasync, and the server might stop 
doing any work at all for *minutes* of time.  Easiest way to see it 
happen is to set checkpoint_completion_target to 0, put the filesystem 
on ext3, and have a server with lots of RAM.  I have a monitoring tool 
that graphs Dirty Memory over time because this problem is so nasty even 
with the spreading code in place.


There is this idea that pops up sometimes that a background writer write 
is better than a checkpoint one.  This is backwards.  A dirty block must 
be written at least once per checkpoint.  If you only write it once per 
checkpoint, inside of the checkpoint process, that is the ideal.  It's 
what you want for best performance when it's possible.


At the same time, some workloads churn through a lot of low usage count 
data, rather than building up a large block of high usage count stuff. 
On those your best hope for low latency is to crank up the background 
writer and let it try to stay ahead of backends with the writes.  The 
checkpointer won't have nearly as much work to do in that situation.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


bgwriter-snapshot.xls
Description: MS-Excel spreadsheet



Re: [HACKERS] SPGist "triple parity" concept doesn't work

2013-06-06 Thread Tom Lane
Greg Stark  writes:
> On Thu, Jun 6, 2013 at 10:46 PM, Tom Lane  wrote:
>> : This rule guarantees that tuples on page M will have no children on page N,
>> : since (M+1) mod 3 != N mod 3.

> Even if the invariant was maintained why doesn't that just mean you
> need three concurrent inserts to create a deadlock?

Hm, good point.  That reinforces my feeling that the page-number-based
approach isn't workable as a guarantee; though we might want to keep
that layout rule as a heuristic that would help reduce contention.

I thought a little bit about whether we could drop the requirement of
locking two tree levels during insertion descent, or at least recover
from deadlock if it did occur.  One simple fix would be to do a
ConditionalLockBuffer on the child level, and if it fails, just abandon
the insertion attempt and start over.  However that could lead to a lot
of wasted work when there's contention, so it's not terribly attractive
as-is.  Another line of thought is to lock just the single parent tuple,
not its whole page, when descending --- then we can't deadlock unless
there are actual circularities in the index.  We'd probably have to use
a heavyweight lock for that, which might be problematic for performance,
and I'm not exactly sure about timing of when to take that lock relative
to acquiring the page's buffer lock (which we'd still need).  There are
probably some other ways to attack this.

regards, tom lane




Re: [HACKERS] create a git symbolic-ref for REL9_3_STABLE

2013-06-06 Thread Peter Eisentraut
On Wed, 2013-06-05 at 08:43 +0800, amul sul wrote:
> Just want to ask, what exactly you want, is like this
> 1. you want to create symbolic -ref as _git symbolic-ref
> "refs/heads/REL9_3_STABLE" "refs/heads/master"_
> 2. which will show in _git branch_ as  
>  REL9_3_STABLE -> master
>  * master
> 3. Then you want to  checkout to REL9_3_STABLE , and pull the all
> changes pointing to master?
> 4. while actual happened then want to create  branch REL9_3_STABLE?
> 
> Is this way? 

yes





Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Jaime Casanova
On Thu, Jun 6, 2013 at 4:28 PM, Christian Ullrich  wrote:
> * Heikki Linnakangas wrote:
>
>> The current situation is that if you run out of disk space while writing
>> WAL, you get a PANIC, and the server shuts down. That's awful. We can
>
>
>> So we need to somehow stop new WAL insertions from happening, before
>> it's too late.
>
>
>> A naive idea is to check if there's enough preallocated WAL space, just
>> before inserting the WAL record. However, it's too late to check that in
>
>
> There is a database engine, Microsoft's "Jet Blue" aka the Extensible
> Storage Engine, that just keeps some preallocated log files around,
> specifically so it can get consistent and halt cleanly if it runs out of
> disk space.
>

FWIW, Informix (at least until IDS 2000, not sure after that) had the
same thing, only this was a parameter to set, and bad things happened
if you forgot about it :D

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157




Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Josh Berkus
Let's talk failure cases.

There's actually three potential failure cases here:

- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.

- XLog Partition: WAL is on its own partition/volume, and fills it up.

- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.

I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.

Archiving
-

In some ways, this is the simplest case.  Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage.  Once we stop attempting to archive, we can
clean up the unneeded log segments.

What we need is a better way for the DBA to find out that archiving is
falling behind when it first starts to fall behind.  Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.

XLog Partition
--

As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.

On the other hand, it should be easy to prevent full status; we could
simply force a non-spread checkpoint whenever the available WAL space
gets 90% full.  We'd also probably want to be prepared to switch to a
read-only mode if we get full enough that there's only room for the
checkpoint records.

One Volume
--

This is the most complicated case, because we wouldn't necessarily run
out of space because of WAL using it up.  Anything could cause us to run
out of disk space, including activity logs, swapping, pgsql_tmp files,
database growth, or some other process which writes files.

This means that the DBA getting out of disk-full manually is in some
ways easier; there's usually stuff she can delete.  However, it's much
harder -- maybe impossible -- for PostgreSQL to prevent this kind of
space outage.  There should be things we can do to make it easier for
the DBA to troubleshoot this, but I'm not sure what.

We could use a hard limit for WAL to prevent WAL from contributing to
out-of-space, but that'll only prevent a minority of cases.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Josh Berkus

>> Given the behavior of xlog, I'd want to adjust the
>> algo so that peak usage on a 24-hour basis would affect current
>> preallocation.  That is, if a site regularly has a peak from 2-3pm where
>> they're using 180 segments/cycle, then they should still be somewhat
>> higher at 2am than a database which doesn't have that peak.  I'm pretty
>> sure that the bgwriter's moving average cycles much shorter time scales
>> than that.
> 
> Makes sense. I didn't implement that in the attached, though.

It's possible that it won't matter.  Performance testing will tell us.

> Having a separate option to specify a minimum number of segments (or
> rather minimum size in MB) to keep preallocated would at least allow a
> DBA to set that manually, based on the observed peak. I didn't implement
> such a manual option in the attached, but that would be easy.

Yeah, I'd really like to get away from adding manual options which need
to be used in non-specialty cases.  I think we'll need one at some point
-- there are DB applications which are VERY bursty -- but let's not
start there and see if we can make reasonable autotuning work.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




[HACKERS] Proposal for CSN based snapshots

2013-06-06 Thread Ants Aasma
Given the recent ideas being thrown around about changing how freezing and
clog are handled and MVCC catalog access, I thought I would write out
the ideas that I have had about speeding up snapshots, in case there is
an interesting tie-in with the current discussions.

To refresh your memory, the basic idea is to change visibility
determination to be based on a commit sequence number (CSN for short)
- an 8 byte number incremented on every commit representing the total
ordering of commits. To take a snapshot in this scheme you only need
to know the value of the last assigned CSN: all transactions with a CSN
less than or equal to that number were committed at the time of the
snapshot, everything above wasn't committed. Besides speeding up
snapshot taking, this scheme can also be a building block for
consistent snapshots in a multi-master environment with minimal
communication. Google's Spanner database uses snapshots based on a
similar scheme.
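
As a minimal sketch, the visibility test then boils down to a CSN
comparison. The type and function names below (CommitSeqNo, TxidGetCSN) are
made up for illustration; the XID-to-CSN lookup they stand in for is what
the rest of this mail describes:

typedef uint64 CommitSeqNo;                     /* 8-byte commit sequence number */
#define InvalidCommitSeqNo ((CommitSeqNo) 0)    /* not committed (yet) */

static CommitSeqNo TxidGetCSN(TransactionId xid);   /* illustrative, see below */

static bool
XidVisibleInCSNSnapshot(TransactionId xid, CommitSeqNo snapshot_csn)
{
    CommitSeqNo csn = TxidGetCSN(xid);  /* dense map, sparse map or CSN log */

    /* committed, and committed before the snapshot was taken? */
    return csn != InvalidCommitSeqNo && csn <= snapshot_csn;
}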

The main tricky part about this scheme is finding the CSN that was
assigned to each XID in the face of arbitrarily long transactions and
snapshots using only a bounded amount of shared memory. The secondary
tricky part is doing this in a way that doesn't need locks for
visibility determination as that would kill any hope of a performance
gain.

We need to keep around CSN slots for all currently running
transactions and CSN slots of transactions that are concurrent with
any active CSN based snapshot (xid.csn > min(snapshot.csn)). To do
this I propose the following data structures to do the XID-to-CSN
mapping. For most recently assigned XIDs there is a ringbuffer of
slots that contain the CSN values of the XIDs or special CSN values
for transactions that haven't completed yet, aborted transactions or
subtransactions. I call this the dense CSN map. Looking up a CSN of a
XID from the ringbuffer is just a trivial direct indexing into the
ring buffer.

For long running transactions the ringbuffer may do a full circle
before a transaction commits. Such CSN slots along with slots that are
needed for older snapshots are evicted from the dense buffer into a
sorted array of XID-CSN pairs, or the sparse mapping. For locking
purposes there are two sparse buffers, one of them being active the
other inactive, more on that later. Looking up the CSN value of a XID
that has been evicted into the sparse mapping is a matter of
performing a binary search to find the slot and reading the CSN value.
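
A rough sketch of the lookup across those two structures, using the
CommitSeqNo typedef sketched earlier (every name here is illustrative, not
existing code; TransactionIdFollowsOrEquals is the usual wraparound-aware
XID comparison):

#define DENSE_MAP_SLOTS 8192                        /* illustrative ring size */

static CommitSeqNo   dense_map[DENSE_MAP_SLOTS];
static TransactionId dense_oldest_xid;              /* oldest XID still in the ring */

static CommitSeqNo sparse_map_lookup(TransactionId xid);    /* binary search */

static CommitSeqNo
LookupXidCSN(TransactionId xid)
{
    if (TransactionIdFollowsOrEquals(xid, dense_oldest_xid))
        return dense_map[xid % DENSE_MAP_SLOTS];    /* direct indexing */

    /* evicted from the dense map: search the sorted XID-CSN array */
    return sparse_map_lookup(xid);
}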

Because snapshots can remain active for an unbounded amount of time
and there can be unbounded amount of active snapshots, even the sparse
mapping can fill up. To handle that case, each backend advertises its
lowest snapshot number csn_min. When values need to be evicted from
the sparse mapping, they are evicted in CSN order and written into the
CSN log - a series of CSN-XID pairs. Backends that may still be
concerned about those values are then notified that values that they
might need to use have been evicted. Backends with invalidated
snapshots can then convert their snapshots to regular list of
concurrent XIDs snapshots at their leisure.

To convert a CSN based snapshot to XID based, a backend would first
scan the shared memory structures for xids up to snapshot.xmax for
CSNs that are concurrent to the snapshot and insert the XIDs into the
snapshot, then read in the CSN log starting from the snapshot's CSN,
looking for XIDs less than the snapshot's xmax. After this the
snapshot can be handled like current snapshots are handled.

A more detailed view of synchronization primitives required for common
operations follows.

Taking a new snapshot
-

Taking a CSN based snapshot under this scheme would consist of reading
xmin, csn and xmax from global variables, unlocked and in that order
with read barriers in between each load. If this is our oldest
snapshot we write our csn_min into pgproc, do a full memory barrier
and check from a global variable if the CSN we used is still
guaranteed to not be evicted (exceedingly unlikely but cannot be ruled
out).

The read barrier between reading xmin and csn is needed so the
guarantee applies that no transaction with tx.xid < ss.xmin could have
committed with tx.csn >= ss.csn, so xmin can be used to safely exclude
csn lookups. Read barrier between reading csn and xmax is needed to
guarantee that if tx.xid >= ss.xmax, then it's known that tx.csn >=
ss.csn. From the write side, there needs to be at least one full
memory barrier between GetNewTransactionId updating nextXid and
CommitTransaction updating nextCsn, which is quite easily satisfied.
Updating global xmin without races is slightly trickier but doable.
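
Roughly, as pseudocode (the shared-memory field names are invented for
illustration; pg_read_barrier()/pg_memory_barrier() are the existing
barrier primitives):

snap->xmin = csnShared->global_xmin;
pg_read_barrier();      /* no xid < xmin may have committed with csn >= snap->csn */
snap->csn  = csnShared->last_assigned_csn;
pg_read_barrier();      /* any xid >= xmax must end up with csn >= snap->csn */
snap->xmax = csnShared->next_xid;

if (snap->csn < MyProc->csn_min)    /* this is now our oldest snapshot */
{
    MyProc->csn_min = snap->csn;    /* advertise it */
    pg_memory_barrier();
    if (csnShared->oldest_retained_csn > snap->csn)
        ...                         /* retry: our CSN was already evicted
                                     * (exceedingly unlikely) */
}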

Checking visibility
---

XidInMVCCSnapshot will need to switch on the type of snapshot used
after checking xmin and xmax. For list-of-XIDs snapshots the current
logic applies. CSN based snapshots need to first do an unlocked read
from a backend specific global variable to check if we have been
instructed to convert our snapshots to xid based, i

Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Greg Stark
On Thu, Jun 6, 2013 at 10:38 PM, Andres Freund  wrote:
> That's not a bad technique. I wonder how reliable it would be in
> postgres. Do all filesystems allow a rename() to succeed if there isn't
> actually any space left? E.g. on btrfs I wouldn't be sure.  We need to
> rename because WAL files need to be named after the LSN and timelineid...

I suppose we could just always do the rename at the same time as
setting up the current log file. That is, when we start wal log x also
set up wal file x+1 at that time.

This isn't actually guaranteed to be enough btw. It's possible that
the record we're actively about to write will require all of both
those files... But that should be very unlikely.


-- 
greg




Re: [HACKERS] SPGist "triple parity" concept doesn't work

2013-06-06 Thread Greg Stark
On Thu, Jun 6, 2013 at 10:46 PM, Tom Lane  wrote:
>  To prevent
> : deadlocks we introduce a concept of "triple parity" of pages: if inner tuple
> : is on page with BlockNumber N, then its child tuples should be placed on the
> : same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 
> 3.
> : This rule guarantees that tuples on page M will have no children on page N,
> : since (M+1) mod 3 != N mod 3.

Even if the invariant was maintained why doesn't that just mean you
need three concurrent inserts to create a deadlock?
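
For instance (a made-up arrangement, not taken from the bug report): let
pages A, B and C have BlockNumbers with parities 0, 1 and 2.  Downlinks
A -> B, B -> C and C -> A each individually satisfy (N+1) mod 3 == M mod 3,
so the rule allows all of them, yet three insertions descending A -> B,
B -> C and C -> A at the same time lock the pages in a cycle.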


-- 
greg




Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Greg Stark
On Thu, Jun 6, 2013 at 1:39 PM, Heikki Linnakangas
 wrote:
> That will keep OldestXmin from advancing. Which will keep vacuum from
> advancing relfrozenxid/datfrozenxid. Which will first trigger the warnings
> about wrap-around, then stops new XIDs from being generated, and finally a
> forced shutdown.
>
> The forced shutdown will actually happen some time before going beyond 2
> billion XIDs. So it is not possible to have a long-lived transaction, older
> than 2 B XIDs, still live in the system. But let's imagine that you somehow
> bypass the safety mechanism:

Ah, so if you do the epoch in the page header thing or Robert's LSN
trick that I didn't follow then you'll need a new safety check against
this. Since relfrozenxid/datfrozenxid will no longer be necessary.

-- 
greg




[HACKERS] SPGist "triple parity" concept doesn't work

2013-06-06 Thread Tom Lane
I've been looking into the problem reported at
http://www.postgresql.org/message-id/519a5917.40...@qunar.com
and what I find is that we have spgist insertion operations deadlocking
against each other because one is descending from page A to page B while
the other descends from page B to page A.  According to the README file,
the "triple parity" page selection algorithm is supposed to prevent
that:

: While descending the tree, the insertion algorithm holds exclusive lock on
: two tree levels at a time, ie both parent and child pages (parent and child
: pages can be the same, see notes above). There is a possibility of deadlock
: between two insertions if there are cross-referenced pages in different
: branches.  That is, if inner tuple on page M has a child on page N while
: an inner tuple from another branch is on page N and has a child on page M,
: then two insertions descending the two branches could deadlock.  To prevent
: deadlocks we introduce a concept of "triple parity" of pages: if inner tuple
: is on page with BlockNumber N, then its child tuples should be placed on the
: same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 3.
: This rule guarantees that tuples on page M will have no children on page N,
: since (M+1) mod 3 != N mod 3.

That would work fine as long as the invariant is maintained accurately.
However, there are at least two cases where the existing code fails to
maintain the invariant:

1. In spgAddNodeAction, if the enlarged inner tuple doesn't fit on the
current page anymore, we do this:

/*
 * obtain new buffer with the same parity as current, since it will be
 * a child of same parent tuple
 */
current->buffer = SpGistGetBuffer(index,
  GBUF_INNER_PARITY(current->blkno),
  ...

That's fine as long as the parent tuple wasn't also on the current page.
If it was on the current page, we end up re-downlinking the parent to a
page having the same parity it has, not one more as it should be.

I tried to fix this like so:

/*
 * get a new buffer that has the right parity to store a child of
 * the current tuple's parent
 */
current->buffer = SpGistGetBuffer(index,
  GBUF_INNER_PARITY(parent->blkno + 1),
  ...

but that just moves the problem somewhere else: the link from the parent
to the new inner tuple is now guaranteed to follow the parity rules, but
the downlinks leading out of it don't follow them anymore.

2. In spgSplitNodeAction, we split an inner tuple into a "prefix" tuple
that replaces that inner tuple, and a "postfix" tuple that contains the
same downlinks the original tuple did.  That's fine as long as we can
fit the postfix tuple on the same page.  If we can't, we assign it to a
page that's one parity level below the current page, and then its
outgoing links violate the parity rules.  (Keeping the postfix tuple
on the current page wouldn't make things better, since we'd still
violate the parity rules with respect to either the incoming or outgoing
links of the prefix tuple, if it has to go to another page.)

With a few repetitions of either of these cases, and some bad luck
about placement of the new tuples, you get into situations where two
pages each contain downlinks leading to the other; and then a deadlock
is just a matter of time.

I don't immediately see any good way to fix this.  I think the "triple
parity" rule as it stands is hopelessly broken, but I don't know what
to replace it with, even granting that we don't need to maintain on-disk
compatibility.  (We'd have to tell people to reindex SPGist indexes
anyway, because of the risk that they already contain circular links;
so switching to a new layout rule doesn't seem to add any more pain.)
Or we could try to modify the insertion algorithm so it doesn't lock
two levels of the tree at once, but that seems pretty risky.

Thoughts?

regards, tom lane




Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Andres Freund
On 2013-06-06 23:28:19 +0200, Christian Ullrich wrote:
> * Heikki Linnakangas wrote:
> 
> >The current situation is that if you run out of disk space while writing
> >WAL, you get a PANIC, and the server shuts down. That's awful. We can
> 
> >So we need to somehow stop new WAL insertions from happening, before
> >it's too late.
> 
> >A naive idea is to check if there's enough preallocated WAL space, just
> >before inserting the WAL record. However, it's too late to check that in
> 
> There is a database engine, Microsoft's "Jet Blue" aka the Extensible
> Storage Engine, that just keeps some preallocated log files around,
> specifically so it can get consistent and halt cleanly if it runs out of
> disk space.
> 
> In other words, the idea is not to check over and over again that there is
> enough already-reserved WAL space, but to make sure there always is by
> having a preallocated segment that is never used outside a disk space
> emergency.

That's not a bad technique. I wonder how reliable it would be in
postgres. Do all filesystems allow a rename() to succeed if there isn't
actually any space left? E.g. on btrfs I wouldn't be sure.  We need to
rename because WAL files need to be named after the LSN and timelineid...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-06 Thread Jeff Janes
On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer  wrote:

> ->
> "I'll whack in some manual VACUUM cron jobs during low load maintenance
> hours and hope that keeps the worst of the problem away, that's what
> random forum posts on the Internet say to do".
> -> "oh my, why did my DB just do an emergency shutdown?"
>


This one doesn't make much sense to me, unless they mucked around with
autovacuum_freeze_max_age as well as turning autovacuum itself off (common
practice?).  With the default setting of autovacuum_freeze_max_age, if it
can't complete the anti-wraparound before emergency shutdown with autovac
off, it probably would not have completed it with autovac on, either.

Cheers,

Jeff


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Christian Ullrich

* Heikki Linnakangas wrote:


The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can



So we need to somehow stop new WAL insertions from happening, before
it's too late.



A naive idea is to check if there's enough preallocated WAL space, just
before inserting the WAL record. However, it's too late to check that in


There is a database engine, Microsoft's "Jet Blue" aka the Extensible 
Storage Engine, that just keeps some preallocated log files around, 
specifically so it can get consistent and halt cleanly if it runs out of 
disk space.


In other words, the idea is not to check over and over again that there 
is enough already-reserved WAL space, but to make sure there always is 
by having a preallocated segment that is never used outside a disk space 
emergency.
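
A minimal sketch of that reserve-segment idea (the reserve file name and
calling convention are invented; this is not how PostgreSQL behaves today):

#include <stdio.h>

/*
 * Give the always-kept spare segment the name of the segment we failed to
 * create because the disk is full.  rename() normally needs no extra space,
 * which is the property being relied on -- and, as noted elsewhere in the
 * thread, the assumption worth double-checking per filesystem.
 */
static int
promote_reserve_segment(const char *needed_segment_path)
{
    return rename("pg_xlog/RESERVED_SEGMENT", needed_segment_path);
}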


--
Christian






Re: [HACKERS] Cost limited statements RFC

2013-06-06 Thread Andres Freund
On 2013-06-06 12:34:01 -0700, Jeff Janes wrote:
> On Fri, May 24, 2013 at 11:51 AM, Greg Smith  wrote:
> 
> > On 5/24/13 9:21 AM, Robert Haas wrote:
> >
> >> But I wonder if we wouldn't be better off coming up with a little more
> >> user-friendly API.  Instead of exposing a cost delay, a cost limit,
> >> and various charges, perhaps we should just provide limits measured in
> >> KB/s, like dirty_rate_limit = <amount of data you can dirty per
> >> second, in kB> and read_rate_limit = <amount of data you can read into
> >> shared buffers per second, in kB>.
> >>
> >
> > I already made and lost the argument for doing vacuum in KB/s units, so I
> > wasn't planning on putting that in the way of this one.
> 
> 
> I think the problem is that making that change would force people to
> relearn something that was already long established, and it was far from
> clear that the improvement, though real, was big enough to justify forcing
> people to do that.

I don't find that argument very convincing. Since you basically can
translate the current variables into something like the above variables
with some squinting we sure could have come up with some way to keep the
old definition and automatically set the new GUCs and the other way
round. guc.c should even have enough information to prohibit setting
both in the config file...
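
To make the squinting concrete: with the long-standing defaults
(vacuum_cost_limit = 200, vacuum_cost_page_miss = 10, vacuum_cost_page_dirty
= 20, autovacuum_vacuum_cost_delay = 20ms) and 8kB pages, autovacuum gets a
budget of 200 cost units per 20ms, i.e. 10000 units/s.  If every page were a
dirtied page that is 10000/20 = 500 pages/s, roughly 3.9 MB/s; if every page
were a read miss it is 10000/10 = 1000 pages/s, roughly 7.8 MB/s.  Actual
workloads land somewhere on the line between those extremes, which is the
linear-combination behaviour discussed elsewhere in this thread.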

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-06 Thread Jeff Janes
On Mon, Jun 3, 2013 at 6:34 AM, Kevin Grittner  wrote:

>
>
> Where I hit a nightmare scenario with an anti-wraparound
> autovacuum, personally, was after an upgrade using pg_dump piped to
> psql.  At a high OLTP transaction load time (obviously the most
> likely time for it to kick in, because it is triggered by xid
> consumption), it started to READ AND REWRITE every heap page of
> every table.  This overwhelmed the battery-backed write cache,
> causing a series of "freezes" for a few minutes at a time, raising
> a very large number of end-user complaints.



But this is only after autovacuum_vacuum_cost_delay was already changed to
zero, right?  It is hard to imagine the write cache being overwhelmed by
the default setting, or even substantially more aggressive than the default
but still not zero.  Anti-wraparound vacuums should generate almost purely
sequential writes (at least if only btree indexes exist), so they should
clear very quickly.


> > "I'll whack in some manual VACUUM cron jobs during low load maintenance
> > hours and hope that keeps the worst of the problem away, that's what
> > random forum posts on the Internet say to do".
> > -> "oh my, why did my DB just do an emergency shutdown?"
>
> Yeah, I've seen exactly that sequence, and some variations on it
> quite often.  In fact, when I was first using PostgreSQL I got as
> far as "Maybe I didn't solve the autovacuum thing" but instead of
> "I'll just turn it off" my next step was "I wonder what would
> happen if I tried making it *more* aggressive so that it didn't
> have so much work to do each time it fired?"  Of course, that
> vastly improved things.  I have found it surprisingly difficult to
> convince other people to try that, though.
>

What is it you changed?  Either an anti-wraparound happens, or it does not,
so I'm not sure what you mean about making it more aggressive so there is
less to do.  It always has to do the whole thing.  Was it the
autovacuum_vacuum_scale_factor that you changed?

Cheers,

Jeff


Re: [HACKERS] Partitioning performance: cache stringToNode() of pg_constraint.ccbin

2013-06-06 Thread Noah Misch
On Thu, Jun 06, 2013 at 07:02:27PM +0530, Amit Kapila wrote:
> On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote:

> This patch can give a good performance gain in the scenario described by you.
> In fact I had taken the readings with the patch; it shows a similar gain.

Thanks for testing.

> This patch would increase the relcache size, as for each constraint on table
> it would increase 4 bytes irrespective of whether that can give performance
> benefit or not.

Yep, sizeof(Node *) bytes.

> Why in function CheckConstraintFetch(), the node is not formed from string?

That's to avoid the cost of stringToNode() if the field is not needed during
the life of the cache entry.

> > Some call sites need to modify the node tree, so the patch has them do
> > copyObject().  I ran a microbenchmark of copyObject() on the cached
> > node tree vs. redoing stringToNode(), and copyObject() still won by a
> > factor of four.
> 
> I have not tried any performance run to measure if extra copyObject() has
> added any benefit.

Callers must not modify the cache entry; those that would otherwise do so must
use copyObject() first.  I benchmarked to ensure they wouldn't be better off
ignoring the cached node tree and calling stringToNode() themselves.

> What kind of benchmark you use to validate it?

It boiled down to comparing runtime of loops like these:

while (n-- > 0)
copyObject(RelationGetConstraint(r, 0));

while (n-- > 0)
stringToNode(r->rd_att->constr->check[0].ccbin);

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-06 Thread Jeff Janes
On Mon, Jun 3, 2013 at 5:03 AM, Craig Ringer  wrote:

> On 06/02/2013 05:56 AM, Robert Haas wrote:
>


> > (b) users
> > making ridiculous settings changes to avoid the problems caused by
> > anti-wraparound vacuums kicking in at inconvenient times and eating up
> > too many resources.
>
> Some recent experiences I've had have also bought home to me that vacuum
> problems are often of the user's own making.
>
> "My database is slow"
> ->
> "This autovacuum thing is using up lots of I/O and CPU, I'll increase
> this delay setting here"
>


Do you think this was the correct diagnosis but with the wrong action
taken, or was the diagnosis incorrect in the first place (i.e. it may be
using some IO and CPU, but that isn't what was  causing the initial
problem)?  And if the diagnosis was correct, was it causing problems under
default settings, or only because they already turned off the cost delay?

Cheers,

Jeff


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 20:24, Josh Berkus wrote:

Yeah, something like that :-). I was thinking of letting the estimate
decrease like a moving average, but react to any increases immediately.
Same thing we do in bgwriter to track buffer allocations:


Seems reasonable.


Here's a patch implementing that. Docs not updated yet. I did not change 
the way checkpoint_segments triggers checkpoints - that can be a 
separate patch. This only decouples the segment preallocation behavior 
from checkpoint_segments. With the patch, you can set 
checkpoint_segments really high, without consuming that much disk space 
all the time.



Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation.  That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak.  I'm pretty
sure that the bgwriter's moving average cycles much shorter time scales
than that.


Makes sense. I didn't implement that in the attached, though.

Having a separate option to specify a minimum number of segments (or 
rather minimum size in MB) to keep preallocated would at least allow a 
DBA to set that manually, based on the observed peak. I didn't implement 
such a manual option in the attached, but that would be easy.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40b780c..5244ce1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -89,18 +89,11 @@ bool		XLOG_DEBUG = false;
 #endif
 
 /*
- * XLOGfileslop is the maximum number of preallocated future XLOG segments.
- * When we are done with an old XLOG segment file, we will recycle it as a
- * future XLOG segment as long as there aren't already XLOGfileslop future
- * segments; else we'll delete it.  This could be made a separate GUC
- * variable, but at present I think it's sufficient to hardwire it as
- * 2*CheckPointSegments+1.	Under normal conditions, a checkpoint will free
- * no more than 2*CheckPointSegments log segments, and we want to recycle all
- * of them; the +1 allows boundary cases to happen without wasting a
- * delete/create-segment cycle.
+ * Estimated distance between checkpoints, in bytes, and measured distance of
+ * previous checkpoint cycle.
  */
-#define XLOGfileslop	(2*CheckPointSegments + 1)
-
+static double CheckPointDistanceEstimate = 0;
+static double PrevCheckPointDistance = 0;
 
 /*
  * GUC support
@@ -668,7 +661,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
+static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
@@ -1458,11 +1451,85 @@ AdvanceXLInsertBuffer(bool new_segment)
 }
 
 /*
+ * XLOGfileslop is the maximum number of preallocated future XLOG segments.
+ * When we are done with an old XLOG segment file, we will recycle it as a
+ * future XLOG segment as long as there aren't already XLOGfileslop future
+ * segments; else we'll delete it.
+ */
+static int
+XLOGfileslop(XLogRecPtr PriorRedoPtr, XLogRecPtr CurrPtr)
+{
+	double		nsegments;
+	double		targetPtr;
+	double		distance;
+
+	/*
+	 * The number segments to preallocate/recycle is based on two things:
+	 * an estimate of how much WAL is consumed between checkpoints, and the
+	 * current distance from the prior checkpoint (ie. the point at which
+	 * we're about to truncate the WAL) to the current WAL insert location.
+	 *
+	 * First, calculate how much WAL space the system would need, if it ran
+	 * steady, using the estimated amount of WAL generated between every
+	 * checkpoint cycle. Then see how much WAL is actually in use at the moment
+	 * (= the distance between Prior redo pointer and current WAL insert
+	 * location). The difference between the two is how much WAL we should keep
+	 * preallocated, so that backends won't have to create new WAL segments.
+	 *
+	 * The reason we do these calculations from the prior checkpoint, not the
+	 * one that just finished, is that this behaves better if some checkpoint
+	 * cycles are abnormally short, like if you perform a manual checkpoint
+	 * right after a timed one. The manual checkpoint will make almost
+	 * a full cycle's worth of WAL segments available for recycling, because
+	 * the segments from the prior's prior, fully-sized checkpoint cycle are
+	 * no longer needed. However, the next checkpoint will make only few
+	 * segments available for recycling, the ones generated between the timed
+	 * checkpoint and the manual one right after that. If at the manu

Re: [HACKERS] Cost limited statements RFC

2013-06-06 Thread Robert Haas
On Thu, Jun 6, 2013 at 3:34 PM, Jeff Janes  wrote:
> On Fri, May 24, 2013 at 11:51 AM, Greg Smith  wrote:
>>
>> On 5/24/13 9:21 AM, Robert Haas wrote:
>>
>>> But I wonder if we wouldn't be better off coming up with a little more
>>> user-friendly API.  Instead of exposing a cost delay, a cost limit,
>>> and various charges, perhaps we should just provide limits measured in
>>> KB/s, like dirty_rate_limit = <amount of data you can dirty per
>>> second, in kB> and read_rate_limit = <amount of data you can read into
>>> shared buffers per second, in kB>.
>>
>>
>> I already made and lost the argument for doing vacuum in KB/s units, so I
>> wasn't planning on putting that in the way of this one.
>
>
> I think the problem is that making that change would force people to relearn
> something that was already long established, and it was far from clear that
> the improvement, though real, was big enough to justify forcing people to do
> that.  That objection would not apply to a new feature, as there would be
> nothing to re-learn.  The other objection was that (at that time) we had
> some hope that the entire workings would be redone for 9.3, and it seemed
> unfriendly to re-name things in 9.2 without much change in functionality,
> and then redo them completely in 9.3.

Right.  Also, IIRC, the limits didn't really mean what they purported
to mean.  You set either a read or a dirty rate in KB/s, but what was
really limited was the combination of the two, and the relative
importance of the two factors was based on other settings in a
severely non-obvious way.

If we can see our way clear to ripping out the autovacuum costing
stuff and replacing them with a read rate limit and a dirty rate
limit, I'd be in favor of that.  The current system limits the linear
combination of those with user-specified coefficients, which is more
powerful but less intuitive.  If we need that, we'll have to keep it
the way it is, but I'm hoping we don't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Jeff Janes
On Thu, Jun 6, 2013 at 1:42 AM, Joshua D. Drake wrote:

>
>
> I may be confused but it is my understanding that bgwriter writes out the
> data from the shared buffer cache that is dirty based on an interval and a
> max pages written.


It primarily writes out based on how many buffers have recently needed to
be evicted in order to make room to read in new ones.  There are secondary
clamp limits based on an interval (it does enough work to circle the buffer
pool once every 2 minutes) and another on max pages written but the main
one is based on recent usage.  I've never really understood the point of
those secondary clamps.


>>> This makes sense except I don't see a need for the parameter. Why not
>>> just specify how the algorithm works and adhere to that without the need
>>> for another GUC?
>>>
>>
>> Because you want to limit the amount of disk space used for WAL. It's a
>> soft limit, but still.
>>
>>
> Why? This is the point that confuses me. Why do we care? We don't care how
> much disk space PGDATA takes... why do we all of a sudden care about
> pg_xlog?



Presumably someone cares about disk space of PGDATA, but it is probably a
different person, at a different time, on a different time scale.  PGDATA
is a long term planning issue, pg_xlog is an operational issue.  If the
checkpoint had completed 30 seconds earlier or the archive_command had
completed 30 seconds earlier (or the commit rate had been throttled for 30
seconds), then pg_xlog would not have run out of space in the first place.
 Having averted the crisis, maybe it will never arise again, or maybe it
will but we will be able to avoid it again.  If we delay running out of
room on PGDATA for 30 seconds, well, we still ran out of room.

Cheers,

Jeff


Re: [HACKERS] Cost limited statements RFC

2013-06-06 Thread Jeff Janes
On Fri, May 24, 2013 at 11:51 AM, Greg Smith  wrote:

> On 5/24/13 9:21 AM, Robert Haas wrote:
>
>> But I wonder if we wouldn't be better off coming up with a little more
>> user-friendly API.  Instead of exposing a cost delay, a cost limit,
>> and various charges, perhaps we should just provide limits measured in
>> KB/s, like dirty_rate_limit = <amount of data you can dirty per
>> second, in kB> and read_rate_limit = <amount of data you can read into
>> shared buffers per second, in kB>.
>>
>
> I already made and lost the argument for doing vacuum in KB/s units, so I
> wasn't planning on putting that in the way of this one.


I think the problem is that making that change would force people to
relearn something that was already long established, and it was far from
clear that the improvement, though real, was big enough to justify forcing
people to do that.  That objection would not apply to a new feature, as
there would be nothing to re-learn.  The other objection was that (at that
time) we had some hope that the entire workings would be redone for 9.3,
and it seemed unfriendly to re-name things in 9.2 without much change in
functionality, and then redo them completely in 9.3.

Cheers,

Jeff


Re: [HACKERS] [PATCH]Tablesample Submission

2013-06-06 Thread Simon Riggs
On 18 September 2012 10:32, Hitoshi Harada  wrote:

> As wiki says, BERNOULLI relies on the statistics of the table, which
> doesn't sound good to me.  Of course we could say this is our
> restriction and say good-bye to users who hadn't run ANALYZE first,
> but it is too hard for a normal users to use it.  We may need
> quick-and-rough count(*) for this.

For Bernoulli sampling, SQL Standard says "Further, whether a given
row of RT is included in result of TF is independent of whether other
rows of RT are included in result of TF."

Which means BERNOULLI sampling looks essentially identical to using

  WHERE random() <= ($percent/100)

So my proposed implementation route for bernoulli sampling is to
literally add an AND-ed qual that does a random() test (and
repeatability also). That looks fairly simple and it is still
accurate, because it doesn't matter whether we do the indpendent test
to include the tuple before or after any other quals. I realise that
isn't a cool and hip approach, but it works and is exactly accurate.
Which would change the patch quite a bit.

Taking the random() approach would mean we don't rely on statistics either.
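
To illustrate the repeatability part, the include/exclude decision can be made
a pure function of a seed and the row's physical position, so the same seed
reproduces the same sample. A self-contained sketch (mix() is a stand-in for
any decent integer hash, not an existing function):

    #include <stdint.h>

    static uint32_t
    mix(uint32_t seed, uint32_t block, uint32_t offset)
    {
        uint32_t h = seed ^ (block * 2654435761u) ^ (offset * 40503u);

        h ^= h >> 16;
        h *= 2246822519u;
        h ^= h >> 13;
        return h;
    }

    /* include the tuple at (block, offset) with probability percent/100 */
    static int
    bernoulli_include(double percent, uint32_t seed, uint32_t block, uint32_t offset)
    {
        double u = mix(seed, block, offset) / (double) UINT32_MAX;

        return u <= percent / 100.0;
    }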

Thoughts?


SYSTEM sampling uses a completely different approach and is the really
interesting part of this feature.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-06 Thread Jim Nasby

On 6/5/13 3:49 PM, Robert Haas wrote:

Now, I did find a couple that I thought should probably stick with
SnapshotNow, specifically pgrowlocks and pgstattuple.


FWIW, I've often wished for a way to make all stat access transactional, across all the 
stats views. Perhaps that couldn't be done by default, but I'd love something like a 
function that would make a "snapshot" of all stats data as of one point. Even 
if that snapshot itself wasn't completely atomic, at least then you could query any stats 
views however you wanted and know that the info wasn't changing over time.

The reason I don't think this would work so well if done in userspace is how 
long it would take. Presumably making a complete backend-local copy of pg_proc 
etc and the stats file would be orders of magnitude faster than a bunch of 
CREATE TEMP TABLE's.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Jeff Janes
On Wed, Jun 5, 2013 at 8:20 PM, Joshua D. Drake wrote:

>
> On 06/05/2013 05:37 PM, Robert Haas wrote:
>
>  - If it looks like we're going to exceed limit #3 before the
>> checkpoint completes, we start exerting back-pressure on writers by
>> making them wait every time they write WAL, probably in proportion to
>> the number of bytes written.  We keep ratcheting up the wait until
>> we've slowed down writers enough that will finish within limit #3.  As
>> we reach limit #3, the wait goes to infinity; only read-only
>> operations can proceed until the checkpoint finishes.
>>
>
> Alright, perhaps I am dense. I have read both this thread and the other
> one on better handling of archive command (http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
> I recognize there are brighter minds than mine on this thread but I just
> honestly don't get it.
>
> 1. WAL writes are already fast. They are the fastest write we have because
> it is sequential.
>
> 2. We don't want them to be slow. We want data written to disk as quickly
> as possible without adversely affecting production. That's the point.
>

If speed of archiving is the fundamental bottleneck on the system, how does
that bottleneck get communicated forward to the user?  PANICs are a
horrible way of doing it; throttling the writing of WAL (and hence the
acceptance of COMMITs) seems like a reasonable alternative.  Maybe speed
of archiving is not the fundamental bottleneck on your systems, but...


>
> 3. The spread checkpoints have always confused me. If anything we want a
> checkpoint to be fast and short because:
>
> 4. Bgwriter. We should be adjusting bgwriter so that it is writing
> everything in a manner that allows any checkpoint to be in the range of
> never noticed.
>


They do different things.  One writes buffers out to make room for incoming
ones.  One writes them out (and fsyncs the underlying files) to allow redo
pointer to advance (limiting soft recovery time) and xlogs to be recycled
(limiting disk space).


>
> Now perhaps my customers workloads are different but for us:
>
> 1. Checkpoint timeout is set as high as reasonable, usually 30 minutes to
> an hour. I wish I could set them even further out.
>

Yeah, I think the limit of 1 hr is rather nanny-ish.  I know what I'm
doing, and I want the freedom to go longer if that is what I want to do.



> 2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
> based on an actual amount of IO bandwidth it may take per second based on
> their IO constraints. (Note I know that wal_writer comes into play here but
> I honestly don't remember where and am reading up on it to refresh my
> memory).
>


I find bgwriter to be almost worthless, at least since the fsync queue
compaction code went in.  When io is free-flowing the kernel accepts writes
almost instantaneously, and so the backends can write out dirty buffers
themselves very quickly and it is not worth off-loading to a background
process.  When IO is constipated, it would be worth off-loading except in
those circumstances the bgwriter cannot possibly keep up.


>
> 3. The biggest issue we see with checkpoint segments is not running out of
> space because really 10GB is how many checkpoint segments? It is with
> wal_keep_segments. If we don't want to fill up the pg_xlog directory, put
> the wal logs that are for keep_segments elsewhere.
>

Which is what archiving does.  But then you have to put a lot of thought
into how to clean up the archive, assuming your policy is not to keep it
forever.  keep_segments can be a nice compromise.


>
> Other oddities:
>
> Yes checkpoint_segments is awkward. We shouldn't have to set it at all. It
> should be gone. Basically we start with X amount perhaps to be set at
> initdb time. That X amount changes dynamically based on the amount of data
> being written. In order to not suffer from recycling and creation penalties
> we always keep X+N where N is enough to keep up with new data.
>
> Along with the above, I don't see any reason for checkpoint_timeout.
> Because of bgwriter we should be able to rather indefinitely not worry
> about checkpoints (with a few exceptions such as pg_start_backup()).
> Perhaps a setting that causes a checkpoint to happen based on some
> non-artificial threshold (timeout) such as amount of data currently in need
> of a checkpoint?
>

Without checkpoints, how would the redo pointer ever advance?

If the system is io limited during recovery, then checkpoint_segments is a
fairly natural way to put a limit on how long recovery from a soft crash
will take.   If the system is CPU limited during recovery, then
checkpoint_timeout is a fairly natural way to put a limit on how long
recovery will take.  It is probably possible to come with a single merged
setting that is better than both of those in almost a

Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Josh Berkus

>> Then I suggest we not use exactly that name.  I feel quite sure we
>> would get complaints from people if something labeled as "max" was
>> exceeded -- especially if they set that to the actual size of a
>> filesystem dedicated to WAL files.
> 
> You're probably right. Any suggestions for a better name?
> wal_size_soft_limit?

"checkpoint_size_limit", or something similar.  That is, what you're
defining is:

"this is the size at which we trigger a checkpoint even if
checkpoint_timeout has not been exceeded".

However, I think it's worth considering: if we're doing this "sizing
checkpoints based on prior cycles" thing, do we really need a size_limit
*at all* for most users?   I can see how a hard limit is useful, but not
how a soft limit is.

Most of our users most of the time don't care how large WAL is as long
as it doesn't exceed disk space.  And on most databases, hitting
checkpoint_timeout is more frequent than hitting checkpoint_segments --
at least in my substantial performance-tuning experience.  So I think
most users would prefer a setting which essentially says "make WAL as
big as it has to be in order to maximize throughput", and wouldn't worry
about the disk space.

>
> Yeah, something like that :-). I was thinking of letting the estimate
> decrease like a moving average, but react to any increases immediately.
> Same thing we do in bgwriter to track buffer allocations:

Seems reasonable.  Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation.  That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak.  I'm pretty
sure that the bgwriter's moving average cycles over much shorter time scales
than that.
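
A minimal sketch of that "react to increases immediately, decay slowly" rule,
assuming the quantity tracked is WAL segments used per checkpoint cycle (names
are illustrative):

    static double smoothed_segs = 0;

    static double
    update_wal_estimate(double segs_used_this_cycle)
    {
        if (segs_used_this_cycle >= smoothed_segs)
            smoothed_segs = segs_used_this_cycle;    /* jump up immediately */
        else
            smoothed_segs += (segs_used_this_cycle - smoothed_segs) * 0.1;  /* decay slowly */
        return smoothed_segs;
    }

Capturing the 24-hour peak behavior described above would mean keeping a
second, much slower-decaying term alongside this one.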

>> Well, the ideal unit from the user's point of view is *time*, not space.
>>   That is, the user wants the master to keep, say, "8 hours of
>> transaction logs", not any amount of MB.  I don't want to complicate
>> this proposal by trying to deliver that, though.
>
> OTOH, if you specify it in terms of time, then you don't have any limit
> on the amount of disk space required.

Well, the best setup from my perspective as a remote DBA for a lot of
clients would be two-factor:

wal_keep_time: ##hr
wal_keep_size_limit: ##GB

That is, we would try to keep ##hr of WAL around for the standbys,
unless that amount exceeded ##GB (at which point we'd write a warning to
the logs).  If max_wal_size was a hard limit, we wouldn't need
wal_keep_size_limit, of course.

However, to some degree Andres' work will render all this
wal_keep_segments stuff obsolete by letting the master track what
segment was last consumed by each replica, so I don't think it's worth
pursuing this line of thinking a lot further.

In any case, I'm just pointing out that we need to think of
wal_keep_segments as part of the total WAL size, and not as something
separate, because that's confusing our users.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Multiple error reports.

2013-06-06 Thread Dmitriy Igrishin
According to http://www.postgresql.org/docs/9.2/static/libpq-async.html
"Even when PQresultStatus indicates a fatal error, PQgetResult should be
called
until it returns a null pointer, to allow libpq to process the error
information completely."
In libpq/fe-exec.c:PQexecFinish(), error messages are merged if more than one
error is retrieved from the backend.
Can you please explain in what cases the backend returns multiple error
reports, and whether it is guaranteed that all of them carry the same SQLSTATE
code? If so, I would also like to know the reason for sending the same error
in multiple error reports.
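
For reference, the drain loop that the quoted documentation describes looks
roughly like this (illustrative only):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* after PQsendQuery(), keep calling PQgetResult() until it returns NULL,
     * even when a fatal error has already been reported */
    static void
    drain_results(PGconn *conn)
    {
        PGresult *res;

        while ((res = PQgetResult(conn)) != NULL)
        {
            if (PQresultStatus(res) == PGRES_FATAL_ERROR)
                fprintf(stderr, "error: %s", PQresultErrorMessage(res));
            PQclear(res);
        }
    }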

-- 
// Dmitriy.


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Josh Berkus
Daniel,

So your suggestion is that if archiving is falling behind, we should
introduce delays on COMMIT in order to slow down the rate of WAL writing?

Just so I'm clear.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON and unicode surrogate pairs

2013-06-06 Thread Robert Haas
On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan  wrote:
> In 9.2, the JSON parser didn't check the validity of the use of unicode
> escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
> that is still the case. However, the JSON accessor functions and operators
> also try to turn JSON strings into text in the server encoding, and this
> includes de-escaping \u sequences. This works fine except when there is a
> pair of sequences representing a UTF-16 type surrogate pair, something that
> is explicitly permitted in the JSON spec.
>
> The attached patch is an attempt to remedy that, and a surrogate pair is
> turned into the correct code point before converting it to whatever the
> server encoding is.
>
> Note that this would mean we can still put JSON with incorrect use of
> surrogates into the database, as now (9.2 and later), and they will cause
> almost all the accessor functions to raise an error, as now (9.3). All this
> does is allow JSON that uses surrogates correctly not to fail when applying
> the accessor functions and operators. That's a possible violation of POLA,
> and at least worth of a note in the docs, but I'm not sure what else we can
> do now - adding this check to the input lexer would possibly cause restores
> to fail, which users might not thank us for.

I think the approach you've proposed here is a good one.
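
For readers following along, turning a correctly-used surrogate pair (two
\uXXXX escapes) into a single code point is just a bit of arithmetic; a sketch
of the idea, not the patch's actual code:

    /* caller has verified 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF */
    static unsigned int
    combine_surrogates(unsigned int hi, unsigned int lo)
    {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

The resulting code point can then be converted to the server encoding like any
other \u escape.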

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-06-06 Thread Robert Haas
On Wed, Jun 5, 2013 at 7:24 AM, Amit Kapila  wrote:
> On Monday, May 27, 2013 4:17 PM Amit Kapila wrote:
>> On Wednesday, April 03, 2013 11:55 AM Amit Kapila wote:
>> > On Tuesday, April 02, 2013 9:49 PM Peter Eisentraut wrote:
>>
>
> There are 2 options to proceed for this patch for 9.4
>
> 1. Upload the SET PERSISTENT syntax patch for coming CF by fixing existing
> review comments
> 2. Implement new syntax ALTER SYSTEM as proposed in below mail
>
> Could you suggest me what could be best way to proceed for this patch?

I'm still in favor of some syntax involving ALTER, because it's still
true that this behaves more like the existing GUC-setting commands
that use ALTER (which change configuration for future sessions) rather
the ones that use SET (which change the current settings for some
period of time).

But I can't speak for people who are not me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-06 Thread Robert Haas
On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund  wrote:
>> + * XXX: Now that we have MVCC catalog access, the reasoning above is no 
>> longer
>> + * true.  Are there other good reasons to hard-code this, or should we 
>> revisit
>> + * that decision?
>
> We could just implement the function by looking in the shared
> relmapper. Everything that can be mapped via it is shared.

I suspect there are several possible sources for this information, but
it's hard to beat a hard-coded list for efficiency, so I wasn't sure
if we should tinker with this or not.

>> --- a/src/backend/commands/cluster.c
>> +++ b/src/backend/commands/cluster.c
>> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid 
>> indexOid, bool recheck, LOCKMOD
>>   * against concurrent SnapshotNow scans of pg_index.  Therefore this is 
>> unsafe
>>   * to execute with less than full exclusive lock on the parent table;
>>   * otherwise concurrent executions of RelationGetIndexList could miss 
>> indexes.
>> + *
>> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index
>> + * shouldn't be common enough to worry about.  The above comment needs
>> + * to be updated, and it may be possible to simplify the logic here in other
>> + * ways also.
>>   */
>
> You're right, the comment needs to be changed, but I don't think the
> effect can. A non-inplace upgrade changes the xmin of the row which is
> relevant for indcheckxmin.

OK.

> (In fact, isn't this update possibly causing problems like delaying the
> use of such an index already)

Well, maybe.  In general, the ephemeral snapshot taken for a catalog
scan can't be any older than the primary snapshot already held.  But
there could be some corner case where that's not true, if we use this
technique somewhere that such a snapshot hasn't already been acquired.

> Hm. Looks like this should also change the description of SecondarySnapshot:
>
> /*
>  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
>  * mode, and to the latest one taken in a read-committed transaction.
>  * SecondarySnapshot is a snapshot that's always up-to-date as of the current
>  * instant, even in transaction-snapshot mode.  It should only be used for
>  * special-purpose code (say, RI checking.)
>  *

I think that's still more or less true, though we could add catalog
scans as another example.

> and
> /*
>  * Checking SecondarySnapshot is probably useless here, but it seems
>  * better to be sure.
>  */

Yeah, that is not useless any more, for sure.

> Also looks like BuildEventTriggerCache() in evtcache.c should use
> GetInstantSnapshot() now.

OK.

> I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None
> of the callers seem to rely on its behaviour from a quick look and it
> seems rather confusing to have both.

I assume Tom had some reason for making GetLatestSnapshot() behave the
way it does, so I refrained from doing that.  I might be wrong.

> I think we need another term for what SnapshotNow used to express
> here... Imo this description got less clear with this change.

I thought it was OK but I'm open to suggestions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Statement timeout logging

2013-06-06 Thread Pavel Stehule
2013/6/6 Thom Brown :
> Hi,
>
> When a statement is cancelled due to it running for long enough for
> statement_timeout to take effect, it logs a message:
>
> ERROR:  canceling statement due to statement timeout
>
> However, it doesn't log what the timeout was at the time of the
> cancellation.  This may be set in postgresql.conf, the database, or on
> the role, but unless log_line_prefix is set to show the database name
> and the user name, there's no reliable way of finding out what context
> the configuration applied from.  Setting log_duration won't help
> either because that only logs the duration of completed queries.
>
> Should we output the statement_timeout value when a query is cancelled?

+1

We use the same feature in GoodData. Our long queries are cancelled by
users, and we need to know how long a user was willing to wait.

Regards

Pavel

>
> --
> Thom
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: ExecNodeExtender

2013-06-06 Thread Robert Haas
On Tue, Jun 4, 2013 at 2:50 PM, Kohei KaiGai  wrote:
> Also, I don't think ExecNodeExtender is a good name, because it is a bit
> long and the abbreviation (ENE?) doesn't bring the feature to mind. Please
> give this feature a cool and easily understandable name.

I agree that "Extender" doesn't sound right.  "Extension" would
probably be the right part of speech, but that has multiple meanings
that might confuse the issue.  (Does CREATE EXTENSION take the
relation extension lock?  And don't forget PostgreSQL extensions to
the SQL standard!)

I'm wondering if we ought to use something like "Custom" instead, so
that we'd end up with ExecInitCustom(), ExecCustom(), ExecEndCustom().
 I think that would make it more clear to the casual reader that this
is a hook for user-defined code.

Other bike-shedding?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Processing long AND/OR lists

2013-06-06 Thread Gurjeet Singh
On Mon, May 27, 2013 at 10:32 AM, Christopher Browne wrote:

> On Mon, May 27, 2013 at 1:42 AM, Gurjeet Singh  wrote:
>
>>
>>
>>> Joking about "640K" aside, it doesn't seem reasonable to expect a truly
>>> enormous query as is generated by the broken forms of this logic to turn
>>> out happily.  I'd rather fix Slony (as done in the above patch).
>>>
>>
>> Yes, by all means, fix the application, but that doesn't preclude the
>> argument that the database should be a bit more smarter and efficient,
>> especially if it is easy to do.
>
>
> Agreed, it seems like a fine idea to have the database support such
> queries, as this eases coping with applications that might be more
> difficult to get fixed.
>

Seeing no more objections to it, I am going to add this patch to the
commitfest. Attached is an updated patch against latest master; it's the same
as the previous version, except that the patch now includes a fix for
the failing test case as well.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Statement timeout logging

2013-06-06 Thread Thom Brown
Hi,

When a statement is cancelled due to it running for long enough for
statement_timeout to take effect, it logs a message:

ERROR:  canceling statement due to statement timeout

However, it doesn't log what the timeout was at the time of the
cancellation.  This may be set in postgresql.conf, the database, or on
the role, but unless log_line_prefix is set to show the database name
and the user name, there's no reliable way of finding out what context
the configuration applied from.  Setting log_duration won't help
either because that only logs the duration of completed queries.

Should we output the statement_timeout value when a query is cancelled?

--
Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 17:17, Andres Freund wrote:

On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote:

A more workable idea is to sprinkle checks in higher-level code, before you
hold any critical locks, to check that there is enough preallocated WAL.
Like, at the beginning of heap_insert, heap_update, etc., and all similar
indexam entry points. I propose that we maintain a WAL reservation system in
shared memory.


I am rather doubtful about this: I suspect it will end up as a bunch of
complex code that doesn't prevent the situation in all circumstances but
does cause bugs/performance problems for some time.
Obviously that's just gut feeling since I haven't seen the code...


I also have a feeling that we'll likely miss some corner cases in the 
first cut, so that you can still run out of disk space if you try hard 
enough / are unlucky. But I think it would still be a big improvement if 
it only catches, say 90% of the cases.


I think it can be made fairly robust otherwise, and the performance 
impact should be pretty easy to measure with e.g pgbench.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] all_visible replay aborting due to uninitialized pages

2013-06-06 Thread Andres Freund
On 2013-06-06 10:22:14 -0400, Robert Haas wrote:
> On Thu, May 30, 2013 at 2:29 AM, Andres Freund  wrote:
> >> Yeah, I think it's fine.  The patch also looks fine, although I think
> >> the comments could use a bit of tidying.  I guess we need to
> >> back-patch this all the way back to 8.4?  It will require some
> >> adjustments for the older branches.
> >
> > I think 9.2 is actually far enough and it should apply there. Before
> > that we only logged the unsetting of all_visible via
> > heap_(insert|update|delete)'s WAL records, not the setting, as far as I can
> > tell. So I don't immediately see a danger < 9.2.
> 
> OK.  I have committed this.  For 9.2, I had to backport
> log_newpage_buffer() and use XLByteEQ rather than ==.

Thanks!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade use of --check and --link

2013-06-06 Thread Alvaro Herrera
Bruce Momjian wrote:
> In a private bug report, I have realized that if you are eventually
> going to be using link mode with pg_upgrade, and you run --check mode,
> you should use --link with --check to check that both clusters are on
> the same file system.

Would it make sense to run the filesystem check anyway, and if the
check fails say something like "this will fail in --link mode, although
copy mode might work" ?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] all_visible replay aborting due to uninitialized pages

2013-06-06 Thread Robert Haas
On Thu, May 30, 2013 at 2:29 AM, Andres Freund  wrote:
>> Yeah, I think it's fine.  The patch also looks fine, although I think
>> the comments could use a bit of tidying.  I guess we need to
>> back-patch this all the way back to 8.4?  It will require some
>> adjustments for the older branches.
>
> I think 9.2 is actually far enough and it should apply there. Before
> that we only logged the unsetting of all_visible via
> heap_(insert|update|delete)'s WAL records, not the setting, as far as I can
> tell. So I don't immediately see a danger < 9.2.

OK.  I have committed this.  For 9.2, I had to backport
log_newpage_buffer() and use XLByteEQ rather than ==.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Andres Freund
On 2013-06-06 17:00:30 +0300, Heikki Linnakangas wrote:
> A more workable idea is to sprinkle checks in higher-level code, before you
> hold any critical locks, to check that there is enough preallocated WAL.
> Like, at the beginning of heap_insert, heap_update, etc., and all similar
> indexam entry points. I propose that we maintain a WAL reservation system in
> shared memory.

I am rather doubtful about this: I suspect it will end up as a bunch of
complex code that doesn't prevent the situation in all circumstances but
does cause bugs/performance problems for some time.
Obviously that's just gut feeling since I haven't seen the code...

I am much more excited about getting the soft limit case right and then
seeing how many problems remain in reality.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_upgrade use of --check and --link

2013-06-06 Thread Bruce Momjian
In a private bug report, I have realized that if you are eventually
going to be using link mode with pg_upgrade, and you run --check mode,
you should use --link with --check to check that both clusters are on
the same file system.

I have documented this with the attached, applied patch, and backpatched
it to 9.2.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-06 Thread Heikki Linnakangas
In the "Redesigning checkpoint_segments" thread, many people opined that 
there should be a hard limit on the amount of disk space used for WAL: 
http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6s4-ppd+6jiy4oq78i...@mail.gmail.com. 
I'm starting a new thread on that, because that's mostly orthogonal to 
redesigning checkpoint_segments.


The current situation is that if you run out of disk space while writing 
WAL, you get a PANIC, and the server shuts down. That's awful. We can 
try to avoid that by checkpointing early enough, so that we can remove 
old WAL segments to make room for new ones before you run out, but 
unless we somehow throttle or stop new WAL insertions, it's always going 
to be possible to use up all disk space. A typical scenario where that 
happens is when archive_command fails for some reason; even a checkpoint 
can't remove old, unarchived segments in that case. But it can happen 
even without WAL archiving.


I've seen a case, where it was even worse than a PANIC and shutdown. 
pg_xlog was on a separate partition that had nothing else on it. The 
partition filled up, and the system shut down with a PANIC. Because 
there was no space left, it could not even write the checkpoint after 
recovery, and thus refused to start up again. There was nothing else on 
the partition that you could delete to make space. The only recourse 
would've been to add more disk space to the partition (impossible), or 
manually delete an old WAL file that was not needed to recover from the 
latest checkpoint (scary). Fortunately this was a test system, so we 
just deleted everything.


So we need to somehow stop new WAL insertions from happening, before 
it's too late.


Peter Geoghegan suggested one method here: 
http://www.postgresql.org/message-id/flat/cam3swzqcynxvpaskr-pxm8deqh7_qevw7uqbhpcsg1fpsxk...@mail.gmail.com. 
I don't think that exact proposal is going to work very well; throttling 
WAL flushing by holding WALWriteLock in WAL writer can have knock-on 
effects on the whole system, as Robert Haas mentioned. Also, it'd still 
be possible to run out of space, just more difficult.


To make sure there is enough room for the checkpoint to finish, other 
WAL insertions have to stop some time before you completely run out of 
disk space. The question is how to do that.


A naive idea is to check if there's enough preallocated WAL space, just 
before inserting the WAL record. However, it's too late to check that in 
XLogInsert; once you get there, you're already holding exclusive locks 
on data pages, and you are in a critical section so you can't back out. 
At that point, you have to write the WAL record quickly, or the whole 
system will suffer. So we need to act earlier.


A more workable idea is to sprinkle checks in higher-level code, before 
you hold any critical locks, to check that there is enough preallocated 
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all 
similar indexam entry points. I propose that we maintain a WAL 
reservation system in shared memory. First of all, keep track of how 
much preallocated WAL there is left (and try to create more if needed). 
Also keep track of a different number: the amount of WAL pre-reserved 
for future insertions. Before entering the critical section, increase 
the reserved number with a conservative estimate (ie. high enough) of 
how much WAL space you need, and check that there is still enough 
preallocated WAL to satisfy all the reservations. If not, throw an error 
or sleep until there is. After you're done with the insertion, release 
the reservation by decreasing the number again.
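
A minimal sketch of the reservation bookkeeping described above (names are
illustrative; the real thing would live in shared memory and be protected by a
spinlock or atomics, which are elided here):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct WALReserveShmem
    {
        uint64_t preallocated_bytes;  /* usable space in preallocated segments */
        uint64_t reserved_bytes;      /* space promised to in-flight insertions */
    } WALReserveShmem;

    static WALReserveShmem *walres;   /* points into shared memory */

    /* call before entering the critical section that will insert WAL */
    static bool
    reserve_wal(uint64_t worst_case_bytes)
    {
        walres->reserved_bytes += worst_case_bytes;
        if (walres->reserved_bytes > walres->preallocated_bytes)
        {
            walres->reserved_bytes -= worst_case_bytes;  /* back out */
            return false;   /* caller throws an error or sleeps and retries */
        }
        return true;
    }

    /* call after the WAL record has been inserted */
    static void
    release_wal(uint64_t worst_case_bytes)
    {
        walres->reserved_bytes -= worst_case_bytes;
    }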


A shared reservation counter like that could become a point of 
contention. One optimization is to keep a constant reservation of, say, 32 
KB for each backend. That's enough for most operations. Change the logic 
so that you check if you've exceeded the reserved amount of space 
*after* writing the WAL record, while you're holding WALInsertLock 
anyway. If you do go over the limit, set a flag in backend-private 
memory indicating that the *next* time you're about to enter a critical 
section where you will write a WAL record, you check again if more space 
has been made available.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioning performance: cache stringToNode() of pg_constraint.ccbin

2013-06-06 Thread Amit Kapila
On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote:
> A colleague, Korry Douglas, observed a table partitioning scenario
> where deserializing pg_constraint.ccbin is a hot spot.  The following
> test case, a simplification of a typical partitioning setup, spends 28%
> of its time in
> stringToNode() and callees thereof:
> 
>
> 
> 
> The executor caches each CHECK constraint in ResultRelInfo as a planned
> expression.  That cache is highly effective for long-running
> statements, but the trivial INSERTs effectively work without a cache.
> Korry devised this patch to cache the stringToNode() form of the
> constraint in the relcache.  It improves the benchmark's partitioned
> scenario by 33%:
> 
> -- Timings (seconds) --
> master, INSERT parent:   14.2, 14.4, 14.4
> patched, INSERT parent:  9.6,  9.7,  9.7
> 
> master, INSERT*10 child: 9.9,  9.9,  10.2
> patched, INSERT*10 child:10.0, 10.2, 10.2
> 
> There's still not much to like about that tenfold overhead from use of
> the partition routing trigger, but this patch makes a nice cut into
> that overhead without doing anything aggressive.  

This patch can give a good performance gain in the scenario you described.
In fact, I have taken readings with the patch, and it shows a similar gain.

-- Timings (seconds) --
master, INSERT parent:   14.9, 15.4, 15.4
patched, INSERT parent:  9.9,  9.6,  9.5

master, INSERT*10 child: 13.8,  14.5, 15.6
patched, INSERT*10 child:13.0,  14.3, 14.6

This patch would increase the relcache size, as each constraint on a table
would add 4 bytes irrespective of whether that gives a performance benefit
or not.
Why, in the function CheckConstraintFetch(), is the node not formed from the string?

> 
> Some call sites need to modify the node tree, so the patch has them do
> copyObject().  I ran a microbenchmark of copyObject() on the cached
> node tree vs. redoing stringToNode(), and copyObject() still won by a
> factor of four.

I have not tried any performance run to measure whether the extra copyObject()
adds any benefit.
What kind of benchmark did you use to validate it?
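
To make the caching idea concrete, here is a sketch of what an accessor could
look like (rd_checkNodes is a hypothetical relcache field; the actual patch's
layout may differ):

    /* deserialize ccbin once per relcache entry, hand out copies to callers */
    static Node *
    get_check_constraint_expr(Relation rel, int i)
    {
        ConstrCheck *check = &rel->rd_att->constr->check[i];

        if (rel->rd_checkNodes[i] == NULL)
        {
            /* the cached tree must live as long as the relcache entry itself */
            MemoryContext oldcxt = MemoryContextSwitchTo(CacheMemoryContext);

            rel->rd_checkNodes[i] = stringToNode(check->ccbin);
            MemoryContextSwitchTo(oldcxt);
        }

        /* callers that may modify the tree work on a copy */
        return copyObject(rel->rd_checkNodes[i]);
    }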

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make targets of doc links used by phpPgAdmin static

2013-06-06 Thread Karl O. Pinc
On 06/05/2013 09:13:45 PM, Peter Eisentraut wrote:
> On Tue, 2013-06-04 at 22:27 -0500, Karl O. Pinc wrote:
> > On 06/04/2013 10:16:20 PM, Peter Eisentraut wrote:
> > > On Tue, 2013-05-07 at 23:18 -0400, Alvaro Herrera wrote:
> > > > Peter Eisentraut wrote:
> > > > > On Tue, 2013-05-07 at 00:32 -0500, Karl O. Pinc wrote:
> > > > > > Attached is a documentation patch against head which makes
> > > > > > static the targets of the on-line PG html documentation that
> > > > > > are referenced by the phpPgAdmin help system.
> > > > > 
> > > > > done
> > > > 
> > > > I wonder about backpatching this to 9.2 ?
> > > 
> > > done
> > 
> > Will this be in the next point release?  Or just when
> > will it go live?
> 
> I don't know when it goes to the web site, but it will be in the next
> point release.

Ok.  Thanks.

> 
> > This is not a huge problem but it does break some
> > existing links into the 9.2 PG docs.
> 
> Well, if it doesn't help you, I can back it out again.

It doesn't make me miserable and it sounds like
other people want it.  In theory the automatically
generated anchors could change and break things
anyway.  I'm happy to let somebody
else decide what to do.

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 15:31, Kevin Grittner wrote:

Heikki Linnakangas  wrote:

On 05.06.2013 22:18, Kevin Grittner wrote:

Heikki Linnakangas   wrote:


I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.


Then I suggest we not use exactly that name.  I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.


You're probably right. Any suggestions for a better name?
wal_size_soft_limit?


After reading later posts on the thread, I would be inclined to
support making it a hard limit and adapting the behavior to match.


Well, that's a lot more difficult to implement. And even if we have a 
hard limit, I think many people would still want to have a soft limit 
that would trigger a checkpoint, but would not stop WAL writes from 
happening. So what would we call that?


I'd love to see a hard limit too, but I see that as an orthogonal feature.

How about calling the (soft) limit "checkpoint_wal_size"? That goes well 
together with checkpoint_timeout, meaning that a checkpoint will be 
triggered if you're about to exceed the given size.



I'm also concerned about the "spin up" from idle to high activity.
Perhaps a "min" should also be present, to mitigate repeated short
checkpoint cycles for "bursty" environments?


With my proposal, you wouldn't get repeated short checkpoint cycles with 
bursts. The checkpoint interval would be controlled by 
checkpoint_timeout, and checkpoint_wal_size. If there is a lot of 
activity, then checkpoints will happen more frequently, as 
checkpoint_wal_size is reached sooner. But it would not depend on the 
activity in previous checkpoint cycles, only the current one, so it 
would not make a difference if you have a continuously high load, or a 
bursty one.


The history would matter for the calculation of how many segments to 
preallocate/recycle, however. Under the proposal, that would be 
calculated separately from checkpoint_wal_size, and for that we'd use 
some kind of a moving average of how many segments were used in previous 
cycles. A min setting might be useful for that. We could also try to 
make WAL file creation cheaper, ie. by using posix_fallocate(), as was 
proposed in another thread, and doing it in bgwriter or walwriter. That 
would make it less important to get the estimate right, from a 
performance point of view, although you'd still want to get it right to 
avoid running out of disk space (having the segments preallocated 
ensures that they are available when needed).
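
For what it's worth, a sketch of preallocating a 16 MB segment that way
(error handling abbreviated; not the actual proposal's code):

    #include <fcntl.h>
    #include <unistd.h>

    #define WAL_SEG_SIZE (16 * 1024 * 1024)

    static int
    preallocate_wal_segment(const char *path)
    {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
        int rc;

        if (fd < 0)
            return -1;
        rc = posix_fallocate(fd, 0, WAL_SEG_SIZE);   /* returns 0 on success */
        if (rc == 0)
            rc = fsync(fd);
        close(fd);
        return rc;
    }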


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 15:16, Greg Stark wrote:

On Fri, May 31, 2013 at 3:04 AM, Robert Haas  wrote:

Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours.  That's not fun.


I've forgotten now. What happens if you have a long-lived transaction
still alive from>  2B xid ago?


That will keep OldestXmin from advancing. Which will keep vacuum from 
advancing relfrozenxid/datfrozenxid. Which will first trigger the 
warnings about wrap-around, then stop new XIDs from being generated, 
and finally force a shutdown.


The forced shutdown will actually happen some time before going beyond 2 
billion XIDs. So it is not possible to have a long-lived transaction, 
older than 2 B XIDs, still live in the system. But let's imagine that 
you somehow bypass the safety mechanism:


After wraparound, old tuples will look like being in the future, and 
will become invisible to new transactions. That happens even if there 
are no old transactions around. I'm not sure what exactly will happen if 
there is still a transaction alive with an XID and/or snapshots older 
than 2^31 XIDs. New tuples that are not supposed to be visible to the 
old snapshot would suddenly become visible, I guess.
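
The reason old tuples "move into the future" is that XIDs are compared with
modulo-2^32 arithmetic, so anything more than 2^31 transactions in the past
becomes indistinguishable from the future. A sketch mirroring the logic of
TransactionIdPrecedes() for normal XIDs:

    #include <stdint.h>
    #include <stdbool.h>

    static bool
    xid_precedes(uint32_t id1, uint32_t id2)
    {
        int32_t diff = (int32_t) (id1 - id2);

        return diff < 0;   /* id1 counts as "older" only within a 2^31 window */
    }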


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Kevin Grittner
Heikki Linnakangas  wrote:
> On 05.06.2013 22:18, Kevin Grittner wrote:
>> Heikki Linnakangas  wrote:
>>
>>> I was not thinking of making it a hard limit. It would be just
>>> like checkpoint_segments from that point of view - if a
>>> checkpoint takes a long time, max_wal_size might still be
>>> exceeded.
>>
>> Then I suggest we not use exactly that name.  I feel quite sure we
>> would get complaints from people if something labeled as "max" was
>> exceeded -- especially if they set that to the actual size of a
>> filesystem dedicated to WAL files.
>
> You're probably right. Any suggestions for a better name?
> wal_size_soft_limit?

After reading later posts on the thread, I would be inclined to
support making it a hard limit and adapting the behavior to match.
I'm pretty sure I've seen at least one case where a separate
filesystem has been allocated for WAL which has been unexpectedly
filled.  People would like some way to deal with that.

I'm also concerned about the "spin up" from idle to high activity.
Perhaps a "min" should also be present, to mitigate repeated short
checkpoint cycles for "bursty" environments?

-- 
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-06-06 Thread Greg Stark
On Fri, May 31, 2013 at 3:04 AM, Robert Haas  wrote:
> Even at a more modest 10,000 tps, with default
> settings, you'll do anti-wraparound vacuums of the entire cluster
> about every 8 hours.  That's not fun.

I've forgotten now. What happens if you have a long-lived transaction
still alive from > 2B xid ago?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 05.06.2013 22:24, Fujii Masao wrote:

On Thu, Jun 6, 2013 at 3:35 AM, Heikki Linnakangas
  wrote:

The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. Ie. if you consume segments faster than expected, the
checkpoint will speed up as well. Once checkpoint_segments is reached, the
checkpoint will complete ASAP, with no delays to spread it out.


Yep, right. One problem is that this mechanism doesn't work in the standby.


Sure it does:


commit 71815306e9e1ba7e95752779d2ad51d0c2b9c747
Author: Heikki Linnakangas 
Date:   Wed Jun 9 15:04:07 2010 +

In standby mode, respect checkpoint_segments in addition to
checkpoint_timeout to trigger restartpoints. We used to deliberately only
do time-based restartpoints, because if checkpoint_segments is small we
would spend time doing restartpoints more often than really necessary.
But now that restartpoints are done in bgwriter, they're not as
disruptive as they used to be. Secondly, because streaming replication
stores the streamed WAL files in pg_xlog, we want to clean it up more
often to avoid running out of disk space when checkpoint_timeout is large
and checkpoint_segments small.

Patch by Fujii Masao, with some minor changes by me.


One problem with that is that if you set checkpoint_segments (or 
max_wal_size, under the proposal) lower in the standby than in the 
master, we can't do restartpoints any more frequently than checkpoints 
have happened in the master. I wasn't planning to do anything about that.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 05.06.2013 22:18, Kevin Grittner wrote:

Heikki Linnakangas  wrote:


I was not thinking of making it a hard limit. It would be just
like checkpoint_segments from that point of view - if a
checkpoint takes a long time, max_wal_size might still be
exceeded.


Then I suggest we not use exactly that name.  I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.


You're probably right. Any suggestions for a better name? 
wal_size_soft_limit?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 11:42, Joshua D. Drake wrote:

On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.


The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used. If you
don't care about that, I guess we could allow max_wal_size=-1 to mean
infinite, and checkpoints would be driven off purely based on time,
not WAL consumption.


I would not only agree with that, I would argue that max_wal_size
doesn't need to be there at least as a default. Perhaps as an "advanced"
configuration option that only those in the know see.


Well, we have checkpoint_segments=3 as the default currently, which in 
the proposed scheme would be about equal to max_wal_size=120MB. For 
better or worse, our defaults are generally geared towards small 
systems, and that sounds about right for that.



Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.


To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently, both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed to do
for the preallocation/recycling, it would be tuned automatically, but
you still need something like max_wal_size for the other thing, to
trigger a checkpoint if too much WAL is being consumed.


You think so? I agree with 90% of this paragraph but it seems to me that
we can find an algorithm that manages this without the idea of
max_wal_size (at least as a user-settable option).


We are in a violent agreement :-). max_wal_size would not directly 
affect the preallocation of segments. The preallocation would be driven 
off the actual number of segments used in previous checkpoint cycles, 
not on max_wal_size.


Now, max_wal_size would affect when checkpoints happen (ie. if you're 
about to reach max_wal_size, a checkpoint would be triggered), which 
would in turn affect the number of segments used between cycles. But 
there would be no direct connection between the two; the code to 
calculate how much to preallocate would not refer to max_wal_size.
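
One way the soft limit could map to a trigger condition (the headroom factor
and names are illustrative, not settled design): between checkpoints roughly
two full cycles' worth of WAL plus the in-progress spread checkpoint exist on
disk, so the trigger distance has to be a fraction of the limit:

    #include <stdint.h>
    #include <stdbool.h>

    static bool
    wal_size_checkpoint_needed(uint64_t bytes_since_last_redo,
                               uint64_t max_wal_size_bytes,
                               double completion_target)
    {
        /* prior cycle + current cycle + WAL written while the checkpoint runs */
        double headroom = 2.0 + completion_target;

        return bytes_since_last_redo >= (uint64_t) (max_wal_size_bytes / headroom);
    }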


Maybe max_wal_size should set an upper limit on how much to preallocate, 
though. If you want to limit the WAL size, we probably shouldn't exceed 
it on purpose by preallocating segments, even if the algorithm based on 
previous cycles suggests says we should. This situation would arise if 
the checkpoints can't keep up, so that each checkpoint cycle is longer 
than we'd want, and we'd exceed max_wal_size because of that.



This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC?


Because you want to limit the amount of disk space used for WAL. It's
a soft limit, but still.


Why? This is the point that confuses me. Why do we care? We don't care
how much disk space PGDATA takes... why do we all of a sudden care about
pg_xlog?


Hmm, dunno. We have always had the checkpoint_segments setting to limit 
that, I was just thinking of retaining that functionality.


A few reasons spring to mind: First, running out of WAL space leads to a 
PANIC, which is not nice (I know, we talked about fixing that). 
Secondly, because we can. If a user inserts 10 GB of data into a table, 
we'll have to just store it, but with WAL, we can always issue a 
checkpoint to shrink it. People have asked for quotas for user data too, 
so some people do want to limit disk usage.


Mind you, it's possible to have a tiny database with a high TPS rate, 
such that the WAL grows really big compared to the size of the user 
data. Something with a small hot table that's updated a lot. In such a 
scenario, limiting the WAL size make sense, and it won't affect 
performance much either because checkpointing a small database is very 
cheap.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] About large objects asynchronous and non-blocking support

2013-06-06 Thread Dmitriy Igrishin
2013/6/6 Tatsuo Ishii 

> > Hi.
> >
> > At the moment libpq doesn't seem to support asynchronous and
> > non-blocking support for large objects, in the style of
> > PQsendQuery/PQgetResult. This makes large objects hardly suited for
> > single-threaded programs based on some variant of select().
> >
> > I would like to know whether this is a deliberate decision or it is
> > considered a bug, and, in case, whether it is scheduled to be fixed.
>
> Certainly not a bug, since the doc clearly states that PQsendQuery can
> only be used as a substitute for PQexec.  (See the "Asynchronous Command
> Processing" section for more details.)  The large object API is
> completely different from PQexec and its friends, so it cannot be used
> with PQsendQuery.
>
> Talking about more details, PQexec and PQsendQuery are designed to
> handle only the "Q" message of the PostgreSQL frontend/backend protocol,
> while to access large objects, you need to handle "V" message.
>
Really? I've specialized a standard C++ std::streambuf class using only the
extended query protocol (prepared statements via PQsendPrepare and
PQsendQueryPrepared) to call SQL functions like loread(), lowrite(),
lo_tell(), etc.
All these functions just need to be called inside a BEGIN block. And yes,
it can be done asynchronously.

>
> > Though I cannot guarantee anything, I may be interested into working out
> > a patch, if no one is already doing the same (of course I understand
> > that this patch wouldn't be for 9.3, which is already in its late
> > release cycle).
> >
> > Do you think this may be of interest?
>
> Yes, I understand your pain, and I myself think we need new APIs for
> large objects. Probably that would not be terribly hard. One idea
> would be inventing an asynchronous version of PQfn and letting
> lo_read/lo_write use the new API.
>
Yes, but according to
http://www.postgresql.org/docs/9.2/static/protocol-flow.html#AEN95330
and/or http://www.postgresql.org/docs/9.2/static/libpq-fastpath.html
the function call sub-protocol is obsolete. That's why I personally decided to
use prepared statements.
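
As a sketch of that approach, the large object functions can be driven through
the ordinary asynchronous query API (parameter encoding, the surrounding
BEGIN/COMMIT and error handling are abbreviated):

    #include <stdio.h>
    #include <libpq-fe.h>

    static int
    send_loread(PGconn *conn, int fd, int len)
    {
        char fd_buf[16], len_buf[16];
        const char *values[2] = { fd_buf, len_buf };

        snprintf(fd_buf, sizeof(fd_buf), "%d", fd);
        snprintf(len_buf, sizeof(len_buf), "%d", len);

        /* returns 1 on successful dispatch; afterwards poll the socket and use
         * PQconsumeInput()/PQisBusy()/PQgetResult() as usual */
        return PQsendQueryParams(conn,
                                 "SELECT loread($1::int4, $2::int4)",
                                 2, NULL, values, NULL, NULL,
                                 1 /* request a binary result */);
    }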

-- 
// Dmitriy.


Re: [HACKERS] MVCC catalog access

2013-06-06 Thread Andres Freund
Hi Robert,

Took a quick look through the patch to understand what your current
revision is actually doing and to facilitate thinking about possible
pain points.

Here are the notes I made during my reading:

On 2013-06-03 14:57:12 -0400, Robert Haas wrote:
> +++ b/src/backend/catalog/catalog.c
> @@ -232,6 +232,10 @@ IsReservedName(const char *name)
>   * know if it's shared.  Fortunately, the set of shared relations is
>   * fairly static, so a hand-maintained list of their OIDs isn't completely
>   * impractical.
> + *
> + * XXX: Now that we have MVCC catalog access, the reasoning above is no 
> longer
> + * true.  Are there other good reasons to hard-code this, or should we 
> revisit
> + * that decision?
>   */

We could just implement the function by looking in the shared
relmapper. Everything that can be mapped via it is shared.

> --- a/src/backend/commands/cluster.c
> +++ b/src/backend/commands/cluster.c
> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid 
> indexOid, bool recheck, LOCKMOD
>   * against concurrent SnapshotNow scans of pg_index.  Therefore this is 
> unsafe
>   * to execute with less than full exclusive lock on the parent table;
>   * otherwise concurrent executions of RelationGetIndexList could miss 
> indexes.
> + *
> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index
> + * shouldn't be common enough to worry about.  The above comment needs
> + * to be updated, and it may be possible to simplify the logic here in other
> + * ways also.
>   */

You're right, the comment needs to be changed, but I don't think the
effect can. A non-inplace upgrade changes the xmin of the row which is
relevant for indcheckxmin.
(In fact, isn't this update possibly causing problems like delaying the
use of such an index already)


> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -2738,7 +2738,7 @@ AlterTableGetLockLevel(List *cmds)
>* multiple DDL operations occur in a stream against frequently accessed
>* tables.
>*
> -  * 1. Catalog tables are read using SnapshotNow, which has a race bug 
> that
> +  * 1. Catalog tables were read using SnapshotNow, which has a race bug 
> that

Heh.

> --- a/src/backend/utils/time/snapmgr.c
> +++ b/src/backend/utils/time/snapmgr.c
> @@ -207,6 +207,19 @@ GetLatestSnapshot(void)
>  /*
> + * GetInstantSnapshot
> + *   Get a snapshot that is up-to-date as of the current instant,
> + *   but don't set the transaction snapshot.
> + */
> +Snapshot
> +GetInstantSnapshot(void)
> +{
> + SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData);
> +
> + return SecondarySnapshot;
> +}

Hm. Looks like this should also change the description of SecondarySnapshot:

/*
 * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
 * mode, and to the latest one taken in a read-committed transaction.
 * SecondarySnapshot is a snapshot that's always up-to-date as of the current
 * instant, even in transaction-snapshot mode.  It should only be used for
 * special-purpose code (say, RI checking.)
 *
and
/*
 * Checking SecondarySnapshot is probably useless here, but it seems
 * better to be sure.
 */

Also looks like BuildEventTriggerCache() in evtcache.c should use
GetInstantSnapshot() now.

I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None
of the callers seem to rely on its behaviour from a quick look and it
seems rather confusing to have both.

> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -14,13 +14,13 @@
>   *   Note that pg_dump runs in a transaction-snapshot mode transaction,
>   *   so it sees a consistent snapshot of the database including system
>   *   catalogs. However, it relies in part on various specialized backend
> - *   functions like pg_get_indexdef(), and those things tend to run on
> - *   SnapshotNow time, ie they look at the currently committed state.  So
> - *   it is possible to get 'cache lookup failed' error if someone
> - *   performs DDL changes while a dump is happening. The window for this
> - *   sort of thing is from the acquisition of the transaction snapshot to
> - *   getSchemaData() (when pg_dump acquires AccessShareLock on every
> - *   table it intends to dump). It isn't very large, but it can happen.
> + *   functions like pg_get_indexdef(), and those things tend to look at
> + *   the currently committed state.  So it is possible to get 'cache
> + *   lookup failed' error if someone performs DDL changes while a dump is
> + *   happening. The window for this sort of thing is from the acquisition
> + *   of the transaction snapshot to getSchemaData() (when pg_dump acquires
> + *   AccessShareLock on every table it intends to dump). It isn't very large,
> + *   but it can happen.

I think we need another term for what SnapshotNow used to express
here... Imo this description got less clear with this change.

Greetings,

Andres 

Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Joshua D. Drake


On 6/6/2013 1:11 AM, Heikki Linnakangas wrote:


(I'm sure you know this, but:) If you perform a checkpoint as fast and 
short as possible, the sudden burst of writes and fsyncs will 
overwhelm the I/O subsystem, and slow down queries. That's what we saw 
before spread checkpoints: when a checkpoint happens, the response 
times of queries jumped up.


That isn't quite right. Previously we had lock issues as well, and 
checkpoints would take considerable time to complete. What I am talking 
about is that the background writer (and wal writer where applicable) 
has done all the work before a checkpoint is even called. Consider that 
every one of my clients that I am active with sets 
checkpoint_completion_target to 0.9. With a proper bgwriter config this 
works.





4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.


Oh, I see where you're going.


O.k. good. I am not nuts :D
Yeah, that would be one way to do it. However, spread checkpoints has 
pretty much the same effect. Imagine that you tune your system like 
this: disable bgwriter altogether, and set 
checkpoint_completion_target=0.9. With that, there will be a 
checkpoint in progress most of the time, because by the time one 
checkpoint completes, it's almost time to begin the next one already. 
In that case, the checkpointer will be slowly performing the writes, 
all the time, in the background, without affecting queries. The effect 
is the same as what you described above, except that it's the 
checkpointer doing the writing, not bgwriter.


O.k. if that is true, then we have redundant systems and we need to 
remove one of them.
Yeah, wal_keep_segments is a hack. We should replace it with something 
else, like having a registry of standbys in the master, and how far 
they've streamed. That way the master could keep around the amount of 
WAL actually needed by them, not more not less. But that's a different 
story.



Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.


The point of having checkpoint_segments or max_wal_size is to put a 
limit (albeit a soft one) on the amount of disk space used. If you 
don't care about that, I guess we could allow max_wal_size=-1 to mean 
infinite, and checkpoints would be driven off purely based on time, 
not WAL consumption.




I would not only agree with that, I would argue that max_wal_size 
doesn't need to be there, at least not as a default. Perhaps it could be an 
"advanced" configuration option that only those in the know see.




Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.


To clarify, here you're referring to controlling the number of WAL 
segments preallocated/recycled, rather than how often checkpoints are 
triggered. Currently, both are derived from checkpoint_segments, but I 
proposed to separate them. The above is exactly what I proposed to do 
for the preallocation/recycling, it would be tuned automatically, but 
you still need something like max_wal_size for the other thing, to 
trigger a checkpoint if too much WAL is being consumed.


You think so? I agree with 90% of this paragraph, but it seems to me that 
we can find an algorithm that manages this without the idea of 
max_wal_size (at least not as a user-settable option).



Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?


Either I'm not understanding what you said, or you're confused. The 
point of checkpoint_timeout is to put a limit on the time it will take to 
recover in case of a crash. The relation between the two, 
checkpoint_timeout and how long it will take to recover after a crash, 
is not straightforward, but that's the best we have.


I may be confused, but it is my understanding that bgwriter writes out 
dirty data from the shared buffer cache based on an interval and a 
maximum number of pages written. If we are writing data continuously, 
we don't need checkpoints except for special cases (like pg_start_backup())?


Bgwriter does not worry about checkpoints. By "amount of data 
currently in need of a checkpoint", do you mean the number of dirty 
buffers in shared_buffers, or something else? I don't see how or why 
that should affect when you perform a checkpoint.



Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get clo

Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 05.06.2013 23:16, Josh Berkus wrote:

For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.


This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).


Right, it doesn't. I explained this to justify that it's OK to replace 
checkpoint_segments with max_wal_size. If someone is trying to use 
checkpoint_segments to limit the time required to recover after crash, 
he might find the current checkpoint_segments setting more intuitive 
than my proposed max_wal_size. checkpoint_segments means "perform a 
checkpoint every X segments", so you know that after a crash, you will 
have to replay at most X segments (except that 
checkpoint_completion_target complicates that already). With 
max_wal_size, the relationship is not as clear.


What I tried to argue is that I don't think that's a serious concern.


I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.


Refinement of the proposal:

1. max_wal_size is a hard limit


I'd like to punt on that until later. Making it a hard limit would be a 
much bigger patch, and needs a lot of discussion about how it should behave 
(switch to read-only mode, progressively slow down WAL writes, or what?) 
and how to implement it.


But I think there's a clear evolution path here; with current 
checkpoint_segments, it's not sensible to treat that as a hard limit. 
Once we have something like max_wal_size, defined in MB, it's much more 
sensible. So turning it into a hard limit could be a follow-up patch, if 
someone wants to step up to the plate.



2. checkpointing targets 50% of ( max_wal_size - wal_keep_segments )
to avoid lockup if checkpoint takes longer than expected.


Will also have to factor in checkpoint_completion_target.


Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.
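
To make point 2 above concrete, the trigger point might be derived along
these lines (a sketch only; the function name and exact formula are
illustrative, not from any patch):

/*
 * Sketch: derive the number of WAL segments after which to start a
 * checkpoint, given max_wal_size in MB.  With spread checkpoints, WAL
 * from roughly (1 + checkpoint_completion_target) cycles can be live at
 * once, plus whatever wal_keep_segments pins, so leave headroom for that.
 */
static int
CalculateCheckpointTriggerSegments(int max_wal_size_mb,
                                   int wal_keep_segments,
                                   double checkpoint_completion_target)
{
    int     max_segments = max_wal_size_mb / (XLOG_SEG_SIZE / (1024 * 1024));
    int     usable = max_segments - wal_keep_segments;

    return Max((int) (usable / (1.0 + checkpoint_completion_target)), 1);
}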


Well, the ideal unit from the user's point of view is *time*, not space.
  That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB.  I don't want to complicate
this proposal by trying to deliver that, though.


OTOH, if you specify it in terms of time, then you don't have any limit 
on the amount of disk space required.



In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.


"based on"; can you give me your algorithmic thinking here?  I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.


Yeah, something like that :-). I was thinking of letting the estimate 
decrease like a moving average, but react to any increases immediately. 
Same thing we do in bgwriter to track buffer allocations:



/*
 * Track a moving average of recent buffer allocations.  Here, rather than
 * a true average we want a fast-attack, slow-decline behavior: we
 * immediately follow any increase.
 */
if (smoothed_alloc <= (float) recent_alloc)
    smoothed_alloc = recent_alloc;
else
    smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
        smoothing_samples;
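
Adapted to WAL preallocation, the same idea might look roughly like this
(purely a sketch; all names are made up, and the resulting estimate would
then drive how many recycled segments to keep around):

/*
 * Sketch only: track how much WAL one checkpoint cycle consumed, with the
 * same fast-attack, slow-decline smoothing bgwriter uses above.
 */
#define WAL_ESTIMATE_SMOOTHING_SAMPLES 16

static double CheckPointWALEstimate = 0;    /* bytes written per checkpoint cycle */

static void
UpdateWALEstimate(double wal_used_this_cycle)
{
    /* follow any increase immediately ... */
    if (CheckPointWALEstimate <= wal_used_this_cycle)
        CheckPointWALEstimate = wal_used_this_cycle;
    else
        /* ... but let the estimate decline only gradually */
        CheckPointWALEstimate += (wal_used_this_cycle - CheckPointWALEstimate) /
            WAL_ESTIMATE_SMOOTHING_SAMPLES;
}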



- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-06 Thread Andres Freund
On 2013-06-05 18:56:28 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > Now, I did find a couple that I thought should probably stick with
> > SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just
> > gathering statistical information, so there's no harm in having the
> > snapshot change part-way through the scan, and if the scan is long,
> > the user might actually regard the results under SnapshotNow as more
> > accurate.  Whether that's the case or not, holding back xmin for those
> > kinds of scans does not seem wise.
> 
> FWIW, I think if we're going to ditch SnapshotNow we should ditch
> SnapshotNow, full stop, even removing the tqual.c routines for it.
> Then we can require that *any* reference to SnapshotNow is replaced by
> an MVCC reference prior to execution, and throw an error if we actually
> try to test a tuple with that snapshot.

I suggest simply renaming it to something else. Every external project
that decides to follow the renaming either has a proper usecase for it
or the amount of sympathy for them keeping the old behaviour is rather
limited.

> If they really want that sort of uncertain semantics they could use
> SnapshotDirty, no?

Not that happy with that. A scan behaving inconsistently over its
proceedings is something rather different than reading uncommitted
rows. I have the feeling that trouble is waiting that way.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Heikki Linnakangas

On 06.06.2013 06:20, Joshua D. Drake wrote:

3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:


(I'm sure you know this, but:) If you perform a checkpoint as fast and 
short as possible, the sudden burst of writes and fsyncs will overwhelm 
the I/O subsystem, and slow down queries. That's what we saw before 
spread checkpoints: when a checkpoint happens, the response times of 
queries jumped up.



4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.


Oh, I see where you're going. Yeah, that would be one way to do it. 
However, spread checkpoints has pretty much the same effect. Imagine 
that you tune your system like this: disable bgwriter altogether, and 
set checkpoint_completion_target=0.9. With that, there will be a 
checkpoint in progress most of the time, because by the time one 
checkpoint completes, it's almost time to begin the next one already. In 
that case, the checkpointer will be slowly performing the writes, all 
the time, in the background, without affecting queries. The effect is 
the same as what you described above, except that it's the checkpointer 
doing the writing, not bgwriter.


As it happens, that's pretty much what you get with the default settings.


Now perhaps my customers workloads are different but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set them even further out.

2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based
on their IO constraints. (Note I know that wal_writer comes into play
here but I honestly don't remember where and am reading up on it to
refresh my memory).


I've heard people just turning off bgwriter because it doesn't have much 
effect anyway. You might want to try that, and if checkpoints cause I/O 
spikes, raise checkpoint_completion_target instead.



3. The biggest issue we see with checkpoint segments is not running out
of space because really 10GB is how many checkpoint segments? It is
with wal_keep_segments. If we don't want to fill up the pg_xlog
directory, put the wal logs that are for keep_segments elsewhere.


Yeah, wal_keep_segments is a hack. We should replace it with something 
else, like having a registry of standbys in the master, and how far 
they've streamed. That way the master could keep around the amount of 
WAL actually needed by them, not more not less. But that's a different 
story.



Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone.


The point of having checkpoint_segments or max_wal_size is to put a 
limit (albeit a soft one) on the amount of disk space used. If you don't 
care about that, I guess we could allow max_wal_size=-1 to mean 
infinite, and checkpoints would be driven off purely based on time, not 
WAL consumption.



Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.


To clarify, here you're referring to controlling the number of WAL 
segments preallocated/recycled, rather than how often checkpoints are 
triggered. Currently, both are derived from checkpoint_segments, but I 
proposed to separate them. The above is exactly what I proposed to do 
for the preallocation/recycling, it would be tuned automatically, but 
you still need something like max_wal_size for the other thing, to 
trigger a checkpoint if too much WAL is being consumed.



Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?


Either I'm not understanding what you said, or you're confused. The 
point of checkpoint_timeout is to put a limit on the time it will take to 
recover in case of a crash. The relation between the two, 
checkpoint_timeout and how long it will take to recover after a crash, 
is not straightforward, but that's the best we have.


Bgwriter does not worry about checkpoints. By "amount of data currently 
in need of a checkpoint", do you mean the number of dirty buffers in 
shared_buffers, or something else? I don't see how or why that should 
affect when you perform a checkpoint.



Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to 

Re: [HACKERS] Make targets of doc links used by phpPgAdmin static

2013-06-06 Thread Magnus Hagander
On Jun 6, 2013 4:14 AM, "Peter Eisentraut"  wrote:
>
> On Tue, 2013-06-04 at 22:27 -0500, Karl O. Pinc wrote:
> > On 06/04/2013 10:16:20 PM, Peter Eisentraut wrote:
> > > On Tue, 2013-05-07 at 23:18 -0400, Alvaro Herrera wrote:
> > > > Peter Eisentraut wrote:
> > > > > On Tue, 2013-05-07 at 00:32 -0500, Karl O. Pinc wrote:
> > > > > > Attached is a documentation patch against head which makes
> > > > > > static the targets of the on-line PG html documentation that
> > > > > > are referenced by the phpPgAdmin help system.
> > > > >
> > > > > done
> > > >
> > > > I wonder about backpatching this to 9.2 ?
> > >
> > > done
> >
> > Will this be in the next point release?  Or just when
> > will it go live?
>
> I don't know when it goes to the web site, but it will be in the next
> point release.

Website doc updates for back branches are loaded at the same time as the
release is pushed. The "developer docs" are loaded every few hours from
git, but release docs are only loaded on release day.

/Magnus


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Joshua D. Drake


On 6/5/2013 11:31 PM, Peter Geoghegan wrote:

On Wed, Jun 5, 2013 at 11:28 PM, Joshua D. Drake  wrote:

I have zero doubt that in your case it is true and desirable. I just don't
know that it is a positive solution to the problem as a whole. Your case is
rather limited to your environment, which is rather limited to the type of
user that your environment has. Which lends itself to the idea that this
should be a Heroku Postgres thing, not a .Org wide thing.

If you look through the -general archives, or on stack overflow you'll
find ample evidence that it is a problem that lots of people have.



Not to be unkind, but the problems of the uninformed certainly are not the 
problems of the informed. Or perhaps they are certainly the problems of 
the informed :P. I do read -general and I honestly don't see it much. I 
don't watch stackoverflow that much, but I am sure it probably does come 
up there sometimes, though I bet I can point once again to a lack of 
provisioning on their part.


This reminds me of the time someone from Heroku asked at PgEast, 
with a show of hands, how many people here don't back up their database to 
S3. Almost everyone in the audience raised their hands.


Again, I don't question your need, but just because it is hot right now 
doesn't mean it is healthy. I honestly do not see the requirement you are 
trying to represent as a need for the wider production community.


(in short, not a single one of my customers would benefit from it, and 
90% of them are running databases Heroku can't.)


That is not a slight, honestly. I think your service is cool. I am just 
being honest.


Sincerely,

JD


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-06 Thread Joshua D. Drake

On 6/5/2013 11:25 PM, Harold Giménez wrote:


Instead of "running out of disk space PANIC" we should just write
to an emergency location within PGDATA


This merely buys you some time, but with aggressive and sustained 
write throughput you are left in the same spot. Practically speaking 
it's the same situation as increasing the pg_xlog disk space.


Except that you likely can't increase pg_xlog space (easily). The point 
here is to have overflow, think swap space.


I agree it is better than PANIC, but read-only mode is definitely also 
a form of throttling; a much more abrupt and unfriendly one if I may add.




I would think read-only is less unfriendly than an all-out failure. 
Consider that, if done correctly, the database would move back into 
read-write mode once the problem was resolved.


JD




Re: [HACKERS] how to find out whether a view is updatable

2013-06-06 Thread Dean Rasheed
On 5 June 2013 08:59, Dean Rasheed  wrote:
> I'm still not happy with pg_view_is_updatable() et al. and the
> information_schema views. I accept that the information_schema views
> have to be the way they are because that's what's defined in the
> standard, but as it stands, the distinction between updatable and
> trigger-updatable makes it impossible in general to answer the simple
> question "does foo support UPDATEs?".
>
> I'm thinking what we really need is a single function with a slightly
> different signature, that can be used to support both the information
> schema views and psql's \d+ (and potentially other client apps).
> Perhaps something like:-
>
>   pg_relation_is_updatable(include_triggers boolean)
>   returns int
>

OK, here's what it looks like using this approach:


 FUNCTION pg_relation_is_updatable(reloid oid,
   include_triggers boolean)
 RETURNS integer


 FUNCTION pg_column_is_updatable(reloid oid,
 attnum integer,
 include_triggers boolean)
 RETURNS boolean


These replace pg_view_is_updatable() and pg_view_is_insertable(). I
think I definitely prefer this over the old API, because it gives much
greater flexibility.

The information schema views all pass include_triggers = false for
compatibility with the standard. The return value from
pg_relation_is_updatable() is now an integer bitmask reflecting
whether or not the relation is insertable, updatable and/or deletable.

psql and other clients can more usefully pass include_triggers = true
to determine whether a relation actually supports INSERT, UPDATE and
DELETE, including checks for INSTEAD OF triggers on the specified
relation or any underlying base relations.

I thought about having pg_relation_is_updatable() return text, like
the GRANT support functions, but I thought that it would make the
information schema views harder to write, using a single call to check
for updatable+deletable, whereas integer bit operations are easy.
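
For illustration, a backend-side caller could test the bitmask like this
(the bit values and the relation_is_updatable() helper name here are
hypothetical, just to show the idea; the actual values are whatever the
patch defines):

/* hypothetical bit assignments for the pg_relation_is_updatable() result */
#define REL_IS_INSERTABLE   (1 << 0)
#define REL_IS_UPDATABLE    (1 << 1)
#define REL_IS_DELETABLE    (1 << 2)

/* hypothetical backend-level helper that the SQL function would wrap */
extern int relation_is_updatable(Oid reloid, bool include_triggers);

/*
 * The information_schema notion of "updatable" (per the standard, both
 * UPDATE and DELETE must be supported) becomes one call plus a mask check.
 */
static bool
view_is_updatable_for_infoschema(Oid reloid)
{
    int     events = relation_is_updatable(reloid, false);

    return (events & (REL_IS_UPDATABLE | REL_IS_DELETABLE)) ==
           (REL_IS_UPDATABLE | REL_IS_DELETABLE);
}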

There is a backwards-incompatible change to the information schema,
reflected in the regression tests: if a view is updatable but not
deletable, the relevant rows in information_schema.columns now say
'YES' --- the columns are updatable, even though the relation as a
whole isn't.

I've initially defined matching FDW callback functions:


int
IsForeignRelUpdatable (Oid foreigntableid,
   bool include_triggers);


bool
IsForeignColUpdatable (Oid foreigntableid,
   int attnum,
   bool include_triggers);


but I'm now having second thoughts about whether we should bother
passing include_triggers to the FDW. If we regard the foreign table as
a black box, we only care about whether it is updatable, not *how*
that update is performed.

Regards,
Dean


pg_relation_is_updatable.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-06 Thread Amit Kapila
On Tuesday, May 28, 2013 6:54 PM Robert Haas wrote:
> >> Instead, I suggest modifying BgBufferSync, specifically this part
> right
> >> here:
> >>
> >> else if (buffer_state & BUF_REUSABLE)
> >> reusable_buffers++;
> >>
> >> What I would suggest is that if the BUF_REUSABLE flag is set here,
> use
> >> that as the trigger to do StrategyMoveBufferToFreeListEnd().
> >
> > I think at this point also we need to lock buffer header to check
> refcount
> > and usage_count before moving to freelist, or do you think it is not
> > required?
> 
> If BUF_REUSABLE is set, that means we just did exactly what you're
> saying.  Why do it twice?

Even though we just did it, we have released the buffer header lock, so
theoretically a backend could increase the count in the meantime; however,
that case is still protected by the check in StrategyGetBuffer(). As the
chance of that is very rare, doing this without the buffer header lock
should not cause any harm.
A modified patch to address this is attached with this mail.
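
To make the change concrete, the tail of BgBufferSync()'s cleaning loop
would end up looking roughly like this (a sketch only;
StrategyMoveBufferToFreeListEnd() and its exact signature come from the
patch, so treat them as assumptions, and buf_id stands for the buffer
SyncOneBuffer() just examined):

if (buffer_state & BUF_WRITTEN)
{
    reusable_buffers++;
    if (++num_written >= bgwriter_lru_maxpages)
    {
        BgWriterStats.m_maxwritten_clean++;
        break;
    }
}
else if (buffer_state & BUF_REUSABLE)
{
    reusable_buffers++;

    /*
     * SyncOneBuffer() just saw this buffer clean, unpinned and with zero
     * usage count (that is what BUF_REUSABLE means), so hand it to the
     * freelist without re-taking the buffer header lock.  If a backend
     * pins or bumps it in the meantime, StrategyGetBuffer() re-checks
     * before actually handing the buffer out.
     */
    StrategyMoveBufferToFreeListEnd(&BufferDescriptors[buf_id]);
}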

Performance Data
---

As far as I have noticed, the performance data for this patch depends on 3
factors:
1. Pre-loading of data into buffers, so that buffers holding pages have
some usage count before running pgbench.
   The reason is that this can make a difference to the performance of the
clock sweep.
2. Clearing of pages from the OS cache before running pgbench with each
patch; otherwise a run with or without the patch can access pages already
cached by previous runs, which causes variation in performance.
3. Scale factor and shared_buffers configuration

To avoid the above 3 factors affecting the test readings, I used the steps below:
1. Initialize the database with scale factor such that database size +
shared_buffers = RAM (shared_buffers = 1/4 of RAM).
   For example: 
   Example -1
if RAM = 128G, then initialize db with scale factor = 6700
and shared_buffers = 32GB.
Database size (98 GB) + shared_buffers (32GB) = 130GB (which
is approximately equal to the total RAM)
   Example -2 (this is based on your test m/c)
If RAM = 64GB, then initialize db with scale factor = 3400
and shared_buffers = 16GB.
2. reboot m/c
3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I had loaded 3 times, so that usage count of buffers will be approximately
3.
   Used file load_all_buffers.sql attached with this mail
4. run 3 times pgbench select-only case for 10 or 15 minutes without patch
5. reboot m/c
6. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I had loaded 3 times, so that usage count of buffers will be approximately
3.
   Used file load_all_buffers.sql attached with this mail
7. run 3 times pgbench select-only case for 10 or 15 minutes with patch

Using above steps, I had taken performance data on 2 different m/c's

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
synchronous_commit = OFF, shared_buffers = 32GB, AutoVacuum = off
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 15 mins

Below data is for average of 3 runs

                16C-16T    32C-32T    64C-64T
HEAD               4391       3971       3464
After Patch        6147       5093       3944

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm

Below data is for 1 run of half an hour on the same configuration

                16C-16T    32C-32T    64C-64T
HEAD               4377       3861       3295
After Patch        6542       4770       3504


Configuration Details
O/S - Suse-11
RAM - 24GB
Number of Cores - 8
Server Conf - checkpoint_segments = 256; checkpoint_timeout = 25 min,
synchronous_commit = OFF, shared_buffers = 5GB
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 10 mins

Below data is for the average of 3 runs of 10 minutes

                  8C-8T    16C-16T    32C-32T    64C-64T    128C-128T    256C-256T
HEAD              58837      56740      19390       5681         3191         2160
After Patch       59482      56936      25070       7655         4166         2704

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm


Below data is for 1 run of half an hour on the same configuration

                32C-32T
HEAD              17703
After Patch       20586

I have run these tests multiple times to ensure correctness. I think the
reason it didn't show a performance improvement in your runs last time is
that the way we each run pgbench is different. This time, I have detailed
the steps I used to collect the performance data.


With Regards,
Amit Kapila.


move_unused_buffers_to_freelist_v2.patch
Description: Binary d