Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Peter Geoghegan
On Wed, Jun 5, 2013 at 11:28 PM, Joshua D. Drake  wrote:
> I have zero doubt that in your case it is true and desirable. I just don't
> know that it is a positive solution to the problem as a whole. Your case is
> rather limited to your environment, which is rather limited to the type of
> user that your environment has. Which lends itself to the idea that this
> should be a Heroku Postgres thing, not a .Org wide thing.

If you look through the -general archives or on Stack Overflow, you'll
find ample evidence that this is a problem lots of people have.


-- 
Peter Geoghegan




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 6/5/2013 11:09 PM, Daniel Farina wrote:

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place we
immediately bounce into read-only mode and log even more loudly that the SA
should be fired. I would think read-only mode is safer and more polite than
an PANIC crash.

I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to assume
that a user will pay attention to disk space. Really?
Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on.  You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals.  In
my case, none of these precepts are true or desirable.
I have zero doubt that in your case it is true and desirable. I just 
don't know that it is a positive solution to the problem as a whole. 
Your case is rather limited to your environment, which is rather limited 
to the type of user that your environment has. Which lends itself to the 
idea that this should be a Heroku Postgres thing, not a .Org wide thing.


Sincerely,

JD





Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Harold Giménez
Hi,

On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake wrote:

> Instead of "running out of disk space PANIC" we should just write to an
> emergency location within PGDATA


This merely buys you some time, but with aggressive and sustained write
throughput you are left on the same spot. Practically speaking it's the
same situation as increasing the pg_xlog disk space.


> and log very loudly that the SA isn't paying attention. Perhaps if that
> area starts to get to an unhappy place we immediately bounce into read-only
> mode and log even more loudly that the SA should be fired. I would think
> read-only mode is safer and more polite than a PANIC crash.
>

I agree it is better than PANIC, but read-only mode is definitely also a
form of throttling; a much more abrupt and unfriendly one, if I may add.

Regards,

-Harold


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Daniel Farina
On Wed, Jun 5, 2013 at 11:05 PM, Joshua D. Drake  wrote:
>
> On 6/5/2013 10:54 PM, Peter Geoghegan wrote:
>>
>> On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake 
>> wrote:
>>>
>>> I just wonder if we are looking in the right place (outside of some
>>> obvious
>>> badness like the PANIC running out of disk space).
>>
>> So you don't think we should PANIC on running out of disk space? If
>> you don't think we should do that, and you don't think that WAL
>> writing should be throttled, what's the alternative?
>
>
> As I mentioned in my previous email:
>
>
> Instead of "running out of disk space PANIC" we should just write to an
> emergency location within PGDATA and log very loudly that the SA isn't
> paying attention. Perhaps if that area starts to get to an unhappy place we
> immediately bounce into read-only mode and log even more loudly that the SA
> should be fired. I would think read-only mode is safer and more polite than
> a PANIC crash.
>
> I do not think we should worry about filling up the hard disk except to
> protect against data loss in the event. It is not user unfriendly to assume
> that a user will pay attention to disk space. Really?

Okay, then I will say it's user unfriendly, especially for a transient
use of space, and particularly if there's no knob for said SA to
attenuate what's going on.  You appear to assume the SA can lean on
the application to knock off whatever is going on or provision more
disk in time, or that disk is reliable enough to meet one's goals.  In
my case, none of these precepts are true or desirable.




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 6/5/2013 10:54 PM, Peter Geoghegan wrote:

> On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake wrote:
>
>> I just wonder if we are looking in the right place (outside of some obvious
>> badness like the PANIC running out of disk space).
>
> So you don't think we should PANIC on running out of disk space? If
> you don't think we should do that, and you don't think that WAL
> writing should be throttled, what's the alternative?


As I mentioned in my previous email:

Instead of "running out of disk space PANIC" we should just write to an 
emergency location within PGDATA and log very loudly that the SA isn't 
paying attention. Perhaps if that area starts to get to an unhappy place 
we immediately bounce into read-only mode and log even more loudly that 
the SA should be fired. I would think read-only mode is safer and more 
polite than a PANIC crash.


I do not think we should worry about filling up the hard disk except to 
protect against data loss in the event. It is not user unfriendly to 
assume that a user will pay attention to disk space. Really?




JD




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Daniel Farina
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake  wrote:
>
> On 6/5/2013 10:07 PM, Daniel Farina wrote:
>>
>>
>> If I told you there were some of us who would prefer to attenuate the
>> rate that things get written rather than cancel or delay archiving for
>> a long period of time, would that explain the framing of the problem?
>
>
> I understand that based on what you said above.
>
>
>> Or, is it that you understand that's what I want, but find the notion
> of such an operation hard to relate to?
>
>
> I think this is where I am at. To me, you don't attenuate the rate that
> things get written, you fix the problem in needing to do so. The problem is
> one of provisioning. Please note that I am not suggesting there aren't
> improvements to be made, there absolutely are. I just wonder if we are
> looking in the right place (outside of some obvious badness like the PANIC
> running out of disk space).

Okay, well, I don't see the fact that the block device is faster than
the archive command as a "problem," it's just an artifact of the
ratios of performance of stuff in the system.  If one views archives
as a must-have, there's not much other choice than to attenuate.

An alternative is to buy a slower block device.  That'd accomplish the
same effect, but it's a pretty bizarre and heavy-handed way to go about
it, and not easily adaptive to, say, making the archive command faster
(in my case, I well could, with some work).

So, I don't think it's all that unnatural to allow for the flexibility
of a neat attenuation technique, and it's pretty important too.
Methinks.  Disagree?

Final thought: I can't really tell users to knock off what they're
doing on a large scale.  It's better to not provide abrupt changes in
service (like crashing or turning off everything for extended periods
while the archive uploads).  So, smoothness and predictability are
desirable.




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Peter Geoghegan
On Wed, Jun 5, 2013 at 10:27 PM, Joshua D. Drake  wrote:
> I just wonder if we are looking in the right place (outside of some obvious
> badness like the PANIC running out of disk space).

So you don't think we should PANIC on running out of disk space? If
you don't think we should do that, and you don't think that WAL
writing should be throttled, what's the alternative?


-- 
Peter Geoghegan




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 6/5/2013 10:07 PM, Daniel Farina wrote:


> If I told you there were some of us who would prefer to attenuate the
> rate that things get written rather than cancel or delay archiving for
> a long period of time, would that explain the framing of the problem?

I understand that based on what you said above.

> Or, is it that you understand that's what I want, but find the notion
> of such an operation hard to relate to?

I think this is where I am at. To me, you don't attenuate the rate that
things get written, you fix the problem in needing to do so. The problem
is one of provisioning. Please note that I am not suggesting there
aren't improvements to be made, there absolutely are. I just wonder if
we are looking in the right place (outside of some obvious badness like
the PANIC running out of disk space).

> Or, am I misunderstanding your confusion?

To be honest, part of my confusion was just trying to parse all the bits
that people were talking about into a cohesive "this is the actual
problem".


Sincerely,

JD




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Daniel Farina
On Wed, Jun 5, 2013 at 8:23 PM, Joshua D. Drake  wrote:
>> It's not as insane as introducing an archiving gap, PANICing and
>> crashing, or running this hunk o junk I wrote
>> http://github.com/fdr/ratchet
>>
>
> Well certainly we shouldn't PANIC and crash but that is a simple fix. You
> have a backup write location and start logging really loudly that you are
> using it.

If I told you there were some of us who would prefer to attenuate the
rate that things get written rather than cancel or delay archiving for
a long period of time, would that explain the framing of the problem?

Or, is it that you understand that's what I want, but find the notion
of such an operation hard to relate to?

Or, am I misunderstanding your confusion?

Or, none of the above?




Re: [HACKERS] Possible bug in cascaded standby

2013-06-05 Thread Pavan Deolasee
On Wed, Jun 5, 2013 at 10:57 PM, Fujii Masao  wrote:

> I was not able to reproduce the problem. Maybe this is a timing problem.

Hmm. I can't reproduce this on my Ubuntu box either. I will retry on the
Mac machine in the evening. Surprisingly, I could reproduce it very easily
on that box. What I'd observed is that the walreceiver on the cascaded
standby is stuck at walreceiver.c:447, which in turn is waiting indefinitely
at libpqwalreceiver.c:501, i.e. the PQgetResult() call.

I'll retry and report back if I see the problem on the offending platform.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


[HACKERS] pgbench: introduce a new automatic variable 'client_number'

2013-06-05 Thread Gurjeet Singh
Please find attached a patch for pgbench that introduces a new
auto-variable 'client_number'. Following in the footsteps of the 'scale'
auto-variable, it is not declared if the user has already specified the
variable using the -D switch.

Since 'clientid' is a very common name a user might use for their own
script's variables, I chose to call this auto-variable 'client_number',
just to avoid conflicts.

This variable can come in handy when you want to use a different expression
in a query depending on which client is executing it. An example custom
transaction is attached, where the UPDATE statement from any given client
always updates the same logical row.
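
For illustration, the attached test_update.sql is along these lines (a
simplified sketch, not the attachment verbatim; table and variable names
other than client_number are just the standard pgbench ones):

-- each client derives its target aid from its own client_number,
-- so a given client always updates the same logical row
\set aid :client_number + 1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
END;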

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


pgbench_add_cleint_number_variable.patch
Description: Binary data


test_update.sql
Description: Binary data



Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 06/05/2013 06:23 PM, Daniel Farina wrote:


> On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake wrote:
>
>> I didn't see that proposal, link? Because the idea of slowing down
>> wal-writing sounds insane.
>
> It's not as insane as introducing an archiving gap, PANICing and
> crashing, or running this hunk o junk I wrote
> http://github.com/fdr/ratchet



Well certainly we shouldn't PANIC and crash but that is a simple fix. 
You have a backup write location and start logging really loudly that 
you are using it.


Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 06/05/2013 05:37 PM, Robert Haas wrote:


> - If it looks like we're going to exceed limit #3 before the
> checkpoint completes, we start exerting back-pressure on writers by
> making them wait every time they write WAL, probably in proportion to
> the number of bytes written.  We keep ratcheting up the wait until
> we've slowed down writers enough that the checkpoint will finish within
> limit #3.  As we reach limit #3, the wait goes to infinity; only
> read-only operations can proceed until the checkpoint finishes.


Alright, perhaps I am dense. I have read both this thread and the other
one on better handling of the archive command
(http://www.postgresql.org/message-id/cam3swzqcynxvpaskr-pxm8deqh7_qevw7uqbhpcsg1fpsxk...@mail.gmail.com).
I recognize there are brighter minds than mine on this thread, but I
honestly just don't get it.


1. WAL writes are already fast. They are the fastest writes we have
because they are sequential.


2. We don't want them to be slow. We want data written to disk as 
quickly as possible without adversely affecting production. That's the 
point.


3. The spread checkpoints have always confused me. If anything, we want a
checkpoint to be fast and short because:


4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to go all but
unnoticed.


Now perhaps my customers' workloads are different, but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set it even further out.


2. Bgwriter is set to be aggressive but not obtrusive, usually adjusted
based on the actual amount of IO bandwidth it may consume per second,
given the customer's IO constraints; see the example settings after this
list. (Note: I know that wal_writer comes into play here, but I honestly
don't remember where, and am reading up on it to refresh my memory.)


3. The biggest issue we see with checkpoint segments is not running out
of space (really, how many checkpoint segments is 10GB?). It is
wal_keep_segments. If we don't want to fill up the pg_xlog directory,
put the WAL logs that are kept for keep_segments elsewhere.
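
For illustration only, the example settings mentioned in point 2 look
something like this (values are invented here; in practice they are tuned
per customer's IO budget):

# checkpoint spread out as far as reasonable
checkpoint_timeout = 30min
# aggressive but bounded bgwriter
bgwriter_delay = 50ms             # wake up often
bgwriter_lru_maxpages = 500       # cap pages written per round
bgwriter_lru_multiplier = 3.0     # stay ahead of expected demand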


Other oddities:

Yes, checkpoint_segments is awkward. We shouldn't have to set it at all;
it should be gone. Basically, we start with some amount X, perhaps set at
initdb time. That amount changes dynamically based on the amount of data
being written. In order to avoid recycling and creation penalties, we
always keep X+N segments around, where N is enough to keep up with new
data.


Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter, we should be able to go more or less indefinitely
without worrying about checkpoints (with a few exceptions such as
pg_start_backup()). Perhaps a setting that causes a checkpoint to happen
based on some non-artificial threshold (unlike a timeout), such as the
amount of data currently in need of a checkpoint?


Heikki said, "I propose that we do something similar, but not exactly 
the same. Let's have a setting, max_wal_size, to control the max. disk 
space reserved for WAL. Once that's reached (or you get close enough, so 
that there are still some segments left to consume while the checkpoint 
runs), a checkpoint is triggered.


In this proposal, the number of segments preallocated is controlled 
separately from max_wal_size, so that you can set max_wal_size high, 
without actually consuming that much space in normal operation. It's 
just a backstop, to avoid completely filling the disk, if there's a 
sudden burst of activity. The number of segments preallocated is 
auto-tuned, based on the number of segments used in previous checkpoint 
cycles."


This makes sense, except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that, without the need
for another GUC? Perhaps at any given point we reserve 10% of available
space (rounded to 16MB segments) for pg_xlog; you hit it, we checkpoint
and LOG EXACTLY WHY.


Instead of "running out of disk space PANIC" we should just write to an 
emergency location within PGDATA and log very loudly that the SA isn't 
paying attention. Perhaps if that area starts to get to an unhappy place 
we immediately bounce into read-only mode and log even more loudly that 
the SA should be fired. I would think read-only mode is safer and more 
polite than a PANIC crash.


I do not think we should worry about filling up the hard disk except to 
protect against data loss in the event. It is not user unfriendly to 
assume that a user will pay attention to disk space. Really?


Open to people telling me I am off in left field. Sorry if it is noise.

Sincerely,

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats



Re: [HACKERS] Make targets of doc links used by phpPgAdmin static

2013-06-05 Thread Peter Eisentraut
On Tue, 2013-06-04 at 22:27 -0500, Karl O. Pinc wrote:
> On 06/04/2013 10:16:20 PM, Peter Eisentraut wrote:
> > On Tue, 2013-05-07 at 23:18 -0400, Alvaro Herrera wrote:
> > > Peter Eisentraut wrote:
> > > > On Tue, 2013-05-07 at 00:32 -0500, Karl O. Pinc wrote:
> > > > > Attached is a documentation patch against head which makes
> > > > > static the targets of the on-line PG html documentation that
> > > > > are referenced by the phpPgAdmin help system.
> > > > 
> > > > done
> > > 
> > > I wonder about backpatching this to 9.2 ?
> > 
> > done
> 
> Will this be in the next point release?  Or just when
> will it go live?

I don't know when it goes to the web site, but it will be in the next
point release.

> This is not a huge problem but it does break some
> existing links into the 9.2 PG docs.

Well, if it doesn't help you, I can back it out again.





[HACKERS] Regarding GIN Fast Update Technique

2013-06-05 Thread Amit Langote
Hello,

At what point are the entries in the pending list moved to the main
GIN data structure? From the documentation, I read that overflowing
work_mem and vacuum are two such triggers; what about when the concerned
backend is about to exit and autovacuum has not yet kicked in?
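
For context, I'm asking about the pending list that is kept when fast
update is enabled, which it is by default; e.g. (hypothetical table, just
to illustrate the storage parameter):

-- fastupdate is on by default; turning it off makes inserts go to the
-- main GIN structure directly, trading insert speed for no pending list
CREATE INDEX docs_body_gin ON docs USING gin (to_tsvector('english', body))
  WITH (fastupdate = off);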


--
Amit Langote




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Daniel Farina
On Wed, Jun 5, 2013 at 6:00 PM, Joshua D. Drake  wrote:
> I didn't see that proposal, link? Because the idea of slowing down
> wal-writing sounds insane.

It's not as insane as introducing an archiving gap, PANICing and
crashing, or running this hunk o junk I wrote
http://github.com/fdr/ratchet




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Michael Paquier
On Thu, Jun 6, 2013 at 10:00 AM, Joshua D. Drake wrote:

>
> On 06/05/2013 05:37 PM, Robert Haas wrote:
>
>>
>> On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao 
>> wrote:
>>
>>> OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
>>> error and long down time. Of course, in this case, once max_wal_size is
>>> reached, we cannot complete any query writing WAL until the checkpoint
>>> has completed and removed old WAL files. During that time, the database
>>> service looks like it is down from a client's perspective, but its down
>>> time is shorter than in the PANIC error case. So I'm thinking that some
>>> users might want a hard limit on pg_xlog size.
>>>
>>
>> I wonder if we could tie this in with the recent proposal from the
>> Heroku guys to have a way to slow down WAL writing.  Maybe we have
>> several limits:
>>
>
> I didn't see that proposal, link? Because the idea of slowing down
> wal-writing sounds insane.
>
Here it is:
http://www.postgresql.org/message-id/cam3swzqcynxvpaskr-pxm8deqh7_qevw7uqbhpcsg1fpsxk...@mail.gmail.com
-- 
Michael


Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Joshua D. Drake


On 06/05/2013 05:37 PM, Robert Haas wrote:


> On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao wrote:
>
>> OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
>> error and long down time. Of course, in this case, once max_wal_size is
>> reached, we cannot complete any query writing WAL until the checkpoint
>> has completed and removed old WAL files. During that time, the database
>> service looks like it is down from a client's perspective, but its down
>> time is shorter than in the PANIC error case. So I'm thinking that some
>> users might want a hard limit on pg_xlog size.
>
> I wonder if we could tie this in with the recent proposal from the
> Heroku guys to have a way to slow down WAL writing.  Maybe we have
> several limits:

I didn't see that proposal, link? Because the idea of slowing down
wal-writing sounds insane.


JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats




Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Robert Haas
On Wed, Jun 5, 2013 at 6:56 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Now, I did find a couple that I thought should probably stick with
>> SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just
>> gathering statistical information, so there's no harm in having the
>> snapshot change part-way through the scan, and if the scan is long,
>> the user might actually regard the results under SnapshotNow as more
>> accurate.  Whether that's the case or not, holding back xmin for those
>> kinds of scans does not seem wise.
>
> FWIW, I think if we're going to ditch SnapshotNow we should ditch
> SnapshotNow, full stop, even removing the tqual.c routines for it.
> Then we can require that *any* reference to SnapshotNow is replaced by
> an MVCC reference prior to execution, and throw an error if we actually
> try to test a tuple with that snapshot.  If we don't do it like that
> I think we'll have errors of omission all over the place (I had really
> no confidence in your original patch because of that worry).  The fact
> that there are a couple of contrib modules for which there might be an
> arguable advantage in doing it the old way isn't sufficient reason to
> expose ourselves to bugs like that.  If they really want that sort of
> uncertain semantics they could use SnapshotDirty, no?

I had the same thought, initially.  I went through the exercise of
doing a grep for SnapshotNow and trying to eliminate as many
references as possible, but there were a few that I couldn't convince
myself to rip out.  However, if you'd like to apply the patch and grep
for SnapshotNow and suggest what to do about the remaining cases (or
hack the patch up yourself) I think that would be great.  I'd love to
see it completely gone if we can see our way clear to that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Robert Haas
On Wed, Jun 5, 2013 at 3:24 PM, Fujii Masao  wrote:
> OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
> error and long down time. Of course, in this case, once max_wal_size is
> reached, we cannot complete any query writing WAL until the checkpoint
> has completed and removed old WAL files. During that time, the database
> service looks like it is down from a client's perspective, but its down
> time is shorter than in the PANIC error case. So I'm thinking that some
> users might want a hard limit on pg_xlog size.

I wonder if we could tie this in with the recent proposal from the
Heroku guys to have a way to slow down WAL writing.  Maybe we have
several limits:

- When limit #1 is passed (or checkpoint_timeout elapses), we start a
spread checkpoint.

- If it looks like we're going to exceed limit #2 before the
checkpoint completes, we attempt to perform the checkpoint more
quickly, by reducing the delay between buffer writes.  If we actually
exceed limit #2, we try to complete the checkpoint as fast as
possible.

- If it looks like we're going to exceed limit #3 before the
checkpoint completes, we start exerting back-pressure on writers by
making them wait every time they write WAL, probably in proportion to
the number of bytes written.  We keep ratcheting up the wait until
we've slowed down writers enough that the checkpoint will finish within
limit #3.  As
we reach limit #3, the wait goes to infinity; only read-only
operations can proceed until the checkpoint finishes.
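
To make that concrete, here's a rough sketch in standalone C of the kind
of delay curve I have in mind between limits #2 and #3 (all names invented
for illustration; this is not actual backend code):

/* Delay per WAL write as a function of how full the WAL budget is:
 * no delay below limit #2; between limits #2 and #3 the delay grows
 * with both the fill fraction and the bytes written, heading to
 * infinity as limit #3 is reached. */
#include <stdint.h>

#define WAIT_FOREVER (-1L)   /* caller blocks until the checkpoint ends */

long
wal_write_delay_usec(uint64_t wal_bytes_used,
                     uint64_t limit2, uint64_t limit3,
                     int bytes_written)
{
    double fill;

    if (wal_bytes_used <= limit2)
        return 0;            /* checkpoint hurry-up handles this range */
    if (wal_bytes_used >= limit3)
        return WAIT_FOREVER;

    /* fraction of the way from limit #2 to limit #3, in (0, 1) */
    fill = (double) (wal_bytes_used - limit2) / (double) (limit3 - limit2);

    /* proportional to bytes written; the 1/(1 - fill) factor ratchets
     * the wait up without bound as we approach limit #3 */
    return (long) (bytes_written * fill / (1.0 - fill));
}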

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] About large objects asynchronous and non-blocking support

2013-06-05 Thread Tatsuo Ishii
> Hi.
> 
> At the moment libpq doesn't seem to support asynchronous and
> non-blocking support for large objects, in the style of
> PQsendQuery/PQgetResult. This makes large objects hardly suited for
> single-threaded programs based on some variant of select().
> 
> I would like to know whether this is a deliberate decision or it is
> considered a bug, and, in case, whether it is scheduled to be fixed.

Certainly not a bug, since the docs clearly state that PQsendQuery can
only be used as a substitute for PQexec (see the "Asynchronous Command
Processing" section for more details). The large object API is
completely different from PQexec and its friends, so it cannot be used
with PQsendQuery.

In more detail, PQexec and PQsendQuery are designed to handle only the
"Q" message of the PostgreSQL frontend/backend protocol, while to
access large objects you need to handle the "V" message.

> Though I cannot guarantee anything, I may be interested into working out
> a patch, if no one is already doing the same (of course I understand
> that this patch wouldn't be for 9.3, which is already in its late
> release cycle).
> 
> Do you think this may be of interest?

Yes, I understand your pain, and I myself think we need new APIs for
large objects. Probably that would not be terribly hard. One idea
would be inventing an asynchronous version of PQfn and letting
lo_read/lo_write use the new API.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Tom Lane
Robert Haas  writes:
> Now, I did find a couple that I thought should probably stick with
> SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just
> gathering statistical information, so there's no harm in having the
> snapshot change part-way through the scan, and if the scan is long,
> the user might actually regard the results under SnapshotNow as more
> accurate.  Whether that's the case or not, holding back xmin for those
> kinds of scans does not seem wise.

FWIW, I think if we're going to ditch SnapshotNow we should ditch
SnapshotNow, full stop, even removing the tqual.c routines for it.
Then we can require that *any* reference to SnapshotNow is replaced by
an MVCC reference prior to execution, and throw an error if we actually
try to test a tuple with that snapshot.  If we don't do it like that
I think we'll have errors of omission all over the place (I had really
no confidence in your original patch because of that worry).  The fact
that there are a couple of contrib modules for which there might be an
arguable advantage in doing it the old way isn't sufficient reason to
expose ourselves to bugs like that.  If they really want that sort of
uncertain semantics they could use SnapshotDirty, no?

regards, tom lane




Re: [HACKERS] About large objects asynchronous and non-blocking support

2013-06-05 Thread Dmitriy Igrishin
2013/6/5 Giovanni Mascellani 

> Hi.
>
> At the moment libpq doesn't seem to support asynchronous and
> non-blocking support for large objects, in the style of
> PQsendQuery/PQgetResult. This makes large objects hardly suited for
> single-threaded programs based on some variant of select().
>
According to http://www.postgresql.org/docs/9.2/static/lo-funcs.html,
"There are server-side functions callable from SQL that correspond to
each of the client-side functions". Hence, you can call these functions
by using the asynchronous API.
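
A minimal sketch of that approach in C (error handling trimmed; assumes
the connection is inside an open transaction, since server-side large
object descriptors only live until commit):

/* Read a chunk of a large object without blocking, by calling the
 * server-side lo_open()/loread() functions through the asynchronous
 * query API instead of the blocking client-side lo_* calls. */
#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

static PGresult *
wait_result(PGconn *conn)
{
    /* drive the connection with select() until a result is ready;
     * a real program would fold this into its main event loop */
    while (PQisBusy(conn))
    {
        fd_set  fds;
        int     sock = PQsocket(conn);

        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        select(sock + 1, &fds, NULL, NULL, NULL);
        PQconsumeInput(conn);
    }
    return PQgetResult(conn);
}

void
read_lo_chunk(PGconn *conn, const char *lobj_oid)
{
    const char *params[1] = { lobj_oid };
    PGresult   *res;

    /* 262144 = INV_READ; loread() returns up to 64kB as bytea */
    PQsendQueryParams(conn,
                      "SELECT loread(lo_open($1::oid, 262144), 65536)",
                      1, NULL, params, NULL, NULL, 1 /* binary result */);
    res = wait_result(conn);
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("got %d bytes\n", PQgetlength(res, 0, 0));
    PQclear(res);
    while ((res = PQgetResult(conn)) != NULL)   /* drain the queue */
        PQclear(res);
}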

-- 
// Dmitriy.


Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Robert Haas
On Wed, Jun 5, 2013 at 10:28 AM, Greg Stark  wrote:
> On Wed, May 22, 2013 at 3:18 AM, Robert Haas  wrote:
>> We've had a number of discussions about the evils of SnapshotNow.  As
>> far as I can tell, nobody likes it and everybody wants it gone, but
>> there is concern about the performance impact.
>
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.

With all respect, I think this is just plain wrong.  SnapshotNow is
just like an up-to-date MVCC snapshot.  The only difference is that an
MVCC snapshot, once established, stays fixed for the lifetime of the
scan.  On the other hand, the SnapshotNow view of the world changes
the instant another transaction commits, meaning that scans can see
multiple versions of a row, or no versions of a row, where any MVCC
scan would have seen just one.  There are very few places that want
that behavior.

Now, I did find a couple that I thought should probably stick with
SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just
gathering statistical information, so there's no harm in having the
snapshot change part-way through the scan, and if the scan is long,
the user might actually regard the results under SnapshotNow as more
accurate.  Whether that's the case or not, holding back xmin for those
kinds of scans does not seem wise.

But in most other parts of the code, the changes-in-mid-scan behavior
of SnapshotNow is a huge liability.  The fact that it is fully
up-to-date *as of the time the scan starts* is critical for
correctness.  But the fact that it can then change during the scan is
in almost every case something that we do not want.  The patch
preserves the first property while ditching the second one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Josh Berkus
Heikki,

> We shouldn't impose that calculation on the user. It
> should be possible to just specify "checkpoint_segments=512MB", and the
> system would initiate checkpoints so that the total size of WAL in
> pg_xlog stays below 512MB.

Agreed.

> For limiting the time required to recover after crash,
> checkpoint_segments is awkward because it's difficult to calculate how
> long recovery will take, given checkpoint_segments=X. A bulk load can
> use up segments really fast, and recovery will be fast, while segments
> full of random deletions can need a lot of random I/O to replay, and
> take a long time. IMO checkpoint_timeout is a much better way to control
> that, although it's not perfect either.

This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).

> A third point is that even if you have 10 GB of disk space reserved for
> WAL, you don't want to actually consume all that 10 GB, if it's not
> required to run the database smoothly.

Agreed.

> I propose that we do something similar, but not exactly the same. Let's
> have a setting, max_wal_size, to control the max. disk space reserved
> for WAL. Once that's reached (or you get close enough, so that there are
> still some segments left to consume while the checkpoint runs), a
> checkpoint is triggered.

Refinement of the proposal:

1. max_wal_size is a hard limit.
2. checkpointing targets wal_keep_segments plus 50% of
   (max_wal_size - wal_keep_segments), to avoid lockup if the
   checkpoint takes longer than expected.
3. wal_keep_segments is taken out of max_wal_size.
   a. it automatically defaults to 20% of max_wal_size if
      max_wal_senders > 0
   b. for that reason, we don't allow it to be larger
      than 80% of max_wal_size
4. preallocated WAL isn't allowed to shrink smaller than
   wal_keep_segments + (max_wal_size * 0.1).

This would mean that I could set my server to:

max_wal_size = 2GB

and ...

* by default, 26 segments (416MB) would be kept for wal_keep_segments.
* checkpoint target would be 77 segments (1.2GB)
* preallocated WAL will always be at least 39 segments (624MB),
including keep_segments.

now, if I had a fairly low transaction database, but wanted to make sure
I could recover from an 8-hour break in replication, I might bump up
wal_keep_segments to 1GB.  In that case:

* 64 segments (1GB) would be kept.
* checkpoints would target 96 segments (1.5GB)
* preallocated WAL would always be at least 77 segments (1.2GB)

> Hmm, haven't thought about that. I think a better unit to set
> wal_keep_segments in would also be MB, not segments.

Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not any amount of MB.  I don't want to complicate
this proposal by trying to deliver that, though.

> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high,
> without actually consuming that much space in normal operation. It's
> just a backstop, to avoid completely filling the disk, if there's a
> sudden burst of activity. The number of segments preallocated is
> auto-tuned, based on the number of segments used in previous checkpoint
> cycles.

"based on"; can you give me your algorithmic thinking here?  I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.
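
For example, something along these lines (just a sketch of what I mean,
with invented names):

/* Smooth the per-cycle segment usage, but never let the preallocation
 * target drop below half of a slowly decaying peak, so a burst doesn't
 * catch the system with too few recycled segments on hand. */
static double avg_segs = 0.0;   /* smoothed segments per checkpoint cycle */
static double peak_segs = 0.0;  /* decaying peak usage */

int
prealloc_target(int used_this_cycle)
{
    avg_segs = 0.75 * avg_segs + 0.25 * used_this_cycle;

    peak_segs *= 0.9;                    /* old bursts slowly fade */
    if (used_this_cycle > peak_segs)
        peak_segs = used_this_cycle;

    if (avg_segs < peak_segs / 2.0)
        return (int) (peak_segs / 2.0);  /* protect against bursts */
    return (int) (avg_segs + 0.5);
}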

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




[HACKERS] About large objects asynchronous and non-blocking support

2013-06-05 Thread Giovanni Mascellani
Hi.

At the moment libpq doesn't seem to support asynchronous and
non-blocking support for large objects, in the style of
PQsendQuery/PQgetResult. This makes large objects hardly suited for
single-threaded programs based on some variant of select().

I would like to know whether this is a deliberate decision or it is
considered a bug, and, in case, whether it is scheduled to be fixed.

Though I cannot guarantee anything, I may be interested in working out
a patch, if no one is already doing the same (of course, I understand
that this patch wouldn't be for 9.3, which is already late in its
release cycle).

Do you think this may be of interest?

Thanks, Giovanni.
-- 
Giovanni Mascellani 
Pisa, Italy

Web: http://poisson.phc.unipi.it/~mascellani
Jabber: g.mascell...@jabber.org / giova...@elabor.homelinux.org





Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Fujii Masao
On Thu, Jun 6, 2013 at 3:35 AM, Heikki Linnakangas
 wrote:
> On 05.06.2013 21:16, Fujii Masao wrote:
>>
>> On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
>>   wrote:
>>>
>>> I propose that we do something similar, but not exactly the same. Let's
>>> have
>>>
>>> a setting, max_wal_size, to control the max. disk space reserved for WAL.
>>> Once that's reached (or you get close enough, so that there are still
>>> some
>>> segments left to consume while the checkpoint runs), a checkpoint is
>>> triggered.
>>
>>
>> What if max_wal_size is reached while the checkpoint is running? Should we
>> change the checkpoint from spread mode to fast mode?
>
>
> The checkpoint spreading code already tracks if the checkpoint is "on
> schedule", and it takes into account both checkpoint_timeout and
> checkpoint_segments. I.e. if you consume segments faster than expected, the
> checkpoint will speed up as well. Once checkpoint_segments is reached, the
> checkpoint will complete ASAP, with no delays to spread it out.

Yep, right. One problem is that this mechanism doesn't work on the standby.
So, are you planning to 'fix' that so that max_wal_size works well even on
the standby? Or just leave it as it is? According to the remaining part of
your email, you seem to have chosen the latter, though.

>
> This would still work the same with max_wal_size. A new checkpoint would be
> started well before reaching max_wal_size, so that it has enough time to
> complete. If the checkpoint "falls behind", it will hurry up until it's back
> on schedule. If max_wal_size is reached anyway, it will complete ASAP.
>
>
>> Or, if max_wal_size is a hard limit, should we keep the allocation of new
>> WAL files waiting until the checkpoint has finished and removed some old
>> WAL files?
>
>
> I was not thinking of making it a hard limit. It would be just like
> checkpoint_segments from that point of view - if a checkpoint takes a long
> time, max_wal_size might still be exceeded.

So, if the archive command keeps failing or is very slow (e.g., because
it uses a compression tool), max_wal_size can still be greatly exceeded.
Right?

I'm wondering if it's worth exposing an option specifying whether to use
max_wal_size as a hard limit or not. If it's not a hard limit, the disk
can fill up with WAL files and a PANIC can happen. In that case, in order
to restart the database service, we need to enlarge the disk or relocate
some WAL files to another disk, and then start up the server; normal
crash recovery needs to be done as well. This would lead to a lot of
service down time.

OTOH, if we use max_wal_size as a hard limit, we can avoid such a PANIC
error and long down time. Of course, in this case, once max_wal_size is
reached, we cannot complete any query writing WAL until the checkpoint
has completed and removed old WAL files. During that time, the database
service looks like it is down from a client's perspective, but its down
time is shorter than in the PANIC error case. So I'm thinking that some
users might want a hard limit on pg_xlog size.

>>> In this proposal, the number of segments preallocated is controlled
>>> separately from max_wal_size, so that you can set max_wal_size high,
>>> without
>>> actually consuming that much space in normal operation. It's just a
>>> backstop, to avoid completely filling the disk, if there's a sudden burst
>>> of
>>> activity. The number of segments preallocated is auto-tuned, based on the
>>> number of segments used in previous checkpoint cycles.
>>
>>
>> How is wal_keep_segments handled in your approach?
>
>
> Hmm, haven't thought about that. I think a better unit to set
> wal_keep_segments in would also be MB, not segments.

+1

Regards,

-- 
Fujii Masao




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Kevin Grittner
Heikki Linnakangas  wrote:

> I was not thinking of making it a hard limit. It would be just
> like checkpoint_segments from that point of view - if a
> checkpoint takes a long time, max_wal_size might still be
> exceeded.

Then I suggest we not use exactly that name.  I feel quite sure we
would get complaints from people if something labeled as "max" was
exceeded -- especially if they set that to the actual size of a
filesystem dedicated to WAL files.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Configurable location for extension .control files

2013-06-05 Thread Josh Berkus
Tom,

> Yeah, if the config option were to be superuser-only, the security issue
> would be ameliorated --- not removed entirely, IMO, but at least
> weakened.  However, this seems to me to be missing the point, which is
> that the extensions feature is designed to let the DBA have control over
> which extensions are potentially installable.  If we allow extension
> control files to be loaded from any random directory then we lose that.
> Part of the argument for not requiring superuser permissions to execute
> CREATE EXTENSION was based on that restriction, so we'd need to go back
> and rethink the permissions needed for CREATE EXTENSION.

I do see the utility in having the extension folder relocatable by
packagers; I could really use this for vagrant builds of PostgreSQL,
which I use for testing.  Right now I do a lot of file copying of .so
files.  In my case, though, I only need to change the whole extension
folder location, I don't need to have multiple locations, a dirpath, or
anything sophisticated.  That is, a super-user, cold-start only option
of "extension_path='/vagrant/extensions/'" would work for my case, and I
suspect most packaging cases as well.

This seems like it would work for Oliver's case.  And I don't see how
making the folder relocatable as an on-start option hurts our security
at all; we're simply doing something which the same user could do with
symlinks, only much more neatly.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Heikki Linnakangas

On 05.06.2013 21:16, Fujii Masao wrote:

> On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas wrote:
>
>> I propose that we do something similar, but not exactly the same. Let's
>> have a setting, max_wal_size, to control the max. disk space reserved
>> for WAL. Once that's reached (or you get close enough, so that there are
>> still some segments left to consume while the checkpoint runs), a
>> checkpoint is triggered.
>
> What if max_wal_size is reached while the checkpoint is running? Should we
> change the checkpoint from spread mode to fast mode?

The checkpoint spreading code already tracks if the checkpoint is "on
schedule", and it takes into account both checkpoint_timeout and
checkpoint_segments. I.e. if you consume segments faster than expected,
the checkpoint will speed up as well. Once checkpoint_segments is
reached, the checkpoint will complete ASAP, with no delays to spread it out.

This would still work the same with max_wal_size. A new checkpoint would
be started well before reaching max_wal_size, so that it has enough time
to complete. If the checkpoint "falls behind", it will hurry up until
it's back on schedule. If max_wal_size is reached anyway, it will
complete ASAP.

> Or, if max_wal_size is a hard limit, should we keep the allocation of new
> WAL files waiting until the checkpoint has finished and removed some old
> WAL files?

I was not thinking of making it a hard limit. It would be just like
checkpoint_segments from that point of view - if a checkpoint takes a
long time, max_wal_size might still be exceeded.

>> In this proposal, the number of segments preallocated is controlled
>> separately from max_wal_size, so that you can set max_wal_size high,
>> without actually consuming that much space in normal operation. It's
>> just a backstop, to avoid completely filling the disk, if there's a
>> sudden burst of activity. The number of segments preallocated is
>> auto-tuned, based on the number of segments used in previous checkpoint
>> cycles.
>
> How is wal_keep_segments handled in your approach?

Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments. Perhaps
max_wal_size should include WAL retained for wal_keep_segments, leaving
less room for checkpoints. I.e. when you set wal_keep_segments
higher, an xlog-based checkpoint would be triggered earlier, because the
old segments kept for replication would leave less room for new
segments. And setting wal_keep_segments higher than max_wal_size would
be an error.
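
In other words, something like this back-of-envelope calculation (invented
names, just to illustrate the interaction):

/* Segments pinned by wal_keep_segments eat into the max_wal_size
 * budget, so raising it moves the xlog-based checkpoint trigger
 * earlier. */
#include <stdint.h>

uint64_t
xlog_checkpoint_trigger(uint64_t max_wal_size_bytes,
                        uint64_t wal_keep_bytes,
                        uint64_t checkpoint_headroom)
{
    uint64_t budget = max_wal_size_bytes - wal_keep_bytes;

    /* leave room for the spread checkpoint to finish under the cap */
    return budget > checkpoint_headroom ? budget - checkpoint_headroom : 0;
}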


- Heikki




Re: [HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Fujii Masao
On Wed, Jun 5, 2013 at 9:16 PM, Heikki Linnakangas
 wrote:
> checkpoint_segments is awkward. From an admin's point of view, it controls
> two things:
>
> 1. it limits the amount of disk space needed for pg_xlog. (it's a soft
> limit, but still)
> 2. it limits the time required to recover after a crash.
>
> For limiting the disk space needed for pg_xlog, checkpoint_segments is
> awkward because it's defined in terms of 16MB segments between checkpoints.
> It takes a fair amount of arithmetic to calculate the disk space required to
> hold the specified number of segments. The manual gives the formula: (2 +
> checkpoint_completion_target) * checkpoint_segments + 1, which amounts to
> about 1GB per 20 segments as a rule of thumb. We shouldn't impose that
> calculation on the user. It should be possible to just specify
> "checkpoint_segments=512MB", and the system would initiate checkpoints so
> that the total size of WAL in pg_xlog stays below 512MB.
>
> For limiting the time required to recover after crash, checkpoint_segments
> is awkward because it's difficult to calculate how long recovery will take,
> given checkpoint_segments=X. A bulk load can use up segments really fast,
> and recovery will be fast, while segments full of random deletions can need
> a lot of random I/O to replay, and take a long time. IMO checkpoint_timeout
> is a much better way to control that, although it's not perfect either.
>
> A third point is that even if you have 10 GB of disk space reserved for WAL,
> you don't want to actually consume all that 10 GB, if it's not required to
> run the database smoothly. There are several reasons for that: backups based
> on a filesystem-level snapshot are larger than necessary, if there are a lot
> of preallocated WAL segments and in a virtualized or shared system, there
> might be other VMs or applications that could make use of the disk space. On
> the other hand, you don't want to run out of disk space while writing WAL -
> that can lead to a PANIC in the worst case.
>
>
> In VMware's vPostgres fork, we've hacked the way that works, so that there
> is a new setting, checkpoint_segments_max that can be set by the user, but
> checkpoint_segments is adjusted automatically, on the fly. The system counts
> how many segments were consumed during the last checkpoint cycle, and that
> becomes the checkpoint_segments setting for the next cycle. That means that
> in a system with a steady load, checkpoints are triggered by
> checkpoint_timeout, and the effective checkpoint_segments value converges at
> the exact number of segments needed for that. That's simple but very
> effective. It doesn't behave too well with bursty load, however; during
> quiet times, checkpoint_segments is dialed way down, and when the next burst
> comes along, you get several checkpoints in quick succession, until
> checkpoint_segments is dialed back up again.
>
>
> I propose that we do something similar, but not exactly the same. Let's have
> a setting, max_wal_size, to control the max. disk space reserved for WAL.
> Once that's reached (or you get close enough, so that there are still some
> segments left to consume while the checkpoint runs), a checkpoint is
> triggered.

What if max_wal_size is reached while the checkpoint is running? Should we
change the checkpoint from spread mode to fast mode? Or, if max_wal_size
is a hard limit, should we keep the allocation of new WAL files waiting
until the checkpoint has finished and removed some old WAL files?

> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high, without
> actually consuming that much space in normal operation. It's just a
> backstop, to avoid completely filling the disk, if there's a sudden burst of
> activity. The number of segments preallocated is auto-tuned, based on the
> number of segments used in previous checkpoint cycles.

How is wal_keep_segments handled in your approach?

> I'll write up a patch to do that, but before I do, does anyone disagree on
> those tuning principles?

No disagreement, at least from me. I like your idea.

Regards,

-- 
Fujii Masao




Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-06-05 Thread Dave Page
On Wed, Jun 5, 2013 at 6:10 PM, Josh Berkus  wrote:
>> I'm not a lawyer and I make no judgement on how solid a practice this
>> is, but VMware doesn't seem to be doing anything special here.
>> They can retain copyright ownership of their contributions as long as
>> they're happy releasing it under the Postgres copyright. Ideally they
>> would also be happy with a copyright notice that includes all of the
>> PGDG just to reduce the maintenance headache.
>
> Many other projects also take this approach: Linux Kernel, Drizzle, etc.
> There are some legal advantages, as well as disadvantages, in having the
> copyright rest with the original contributors.  Mostly, it prevents
> relicensing of the whole project.

No it doesn't - it just makes it a pain in the arse (I know, I've done it).

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Possible bug in cascaded standby

2013-06-05 Thread Fujii Masao
On Thu, Jun 6, 2013 at 1:03 AM, Pavan Deolasee  wrote:
> Hello,
>
> I am experimenting with the cascade standby and hit a problem which is
> reproducible with the current HEAD. I haven't tried other branches, but not
> sure if the test setup I am trying even works for older releases because of
> the timeline ID issue.
>
> Anyways, I set up a cascaded standby such that it streams from the first
> standby and then stopped the original master and promoted the first standby
> to be the new master. If I then try to smart shutdown the cascaded standby,
> it fails after waiting for the walreceiver to terminate. What's worse, the
> walsender on the first standby gets into an infinite loop consuming 100%
> CPU.
>
> I tried to investigate this a bit, but haven't made progress worth
> reporting. I can spend more time, but just wanted to make sure that I'm not
> trying something which is a known issue or limitation. BTW, this is on my
> Macbook Pro. Attached is the script that I used to set up the environment.
> You will need to modify it for your setup though.

I was not able to reproduce the problem. Maybe this is a timing problem.
Could you share the server log of each server at the time when the problem
happened? Just in case, I attached the server logs which I got when I ran
the script to reproduce the problem.

Regards,

-- 
Fujii Masao


master.log
Description: Binary data


sb1.log
Description: Binary data


sb2.log
Description: Binary data



Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Andres Freund
On 2013-06-05 11:35:58 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
> >> I thought there were many call sites that were specifically depending
> >> on seeing dirty reads to avoid race conditions with other backends --
> >> which probably just narrowed the race condition or created different
> >> ones.
> 
> > But SnapshotNow doesn't allow you to do actual dirty reads?
> 
> Yeah.  I believe the issue is that we can't simply do MVCC catalog reads
> with a snapshot taken at transaction start time or statement start time,
> as we would do if executing an MVCC scan for a user query.  Rather, the
> snapshot has to be recent enough to ensure we see the current definition
> of any table we've just acquired lock on, *even if that's newer than the
> snapshot prevailing for the user's purposes*.  Otherwise we might be
> using the wrong rowtype definition or failing to enforce a just-added
> constraint.

Oh, definitely. At least Robert's previous prototype tried to do that
(although I am not sure if it went far enough). And I'd be surprised if the
current one didn't.

> The last time we talked about this, we were batting around ideas of
> keeping a "current snapshot for catalog purposes", which we'd update
> or at least invalidate anytime we acquired a new lock.  (In principle,
> if that isn't new enough, we have a race condition that we'd better fix
> by adding some more locking.)  Robert's results seem to say that that
> might be unnecessary optimization, and that it'd be sufficient to just
> take a new snap each time we need to do a catalog scan.  TBH I'm not
> sure I believe that; it seems to me that this approach is surely going
> to create a great deal more contention from concurrent GetSnapshotData
> calls.

I still have a hard time believing those results as well, but I think we
might have underestimated the effectiveness of the syscache during
workloads which are sufficiently concurrent to make locking in
GetSnapshotData() a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-06-05 Thread Josh Berkus

>> I'm not a lawyer and I make no judgement on how solid a practice this
>> is, but VMware doesn't seem to be doing anything special here.
>> They can retain copyright ownership of their contributions as long as
>> they're happy releasing it under the Postgres copyright. Ideally they
>> would also be happy with a copyright notice that includes all of the
>> PGDG just to reduce the maintenance headache.

Many other projects also take this approach: Linux Kernel, Drizzle, etc.
There are some legal advantages, as well as disadvantages, in having the
copyright rest with the original contributors.  Mostly, it prevents
relicensing of the whole project.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] erroneous restore into pg_catalog schema

2013-06-05 Thread Greg Stark
On Tue, May 14, 2013 at 11:59 AM, Stephen Frost  wrote:
> * Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
>> I'm not sure I agree with that view about pg_catalog. Sometimes we talk
>> about moving some parts of core in pre-installed extensions instead, and
>> if we do that we will want those extensions to install themselves into
>> pg_catalog.
>
> For my part, I'd still prefer to have those go into a different schema
> than into pg_catalog.  Perhaps that's overkill but I really do like the
> separation of system tables from extensions which can be added and
> removed.

This was discussed previously. It's a bad idea. It's very tempting but
it doesn't scale. Then every user needs to know every schema for every
extension they might want to use.

It's exactly equivalent to the very common pattern of sysadmins
installing things into /usr/local/apache, /usr/local/kde,
/usr/local/gnome, /usr/local/pgsql, etc. Then every user needs a
mile-long PATH, LD_LIBRARY_PATH, JAVACLASSPATH, etc. And every user
has a slightly different ordering and slightly different subset of
directories in their paths resulting in different behaviours and
errors for each user. A correctly integrated package will use standard
locations and then users can simply refer to the standard locations
and find what's been installed. It would be ok to have a schema for
all extensions separately from the core, but it can't be a schema for
each extension or else we might as well not have the extension
mechanism at all. Users would still need to "install" the extension by
editing their config to refer to it.
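
To make that concrete, a sketch (the schema names here are purely
hypothetical): with a schema per extension, every user ends up maintaining
something like

SET search_path = myapp, hstore_schema, postgis_schema, intarray_schema, public;

whereas a single agreed location for all extensions keeps it down to

SET search_path = myapp, extensions, public;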

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Possible bug in cascaded standby

2013-06-05 Thread Pavan Deolasee
Hello,

I am experimenting with the cascade standby and hit a problem which is
reproducible with the current HEAD. I haven't tried other branches, but I'm not
sure whether the test setup I am using even works for older releases because of
the timeline ID issue.

Anyways, I set up a cascaded standby such that it streams from the first
standby and then stopped the original master and promoted the first standby
to be the new master. If I then try to smart shutdown the cascaded standby,
it fails after waiting for the walreceiver to terminate. What's worse, the
walsender on the first standby gets into an infinite loop consuming 100%
CPU.

I tried to investigate this a bit, but haven't made progress worth
reporting. I can spend more time, but just wanted to make sure that I'm not
trying something which is a known issue or limitation. BTW, this is on my
Macbook Pro. Attached is the script that I used to set up the environment.
You will need to modify it for your setup though.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


test_cascade_stdby.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-06-05 Thread Bruce Momjian
On Wed, Jun  5, 2013 at 10:12:17AM +0100, Greg Stark wrote:
> On Thu, May 23, 2013 at 1:03 PM, Simon Riggs  wrote:
> > The COPYRIGHT file shows that VMware is claiming copyright on unstated
> > parts of the code for this. As such, its not a normal submission to
> > the PostgreSQL project, which involves placing copyright with the
> > PGDG.
> 
> 
> Fwiw I was under the same misconception when I started at Google. But
> this is wrong.
> 
> We have no copyright assignments to any entity named PGDG. All the
> code is copyright the original authors. The PGDG is just a collective
> noun for all the people and organizations who have contributed to
> Postgres. As long as all those people or organizations release the
> code under the Postgres license then Postgres is ok with it. They
> retain ownership of the copyright for the code they wrote but we don't
> generally note it at that level of detail and just say everything is
> owned by the PGDG.
> 
> I'm not a lawyer and I make no judgement on how solid a practice this
> is, but VMware doesn't seem to be doing anything special here.
> They can retain copyright ownership of their contributions as long as
> they're happy releasing it under the Postgres copyright. Ideally they
> would also be happy with a copyright notice that includes all of the
> PGDG just to reduce the maintenance headache.

Yes, completely true, and I was not clear on that myself either. 
Several people pointed out similar user copyrights in our existing code,
which I then realized were not a problem.  As long as the copyright
details are the same as our code, anyone can hold the copyright, I
think.

Part of my concern was patents.  Because VMWare asserts patents on
Postgres enhancements, when I saw VMWare-copyrighted code, my "concern"
antenna went up, and I was glad to find it had all been handled by Heikki
already.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-06-05 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Jun  5, 2013 at 10:12:17AM +0100, Greg Stark wrote:
>> I'm not a lawyer and I make no judgement on how solid a practice this
>> is, but VMware doesn't seem to be doing anything special here.
>> They can retain copyright ownership of their contributions as long as
>> they're happy releasing it under the Postgres copyright. Ideally they
>> would also be happy with a copyright notice that includes all of the
>> PGDG just to reduce the maintenance headache.

> Yes, completely true, and I was not clear on that myself either. 
> Several people pointed out similar user copyrights in our existing code,
> which I then realized were not a problem.  As long as the copyright
> details are the same as our code, anyone can hold the copyright, I
> think.

You're both being quite sloppy about the difference between "copyright"
and "license".  The point is correct though: what we care about is that
everybody releases their work under the same *license terms*.  As long
as that's the case, we don't care terribly much exactly who holds
copyright on which parts of the code.  (In this analysis, "PGDG" is
basically a shorthand for "everybody who's ever contributed anything".)

> Part of my concern was patents.  Because VMWare asserts patents on
> Postgres enhancements, when I saw VMWare copyright code, my "concern"
> antenna went up and was glad to find it had all be handled by Heikki
> already.

Yes, patents are a different and much nastier can of worms.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Tom Lane
Andres Freund  writes:
> On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
>> I thought there were many call sites that were specifically depending
>> on seeing dirty reads to avoid race conditions with other backends --
>> which probably just narrowed the race condition or created different
>> ones.

> But SnapshotNow doesn't allow you to do actual dirty reads?

Yeah.  I believe the issue is that we can't simply do MVCC catalog reads
with a snapshot taken at transaction start time or statement start time,
as we would do if executing an MVCC scan for a user query.  Rather, the
snapshot has to be recent enough to ensure we see the current definition
of any table we've just acquired lock on, *even if that's newer than the
snapshot prevailing for the user's purposes*.  Otherwise we might be
using the wrong rowtype definition or failing to enforce a just-added
constraint.
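
To illustrate the hazard in user-visible terms (a sketch only; the details
depend on the lock type and timing):

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;          -- data snapshot is taken here
-- meanwhile, another session runs and commits:
--   ALTER TABLE t ADD COLUMN c int;
SELECT * FROM t;   -- takes a lock on t; the catalog read must see the new
                   -- rowtype, even though the data snapshot is older
COMMIT;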

The last time we talked about this, we were batting around ideas of
keeping a "current snapshot for catalog purposes", which we'd update
or at least invalidate anytime we acquired a new lock.  (In principle,
if that isn't new enough, we have a race condition that we'd better fix
by adding some more locking.)  Robert's results seem to say that that
might be unnecessary optimization, and that it'd be sufficient to just
take a new snap each time we need to do a catalog scan.  TBH I'm not
sure I believe that; it seems to me that this approach is surely going
to create a great deal more contention from concurrent GetSnapshotData
calls.  But at the very least, this says we can experiment with the
behavioral aspects without bothering to build infrastructure for
tracking an appropriate catalog snapshot.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXPLAIN (ANALYZE) broken

2013-06-05 Thread Tom Lane
Kevin Grittner  writes:
> Commit 2c92edad48796119c83d7dbe6c33425d1924626d has broken
> EXPLAIN's ANALYZE option in parentheses, which also makes some
> other options unusable.

> test=# EXPLAIN (ANALYZE) SELECT 1;
> ERROR:  syntax error at or near "ANALYZE"
> LINE 1: EXPLAIN (ANALYZE) SELECT 1;
>                  ^

[ scratches head... ]  Coulda sworn I checked that.  Will look into it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] EXPLAIN (ANALYZE) broken

2013-06-05 Thread Kevin Grittner
Commit 2c92edad48796119c83d7dbe6c33425d1924626d has broken
EXPLAIN's ANALYZE option in parentheses, which also makes some
other options unusable.

test=# EXPLAIN (ANALYZE) SELECT 1;
ERROR:  syntax error at or near "ANALYZE"
LINE 1: EXPLAIN (ANALYZE) SELECT 1;
                 ^

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Andres Freund
On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
> On Wed, May 22, 2013 at 3:18 AM, Robert Haas  wrote:
> > We've had a number of discussions about the evils of SnapshotNow.  As
> > far as I can tell, nobody likes it and everybody wants it gone, but
> > there is concern about the performance impact.
 
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones.

But SnapshotNow doesn't allow you to do actual dirty reads? It only
gives you rows back that were actually visible when we checked. The
difference from SnapshotMVCC is that during a scan the picture of which
transactions are committed can change.

> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?

The places that require this should already use HeapTupleSatisfiesDirty
which is something different.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] JSON and unicode surrogate pairs

2013-06-05 Thread Andrew Dunstan


In 9.2, the JSON parser didn't check the validity of the use of unicode 
escapes other than that it required 4 hex digits to follow '\u'. In 9.3, 
that is still the case. However, the JSON accessor functions and 
operators also try to turn JSON strings into text in the server 
encoding, and this includes de-escaping \u sequences. This works fine 
except when there is a pair of sequences representing a UTF-16 type 
surrogate pair, something that is explicitly permitted in the JSON spec.


The attached patch is an attempt to remedy that: a surrogate pair is
turned into the correct code point before being converted to whatever the
server encoding is.
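
For illustration (the example value is made up, and this assumes a UTF-8
server encoding): \ud83d\ude04 is the surrogate pair for U+1F604, and with
the patch the accessors de-escape it instead of erroring out:

SELECT '{"face": "\ud83d\ude04"}'::json ->> 'face';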


Note that this would mean we can still put JSON with incorrect use of 
surrogates into the database, as now (9.2 and later), and they will 
cause almost all the accessor functions to raise an error, as now (9.3). 
All this does is allow JSON that uses surrogates correctly not to fail 
when applying the accessor functions and operators. That's a possible 
violation of POLA, and at least worthy of a note in the docs, but I'm not
sure what else we can do now - adding this check to the input lexer 
would possibly cause restores to fail, which users might not thank us for.


cheers

andrew
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index aaf99bd..28868fb 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -646,6 +646,7 @@ json_lex_string(JsonLexContext *lex)
 {
 	char	   *s;
 	int			len;
+	int hi_surrogate = -1;
 
 	if (lex->strval != NULL)
 		resetStringInfo(lex->strval);
@@ -718,6 +719,36 @@ json_lex_string(JsonLexContext *lex)
 	int			utf8len;
 	char	   *converted;
 
+	if (ch >= 0xd800 && ch <= 0xdbff)
+	{
+		if (hi_surrogate != -1)
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+	 errmsg("invalid input syntax for type json"),
+	 errdetail("high order surrogate must not follow a high order surrogate."),
+	 report_json_context(lex)));
+		hi_surrogate = (ch & 0x3ff) << 10;
+		continue;
+	}
+	else if (ch >= 0xdc00 && ch <= 0xdfff)
+	{
+		if (hi_surrogate == -1)
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+	 errmsg("invalid input syntax for type json"),
+	 errdetail("low order surrogate must follow a high order surrogate."),
+	 report_json_context(lex)));
+		ch = 0x10000 + hi_surrogate + (ch & 0x3ff);
+		hi_surrogate = -1;
+	}
+
+	if (hi_surrogate != -1)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for type json"),
+ errdetail("low order surrogate must follow a high order surrogate."),
+ report_json_context(lex)));
+
 	unicode_to_utf8(ch, (unsigned char *) utf8str);
 	utf8len = pg_utf_mblen((unsigned char *) utf8str);
 	utf8str[utf8len] = '\0';
@@ -730,6 +761,13 @@ json_lex_string(JsonLexContext *lex)
 			}
 			else if (lex->strval != NULL)
 			{
+if (hi_surrogate != -1)
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+			 errmsg("invalid input syntax for type json"),
+			 errdetail("low order surrogate must follow a high order surrogate."),
+			 report_json_context(lex)));
+
 switch (*s)
 {
 	case '"':
@@ -784,11 +822,25 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+			if (hi_surrogate != -1)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("invalid input syntax for type json"),
+		 errdetail("low order surrogate must follow a high order surrogate."),
+		 report_json_context(lex)));
+			
 			appendStringInfoChar(lex->strval, *s);
 		}
 
 	}
 
+	if (hi_surrogate != -1)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("invalid input syntax for type json"),
+ errdetail("low order surrogate must follow a high order surrogate."),
+ report_json_context(lex)));
+
 	/* Hooray, we found the end of the string! */
 	lex->prev_token_terminator = lex->token_terminator;
 	lex->token_terminator = s + 1;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Hannu Krosing
On 06/05/2013 04:28 PM, Greg Stark wrote:
> On Wed, May 22, 2013 at 3:18 AM, Robert Haas  wrote:
>> We've had a number of discussions about the evils of SnapshotNow.  As
>> far as I can tell, nobody likes it and everybody wants it gone, but
>> there is concern about the performance impact.
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.
>
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones. If you clean those all up it will probably be cleaner and
> better but we don't know how many such sites will need to be modified.
> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?
I guess that anything that does *not* write should be happier with an
MVCC catalogue, especially if there has been any DDL after its snapshot.

For writers, the ability to compare MVCC and SnapshotNow snapshots
would tell them whether they need to take extra steps.

But undoubtedly the whole thing would be a lot of work.

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-06-05 Thread Greg Stark
On Thu, May 30, 2013 at 7:48 PM, Josh Berkus  wrote:
> The big, big picture is this:
>
>90% of our users need to think about VACUUM/ANALYZE
>at least 10% of the time
>and 10% of our users need to think about it
>almost 90% of the time.
>
> That's considerably better than was the case 5 years ago, when vacuum
> management was a daily or weekly responsibility for nearly 100% of our
> users,

Fwiw I think this is not the right picture. I think the current
situation is an accurate description of the way things are and have
always been.

It's an arms race. We've raised the bar of how large and busy your
database has to be before vacuum becomes a pain, and users scale their
databases up. As long as we stay one step ahead of the users, 90% of
users won't have to think about vacuum/analyze much. There will always
be outliers.

When the visibility map went in, the argument was that wraparound was
so rare that it wasn't worth doubling the size of the visibility map
to have a second bit. If the table gets even a low amount of traffic,
nearly all blocks will need to be frozen anyway by that time. To do
something like the visibility map for freezing, we would need something
like a map that stores the high 8 bits of the oldest unfrozen xid in
the block. That would be a lot more complex and take a lot more space.
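
A rough back-of-envelope in SQL, for a hypothetical 1TB table with 8KB
blocks - one bit per block for the visibility map versus one byte per
block for such a freeze map:

SELECT pg_size_pretty((2^40 / 8192 / 8)::bigint) AS vm_size,        -- 16 MB
       pg_size_pretty((2^40 / 8192)::bigint) AS freeze_map_size;    -- 128 MB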





-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MVCC catalog access

2013-06-05 Thread Greg Stark
On Wed, May 22, 2013 at 3:18 AM, Robert Haas  wrote:
> We've had a number of discussions about the evils of SnapshotNow.  As
> far as I can tell, nobody likes it and everybody wants it gone, but
> there is concern about the performance impact.

I was always under the impression that the problem was we weren't
quite sure what changes would be needed to make mvcc-snapshots work
for the catalog lookups. The semantics of SnapshotNow aren't terribly
clear either but we have years of experience telling us they seem to
basically work. Most of the problems we've run into we either have
worked around in the catalog accesses. Nobody really knows how many of
the call sites will need different logic to behave properly with mvcc
snapshots.

I thought there were many call sites that were specifically depending
on seeing dirty reads to avoid race conditions with other backends --
which probably just narrowed the race condition or created different
ones. If you clean those all up it will probably be cleaner and
better but we don't know how many such sites will need to be modified.
I'm not even sure what "clean them up" means. You can replace checks
with things like constraints and locks but the implementation of
constraints and locks will still need to use SnapshotNow surely?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time for beta2 ?

2013-06-05 Thread Andres Freund
On 2013-06-05 09:50:07 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > We've almost cleared the open items list, and I'm not aware of any other 
> > unfixed issues in 9.3beta1. Could we make a beta2 release soon? There 
> > have been a bunch of recovery-related fixes since beta1, it would be 
> > nice to get those fixes in the hands of testers.
> 
> There are at least two initdb-forcing things on the table, which I think
> we need to resolve before we push beta2:

> Has anyone got anything else of that ilk?

While obviously not really dependent on beta2, I'd like to see
20130402182644.gj2...@alap2.anarazel.de applied to HEAD in time for beta2,
so it can get some coverage before the next set of back-branch releases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Time for beta2 ?

2013-06-05 Thread Tom Lane
Heikki Linnakangas  writes:
> We've almost cleared the open items list, and I'm not aware of any other 
> unfixed issues in 9.3beta1. Could we make a beta2 release soon? There 
> have been a bunch of recovery-related fixes since beta1, it would be 
> nice to get those fixes in the hands of testers.

There are at least two initdb-forcing things on the table, which I think
we need to resolve before we push beta2:

1. The business around being able to tell whether views and foreign
tables are updatable; all of the proposals for this will require
adding/changing some pg_proc entries and probably changing system view
definitions.

2. Simon was suggesting some changes in the functions for reporting
whether a backup is in progress.

Has anyone got anything else of that ilk?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Redesigning checkpoint_segments

2013-06-05 Thread Heikki Linnakangas
checkpoint_segments is awkward. From an admin's point of view, it 
controls two things:


1. it limits the amount of disk space needed for pg_xlog. (it's a soft 
limit, but still)

2. it limits the time required to recover after a crash.

For limiting the disk space needed for pg_xlog, checkpoint_segments is 
awkward because it's defined in terms of 16MB segments between 
checkpoints. It takes a fair amount of arithmetic to calculate the disk 
space required to hold the specified number of segments. The manual 
gives the formula: (2 + checkpoint_completion_target) * 
checkpoint_segments + 1, which amounts to about 1GB per 20 segments as a 
rule of thumb. We shouldn't impose that calculation on the user. It 
should be possible to just specify "checkpoint_segments=512MB", and the 
system would initiate checkpoints so that the total size of WAL in 
pg_xlog stays below 512MB.
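
To see just how awkward that arithmetic is, here is the manual's formula
worked through in SQL for an illustrative checkpoint_segments = 64 and
checkpoint_completion_target = 0.5, with 16MB segments:

SELECT pg_size_pretty((((2 + 0.5) * 64 + 1) * 16 * 1024 * 1024)::bigint);
-- 2576 MB of pg_xlog, just to find out what one setting costs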


For limiting the time required to recover after crash, 
checkpoint_segments is awkward because it's difficult to calculate how 
long recovery will take, given checkpoint_segments=X. A bulk load can 
use up segments really fast, and recovery will be fast, while segments 
full of random deletions can need a lot of random I/O to replay, and 
take a long time. IMO checkpoint_timeout is a much better way to control 
that, although it's not perfect either.


A third point is that even if you have 10 GB of disk space reserved for 
WAL, you don't want to actually consume all that 10 GB, if it's not 
required to run the database smoothly. There are several reasons for 
that: backups based on a filesystem-level snapshot are larger than
necessary if there are a lot of preallocated WAL segments, and in a
virtualized or shared system there might be other VMs or applications
that could make use of the disk space. On the other hand, you don't want 
to run out of disk space while writing WAL - that can lead to a PANIC in 
the worst case.



In VMware's vPostgres fork, we've hacked the way that works, so that 
there is a new setting, checkpoint_segments_max that can be set by the 
user, but checkpoint_segments is adjusted automatically, on the fly. The 
system counts how many segments were consumed during the last checkpoint 
cycle, and that becomes the checkpoint_segments setting for the next 
cycle. That means that in a system with a steady load, checkpoints are 
triggered by checkpoint_timeout, and the effective checkpoint_segments 
value converges at the exact number of segments needed for that. That's 
simple but very effective. It doesn't behave too well with bursty load, 
however; during quiet times, checkpoint_segments is dialed way down, and 
when the next burst comes along, you get several checkpoints in quick 
succession, until checkpoint_segments is dialed back up again.



I propose that we do something similar, but not exactly the same. Let's 
have a setting, max_wal_size, to control the maximum disk space reserved 
for WAL. Once that's reached (or you get close enough, so that there are 
still some segments left to consume while the checkpoint runs), a 
checkpoint is triggered.


In this proposal, the number of segments preallocated is controlled 
separately from max_wal_size, so that you can set max_wal_size high, 
without actually consuming that much space in normal operation. It's 
just a backstop, to avoid completely filling the disk, if there's a 
sudden burst of activity. The number of segments preallocated is 
auto-tuned, based on the number of segments used in previous checkpoint 
cycles.


I'll write up a patch to do that, but before I do, does anyone disagree
with those tuning principles? How do you typically tune
checkpoint_segments on your servers? If the system were to tune it
automatically, what formula should it use?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-06-05 Thread Amit Kapila
On Monday, May 27, 2013 4:17 PM Amit Kapila wrote:
> On Wednesday, April 03, 2013 11:55 AM Amit Kapila wrote:
> > On Tuesday, April 02, 2013 9:49 PM Peter Eisentraut wrote:
> 

There are two options for proceeding with this patch for 9.4:

1. Upload the SET PERSISTENT syntax patch for the coming CF, fixing the
   existing review comments
2. Implement the new ALTER SYSTEM syntax as proposed in the mail below

Could you suggest the best way to proceed with this patch?
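
For reference, the two candidate syntaxes side by side (the parameter and
value are purely illustrative; neither form is committed yet):

SET PERSISTENT checkpoint_timeout TO 600;    -- option 1, the existing patch
ALTER SYSTEM SET checkpoint_timeout = 600;   -- option 2, the proposed variant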

> This feature was discussed for 9.3, but couldn't get committed.
> History for this patch is that in the last, Peter Eisentraut has given
> quite a few review comments, most of which I have closed in the patch
> attached.
> 
> Apart from that Robert and Tom intended to use ALTER SYSTEM or variant
> of ALTER rather than SET. Syntax can be as below:
> 
> ALTER SYSTEM SET configuration_parameter {TO | =} {value | 'value'};
> 
> Some Comments of Peter E which needs to be discussed before changing
> are as
> below:
> 
> 1) Name of directory should be postgresql.conf.d rather than config.
> 2) Name of the file persistent.auto.conf should not have auto word.
> 
> I have not modified the code, as these had been previously discussed in
> this mail chain and as a conclusion of that, I had kept the current
> names for file and directory.
> 
> > > I'm going to ignore most of the discussion that led up to this and
> > give
> > > this patch a fresh look.
> >
> > Thank you.
> >
> > > + 
> > > + SET PERSISTENT max_connections To 10; 
> > >
> > > The To should probably be capitalized.
> >
> > Okay, shall fix it.
> 
> Changed to
> SET PERSISTENT checkpoint_timeout TO 600
> 
> > > I doubt this example actually works because changing
> max_connections
> > > requires a restart.  Try to find a better example.
> >
> > Any parameter's changed with this command can be effective either
> > after restart or SIGHUP.
> > So I will change it to SIGHUP parameter.
> >
> > > It's weird that SET LOCAL and SET SESSION actually *set* the value,
> > and
> > > the second key word determines how long the setting will last.  SET
> > > PERSISTENT doesn't actually set the value.  I predict that this
> will
> > be
> > > a new favorite help-it-doesn't-work FAQ.
> > >
> > >   
> > >SCHEMA
> > >
> > > !   SET [ PERSISTENT ] SCHEMA
> > > 'value' is an alias for
> > >  SET search_path TO value. Only one
> > >  schema can be specified using this syntax.
> > > 
> > >
> > > I don't think [ PERSISTENT ] needs to be added in these and similar
> > > snippets.  We don't mention LOCAL etc. here either.
> >
> > Agreed, shall fix this.
> 
> Removed from all similar places.
> 
> > > --- 34,41 
> > >   postgresql.conf,
> > > pg_hba.conf, and
> > >   pg_ident.conf are traditionally stored in
> > >   PGDATA (although in
> > PostgreSQL
> > > 8.0 and
> > > ! later, it is possible to place them elsewhere). By default the
> > > directory config is stored ! in PGDATA, however it
> > > should be kept along with postgresql.conf.
> > >   
> > >
> > >   
> > >
> > > This chunk doesn't make any sense to me.  "Should" is always
> tricky.
> > > Why should I, and why might I not?
> >
> > I mean to say here, that user needs to move config directory and its
> > contents along with postgresql.conf.
> > Shall we change as below:
> > By default config directory is stored in PGDATA, however it needs to
> > be kept along with postgresql.conf
> 
> Changed as proposed above.
> 
> >
> > >   
> > > +  config
> > > +  Subdirectory containing automatically generated
> > configuration
> > > files
> > > + 
> > > +
> > > + 
> > >base
> > >Subdirectory containing per-database
> subdirectories
> > >   
> > >
> > > Only automatically generated ones?
> >
> > No, any other files can also be present.
> > How about change it as :
> > Subdirectory containing generated configuration files.
> > Any other suggestions?
> 
> Changed as above.
> 
> > This new directory will be used to place generated files,
> >
> > > COPY_STRING_FIELD(name);
> > > COPY_NODE_FIELD(args);
> > > COPY_SCALAR_FIELD(is_local);
> > > +   COPY_SCALAR_FIELD(is_persistent);
> > >
> > > return newnode;
> > >   }
> > >
> > > I suggest changing is_local into a new trivalued field that stores
> > > LOCAL or SESSION or PERSISTENT.
> > >
> > > n->is_local = false;
> > > $$ = (Node *) n;
> > > }
> >
> > Okay, I will change it.
> >
> > > +   | SET PERSISTENT set_persistent
> > > +   {
> > > +   VariableSetStmt *n = $3;
> > > +   n->is_persistent = true;
> > > +   $$ = (Node *) n;
> > > +   }
> > > ;
> > >
> > >   set_rest:
> > >
> > > Why can't you use SET PERSISTENT set_rest?
> >
>

Re: [HACKERS] Optimising Foreign Key checks

2013-06-05 Thread Hannu Krosing
On 06/05/2013 11:37 AM, Greg Stark wrote:
> On Sat, Jun 1, 2013 at 9:41 AM, Simon Riggs  wrote:
>> COMMIT;
>> The inserts into order_line repeatedly execute checks against the same
>> ordid. Deferring and then de-duplicating the checks would optimise the
>> transaction.
>>
>> Proposal: De-duplicate multiple checks against same value. This would
>> be implemented by keeping a hash of rows that we had already either
>> inserted and/or locked as the transaction progresses, so we can use
>> the hash to avoid queuing up after triggers.
>
> Fwiw the reason we don't do that now is that the rows might be later
> deleted within the same transaction (or even the same statement I
> think). If they are then the trigger needs to be skipped for that row
> but still needs to happen for other rows. So you need to do some kind
> of book-keeping to keep track of that. The easiest way was just to do
> the check independently for each row. I think there's a comment about
> this in the code.
A simple counter on each value should also solve this:
increment it for each insert, decrement it for each deletion,
then run the checks only on values whose counter is > 0.
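
Modelled in SQL terms, purely as a sketch (the real bookkeeping would live
in the after-trigger queue, not in a table):

CREATE TEMP TABLE fk_check_queue (keyval int, delta int); -- +1 insert, -1 delete
-- at commit, probe the referenced table only for keys that are still
-- net-positive:
SELECT keyval FROM fk_check_queue GROUP BY keyval HAVING sum(delta) > 0;
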
> I think you're right that this should be optimized because in the vast
> majority of cases you don't end up deleting rows and we're currently
> doing lots of redundant checks. But you need to make sure you don't
> break the unusual case entirely.
>


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimising Foreign Key checks

2013-06-05 Thread Greg Stark
On Sat, Jun 1, 2013 at 9:41 AM, Simon Riggs  wrote:
> COMMIT;
> The inserts into order_line repeatedly execute checks against the same
> ordid. Deferring and then de-duplicating the checks would optimise the
> transaction.
>
> Proposal: De-duplicate multiple checks against same value. This would
> be implemented by keeping a hash of rows that we had already either
> inserted and/or locked as the transaction progresses, so we can use
> the hash to avoid queuing up after triggers.


Fwiw the reason we don't do that now is that the rows might be later
deleted within the same transaction (or even the same statement I
think). If they are then the trigger needs to be skipped for that row
but still needs to happen for other rows. So you need to do some kind
of book-keeping to keep track of that. The easiest way was just to do
the check independently for each row. I think there's a comment about
this in the code.

I think you're right that this should be optimized because in the vast
majority of cases you don't end up deleting rows and we're currently
doing lots of redundant checks. But you need to make sure you don't
break the unusual case entirely.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind, a tool for resynchronizing an old master after failover

2013-06-05 Thread Greg Stark
On Thu, May 23, 2013 at 1:03 PM, Simon Riggs  wrote:
> The COPYRIGHT file shows that VMware is claiming copyright on unstated
> parts of the code for this. As such, its not a normal submission to
> the PostgreSQL project, which involves placing copyright with the
> PGDG.


Fwiw I was under the same misconception when I started at Google. But
this is wrong.

We have no copyright assignments to any entity named PGDG. All the
code is copyright the original authors. The PGDG is just a collective
noun for all the people and organizations who have contributed to
Postgres. As long as all those people or organizations release the
code under the Postgres license then Postgres is ok with it. They
retain ownership of the copyright for the code they wrote but we don't
generally note it at that level of detail and just say everything is
owned by the PGDG.

I'm not a lawyer and I make no judgement on how solid a practice this
is, but VMware doesn't seem to be doing anything special here.
They can retain copyright ownership of their contributions as long as
they're happy releasing it under the Postgres copyright. Ideally they
would also be happy with a copyright notice that includes all of the
PGDG just to reduce the maintenance headache.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to find out whether a view is updatable

2013-06-05 Thread Dean Rasheed
On 4 June 2013 23:35, Michael Paquier  wrote:
>
>
>
> On Wed, Jun 5, 2013 at 12:59 AM, Peter Eisentraut  wrote:
>>
>> I was looking for a way in which the average psql user could learn
>> whether a view is updatable.  I was expecting something in \d, \d+, \dv,
>> \dv+, or a NOTICE from CREATE VIEW.  So far, the only way appears to be
>> through the information schema or the underlying pg_view_is_updatable
>> function.  Not even pg_views shows anything.  Is this intentional or an
>> oversight?
>
> Just by recalling the thread, an oversight. Having this information in \dv+
> would indeed be a nice addition.

Yes, agreed -- something like this would be nice. It's not just views
though -- foreign tables may now also be updatable, so I think it
should work for \d+ in general, not just \dv+.

Perhaps we should add a new column to \d+'s list of relations
(provided that doesn't make it too wide) and add an extra line at the
end of the \d+ description for a single relation. Should this also
distinguish between insertable, updatable and deletable (i.e., support
for INSERT, UPDATE and DELETE)?


I'm still not happy with pg_view_is_updatable() et al. and the
information_schema views. I accept that the information_schema views
have to be the way they are because that's what's defined in the
standard, but as it stands, the distinction between updatable and
trigger-updatable makes it impossible in general to answer the simple
question "does foo support UPDATEs?".

I'm thinking what we really need is a single function with a slightly
different signature, that can be used to support both the information
schema views and psql's \d+ (and potentially other client apps).
Perhaps something like:-

  pg_relation_is_updatable(include_triggers boolean)
  returns int

which would work for all relation kinds, returning a bitmask
indicating which of the operations (INSERT, UPDATE and DELETE) are
supported, together with a matching function in the FDW API.
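
Hypothetical usage, with a relation argument added and made-up bit
assignments (1 = INSERT, 2 = UPDATE, 4 = DELETE), just to show the shape:

SELECT (pg_relation_is_updatable('some_view'::regclass, false) & 2) <> 0
       AS supports_update;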

Thoughts?

Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers