Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 2:34 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 As the patch stands, whenever XLOG segment is switched in XLogInsert, we
 wait for the segment to be sent to the standby server. That's not good.
 Particularly in asynchronous mode, you'd expect the standby to not have any
 significant ill effect on the master. But in case of a flaky network
 connection, or a busy or dead standby, it can take a long time for the
 standby to respond, or the primary to give up. During that time, all WAL
 insertions on the primary are blocked. (How long is the default TCP timeout
 again?)

 Another point is that in the future, we really shouldn't require setting up
 archiving and file-based log shipping using external scripts, when all you
 want is replication. It should be enough to restore a base backup on the
 standby, and point it to the IP address of the primary, and have it catch
 up. This is very important, IMHO. It's quite a lot of work to set up
 archiving and log-file shipping, for no obvious reason. It's really only
 needed at the moment because we're building this feature from spare parts.


 I had similar suggestions when I first wrote the high level design doc.
 From the wiki page:

 - WALSender reads from WAL buffers and/or WAL files and sends the
 buffers to WALReceiver. In phase one, we may assume that WALSender can
 only read from WAL buffers and WAL files in pg_xlog directory. Later
 on, this can be improved so that WALSender can temporarily restore
 archived files and read from that too.

You mean that a single walsender performs xlog streaming and copying
from pg_xlog serially? I think this would degrade performance.
I'm also worried about the situation where xlog is generated on the
primary faster than it can be copied to the standby; in that case we
might never be able to start xlog streaming.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao masao.fu...@gmail.com wrote:


 I'm also worried about the situation where xlog is generated on the
 primary faster than it can be copied to the standby; in that case we
 might never be able to start xlog streaming.


If that's the case, how do you expect the standby to keep pace with
the primary after the initial sync-up? Frankly, I doubt that on a
relatively high-load setup the standby will be able to keep pace with
the primary, for two reasons:

- Lack of read ahead of data blocks (Suzuki-san's work may help this)
- Single threaded recovery

But then these are general problems which may impact any log-based replication.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 5:48 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao masao.fu...@gmail.com wrote:


 I'm also worried about the situation where xlog is generated on the
 primary faster than it can be copied to the standby; in that case we
 might never be able to start xlog streaming.


 If that's the case, how do you expect the standby to keep pace with
 the primary after initial sync up ?

Good question. If streaming and copying are performed in parallel,
such a situation doesn't happen, because the speed at which xlog is
generated also depends on streaming. This is a price to pay. I think
the serial operations would need a pacemaker, and I don't know a
better pacemaker than concurrent streaming.

 Frankly, I doubt that on a relatively high-load setup the standby
 will be able to keep pace with the primary, for two reasons:

 - Lack of read ahead of data blocks (Suzuki-san's work may help this)
 - Single threaded recovery

 But then these are general problems which may impact any log-based 
 replication.

Right. Keeping up under a completely high-load setup is probably
impossible; there is certainly a price to pay. But to reduce that price
as much as possible, I think we should not pile two or more operations
onto a single process (walsender), for the same reason that
single-threaded recovery is a bottleneck.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:

  We might ask why pg_start_backup() needs to perform checkpoint though,
  since you have remarked that is a problem also.
 
  The answer is that it doesn't really need to, we just need to be certain
  that archiving has been running since whenever we choose as the start
  time. So we could easily just use the last normal checkpoint time, as
  long as we had some way of tracking the archiving.
 
  ISTM we can solve the checkpoint problem more easily and it would
  potentially save much more time than tuning rsync for Postgres, which
  is what the other idea amounted to. So I do see a solution that is both
  better and more quickly achievable for 8.4.
 
 Sounds good. I agree that pg_start_backup basically doesn't need a
 checkpoint. But for full_page_writes == off, we probably cannot get
 rid of it. Even if full_page_writes == on, since we cannot determine
 whether all indispensable full pages were written after the last
 checkpoint, pg_start_backup must do a checkpoint with forcePageWrite = on.

Yes, OK. So I think it would only work when full_page_writes = on, and
has been on since last checkpoint. So two changes:

* We just need a boolean that starts at true every checkpoint and gets
set to false anytime someone resets full_page_writes or archive_command.
If the flag is set && full_page_writes = on then we skip the checkpoint
entirely and use the value from the last checkpoint.

* My infra patch also had a modified version of pg_start_backup() that
allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
a waste of time, and I want to listen to everybody else now and change
pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
there.

Can you work on those also?
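The flag logic in the first bullet could be sketched roughly like this (a minimal illustration, not actual PostgreSQL code; all names here are hypothetical):

```c
#include <stdbool.h>

/* Sketch of Simon's proposal: a flag that starts out true at every
 * checkpoint and is cleared whenever full_page_writes or
 * archive_command is reset. */
typedef struct BackupFlagState {
    bool settings_stable_since_checkpoint;
    bool full_page_writes;
} BackupFlagState;

/* Called at every checkpoint: the flag starts at true again. */
void on_checkpoint(BackupFlagState *s) {
    s->settings_stable_since_checkpoint = true;
}

/* Called anytime someone resets full_page_writes or archive_command. */
void on_setting_reset(BackupFlagState *s) {
    s->settings_stable_since_checkpoint = false;
}

/* pg_start_backup may skip its forced checkpoint only if the flag is
 * still set AND full_page_writes is on, and instead use the values
 * from the last checkpoint. */
bool can_skip_checkpoint(const BackupFlagState *s) {
    return s->settings_stable_since_checkpoint && s->full_page_writes;
}
```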

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Archiving control (a part of synch rep patches)

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 4:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
 (3)
 Revert the archive status file of the specified file from .done to .ready.
 This would try to archive the corresponding xlog file again. Only the
 archive status of a skipped xlog file can be reverted.

 We resume xlog archiving by using (2) and (3), when replication
 ends.

Umm.. reverting is very tricky, and it might become a hotbed of bugs.
So I will reconsider the method of resuming archiving.

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00723.php
 That conceivably might mean we need
 to write a .ready message after a WAL file filled, which might mean we
 have problems if the replication timeout is longer than the checkpoint
 timeout, but that seems an unlikely configuration. And if anybody has a
 problem with that we just recommend they use the always mode.

It's also a problem that reverting competes with xlog deletion at
checkpoint.

Perhaps, during replication, we should create .ready only after the
xlog file fills and has been replicated. I will consider this approach
more deeply.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 18:31 +0900, Fujii Masao wrote:

  Frankly, I doubt that on a relatively high-load setup the standby
  will be able to keep pace with the primary, for two reasons:
 
  - Lack of read ahead of data blocks (Suzuki-san's work may help this)
  - Single threaded recovery
 
  But then these are general problems which may impact any log-based 
  replication.
 
 Right. Keeping up under a completely high-load setup is probably
 impossible; there is certainly a price to pay. But to reduce that price
 as much as possible, I think we should not pile two or more operations
 onto a single process (walsender), for the same reason that
 single-threaded recovery is a bottleneck.

I think we may be pleasantly surprised. 

In 8.3 there were two main sources of wait:

* restartpoints
* waiting for archive files

Restartpoints will now be handled by bgwriter, giving probably 20% gain,
plus the WAL data is streamed directly into memory by walreceiver. So I
think the startup process may achieve a better steady state and perform
very quickly.

Suzuki-san's numbers show that full_page_writes = on does not benefit
significantly from having read ahead and we already know that is
effective in reducing the I/O bottleneck during recovery.

If we want to speed up recovery more, I think we'll see the need for an
additional process to do WAL CRC checks. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 3:01 PM, Fujii Masao masao.fu...@gmail.com wrote:


 Good question. If streaming and copying are performed in parallel,
 such a situation doesn't happen, because the speed at which xlog is
 generated also depends on streaming. This is a price to pay. I think
 the serial operations would need a pacemaker, and I don't know a
 better pacemaker than concurrent streaming.


These operations need not even be parallel. My apologies if this has
been discussed before, but what we are talking about is just a stream
of WAL starting at some LSN. The only difference is that the LSN
itself may be in buffers or in the files. So walsender would send as
much as it can from the files and then switch to reading from buffers.
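A rough sketch of that hand-off, under the assumption that walsender tracks a send pointer and the WAL buffers expose the oldest LSN they still retain (all types and names here are hypothetical, not actual PostgreSQL code):

```c
#include <stdint.h>

/* One WAL stream starting at some LSN: serve it from files in pg_xlog
 * until we catch up to the oldest LSN still held in the in-memory WAL
 * buffers, then switch to reading from the buffers. */
typedef uint64_t XLogRecPtr;

typedef struct WalSource {
    XLogRecPtr oldest_buffered;   /* lowest LSN still in WAL buffers */
    XLogRecPtr insert_pos;        /* current end of WAL */
} WalSource;

typedef enum { READ_FROM_FILE, READ_FROM_BUFFERS, NOTHING_TO_SEND } ReadSource;

/* Decide where the next chunk starting at 'sendptr' must come from. */
ReadSource choose_source(const WalSource *w, XLogRecPtr sendptr) {
    if (sendptr >= w->insert_pos)
        return NOTHING_TO_SEND;       /* fully caught up */
    if (sendptr < w->oldest_buffered)
        return READ_FROM_FILE;        /* already evicted to pg_xlog */
    return READ_FROM_BUFFERS;         /* still in shared memory */
}
```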

Also, I think you are underestimating the network for most practical
purposes. Networks are usually not the bottleneck unless we are talking
about slow WAN setups, which I'm not sure are common among PG users.

 Right. Completely high load setup is probably impossible.

If that's the case, I don't think you need to worry too much about the
network or walsender being a bottleneck for the initial sync-up (note
that we are only talking about WAL sync-up here, not the base backup).

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote:


 If we want to speed up recovery more, I think we'll see the need for an
 additional process to do WAL CRC checks.


Yeah, any such helper process, along with other optimizations, would
certainly help. But I still can't believe that on a high-load, high-end
setup a single recovery process, without any read-ahead for data
blocks, can keep pace with the WAL generated by hundreds of processes
at the primary and shipped over a high-speed link to the standby.

BTW, on a completely different note: given that the entire recovery is
based on physical redo, are there any inherent limitations that prevent
parallel recovery, where different recovery processes apply redo to
completely independent sets of data blocks? I also sometimes wonder
why we don't have block-level recovery when a single block in the
database is corrupted. Can't this be done by just selectively applying
WAL records to that particular block? If it's just because nobody has
had the time or interest to do this, then it's OK, but I wonder if
there are any design issues.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Archiving control (a part of synch rep patches)

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 19:04 +0900, Fujii Masao wrote:
 Hi,
 
 On Wed, Dec 24, 2008 at 4:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
  (3)
  Revert the archive status file of the specified file from .done to .ready.
  This would try to archive the corresponding xlog file again. Only the
  archive status of a skipped xlog file can be reverted.
 
  We resume xlog archiving by using (2) and (3), when replication
  ends.
 
 Umm.. reverting is very tricky, and it might become a hotbed of bugs.
 So I will reconsider the method of resuming archiving.

Yeh, don't do that.

No worries if we send a few files too many. Tears if we miss just one.

* Turn it off for the *next* file to be archived. Keep track of last
file sent.

* Turn it on again, mark *all* WAL files as .ready in sequence low to
high, as long as they are higher than the last file sent.
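The second bullet could be sketched roughly as follows, assuming we have the segment names in ascending order and a callback that re-creates the .ready status file (all names here are hypothetical, not actual PostgreSQL code):

```c
#include <string.h>

/* Sketch of Simon's resume procedure: mark every WAL segment higher
 * than the last one sent as .ready again, low to high, so the archiver
 * picks them up. Sending a file twice is harmless; missing one is not. */
int mark_segments_ready(const char *segnames[], int nsegs,
                        const char *last_sent,
                        int (*make_ready)(const char *segname)) {
    int marked = 0;
    for (int i = 0; i < nsegs; i++) {
        /* WAL segment file names sort lexicographically in LSN order,
         * so a plain string comparison orders them correctly. */
        if (strcmp(segnames[i], last_sent) > 0) {
            if (make_ready(segnames[i]) == 0)
                marked++;
        }
    }
    return marked;
}

/* Trivial callback for illustration: pretend the status file rename
 * always succeeds. */
int make_ready_noop(const char *segname) { (void)segname; return 0; }
```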

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] generic reloptions improvement

2008-12-24 Thread Peter Eisentraut
On Monday 22 December 2008 18:24:53 Alvaro Herrera wrote:
  Alvaro Herrera alvhe...@commandprompt.com wrote:
   Here's a patch for improving the general reloptions mechanism.  What
   this patch does is add a table-based option parser.  This allows adding
   new options very easily, and stops the business of having to pass the
   minimum and default fillfactor each time you want the reloptions
   processed.

 Here's the next version, which also fixes some particularly embarrassing
 bugs.

I'm not sure how important this is, but if you are enumerating the access 
methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined 
access methods?



Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 15:51 +0530, Pavan Deolasee wrote:
 On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
 
  If we want to speed up recovery more, I think we'll see the need for an
  additional process to do WAL CRC checks.
 
 
 Yeah, any such helper process, along with other optimizations, would
 certainly help. But I still can't believe that on a high-load, high-end
 setup a single recovery process, without any read-ahead for data
 blocks, can keep pace with the WAL generated by hundreds of processes
 at the primary and shipped over a high-speed link to the standby.

Suzuki-san has provided measurements. I think we need more. With
bgwriter performing restartpoints, we'll find that more RAM helps much
more than it did previously.

 BTW, on a completely different note: given that the entire recovery is
 based on physical redo, are there any inherent limitations that prevent
 parallel recovery, where different recovery processes apply redo to
 completely independent sets of data blocks?

That's possible, but will significantly complicate the recovery code.
Retaining the ability to do standby queries would be almost impossible
in that case, since you would need to parallelise the WAL stream without
changing the commit order of transactions.

The main CPU bottleneck is CRC, by a long way. If we move effort away
from startup process that is the best next action, AFAICS.

 I also
 sometimes wonder why we don't have block level recovery when a single
 block in the database is corrupted. Can't this be done by just
 selectively applying WAL records to that particular block ? If it's
 just because nobody had time/interest to do this, then it's OK, but I
 wonder if there are any design issues.

You'll be able to do this with my rmgr patch. Selective application of
WAL records is one of the primary use cases, but there are many others.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] generic reloptions improvement

2008-12-24 Thread Alvaro Herrera
Peter Eisentraut wrote:

 I'm not sure how important this is, but if you are enumerating the access 
 methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined 
 access methods?

It is important.

I'm intending to add a new routine which would reserve a value at
runtime. This value would later be passed by the AM to create new
options on the table.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Simon Riggs

On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote:
 On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
  Perhaps we should listen to the people that have said they don't want
  queries cancelled, even if the alternative is inconsistent answers. That
  is easily possible yet is not currently an option. Plus we have the
  option I referred to up thread, which is to defer query cancel until the
  query reads a modified data block. I'm OK with implementing either of
  those, as non-default options. Do we need those options or are we ok?
 
 
 Haven't seen any feedback on this, but I think the two options of cancel
 query for replay, and pause replay for queries, are probably enough for a
 first go-around (especially if you can get the query cancelling to work
 only when changes are made to the specific database in question).

Thanks for picking up on this. This question is the #1 usability issue
for Hot Standby, since at least May 2008. There are many potential
additions and we need to track this carefully over the next few months
to see if we have it just right. I'll take viewpoints at any time on
that; this door is never closed, though tempus fugit.

Greg and Heikki have highlighted in this thread some aspects of btree
garbage collection that will increase the chance of queries being
cancelled in various circumstances. If this is important enough to
trigger additional actions then we need to highlight that now so we have
time to take those corrective actions. 

I've listened to many different viewpoints on and off list. Everybody
takes a slightly different angle on it and I'm in favour of giving
everybody what they want with the right set of options.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs si...@2ndquadrant.com wrote:


 Greg and Heikki have highlighted in this thread some aspects of btree
 garbage collection that will increase the chance of queries being
 cancelled in various circumstances

Even HOT-prune may lead to frequent query cancellations, and unlike
VACUUM there is no way the user can control the frequency of prune
operations.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Archiving control (a part of synch rep patches)

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 7:26 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, 2008-12-24 at 19:04 +0900, Fujii Masao wrote:
 Hi,

 On Wed, Dec 24, 2008 at 4:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
  (3)
   Revert the archive status file of the specified file from .done to .ready.
   This would try to archive the corresponding xlog file again. Only the
   archive status of a skipped xlog file can be reverted.
 
  We resume xlog archiving by using (2) and (3), when replication
  ends.

  Umm.. reverting is very tricky, and it might become a hotbed of bugs.
  So I will reconsider the method of resuming archiving.

 Yeh, don't do that.

 No worries if we send a few files too many. Tears if we miss just one.

 * Turn it off for the *next* file to be archived. Keep track of last
 file sent.

OK

 * Turn it on again, mark *all* WAL files as .ready in sequence low to
 high, as long as they are higher than the last file sent.

You mean as follows?

- we don't create .ready during replication.
- when replication ends, we create .ready for the xlog files higher
  than the last one sent, and forbid subsequent checkpoints from
  creating .ready for lower xlog files.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, OK. So I think it would only work when full_page_writes = on, and
 has been on since last checkpoint. So two changes:

 * We just need a boolean that starts at true every checkpoint and gets
 set to false anytime someone resets full_page_writes or archive_command.
 If the flag is set && full_page_writes = on then we skip the checkpoint
 entirely and use the value from the last checkpoint.

Sounds good.

Will pg_start_backup on the standby (which you are probably planning?)
also need this logic? If so, resetting full_page_writes or
archive_command should generate a corresponding xlog record.

I have another thought: should we forbid resetting archive_command
during an online backup? Currently we can do it. If we forbid it, we
also don't need to track the reset of archiving for fast pg_start_backup.


 * My infra patch also had a modified version of pg_start_backup() that
 allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems
 a waste of time, and I want to listen to everybody else now and change
 pg_start_backup() so it throws an IMMEDIATE CHECKPOINT and leave it
 there.

 Can you work on those also?

Umm.. I'm busy. Of course, I will try it if no one else raises a hand.
But I'd like to put coding the core of sync rep ahead of this.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] [idea] a copied relkind in pg_attribute

2008-12-24 Thread KaiGai Kohei
Here is an idea to improve the implementation of SE-PostgreSQL:
I need a copied relkind in pg_attribute to help its decision making.

When we access pg_attribute via ALTER TABLE or a DML statement
directly, SE-PostgreSQL checks privileges on the fetched tuples.
If the relation pointed to by attrelid has RELKIND_RELATION, the
fetched tuple is a column, so the db_column object class should be
applied. Otherwise, the db_tuple object class should be applied,
because the tuple is not a column but, for example, an attribute of a
composite type.

The current implementation needs to look up the RELOID syscache to
identify the relkind of the relation, because pg_attribute does not
carry any information about the relkind. I don't think this is an
ideal implementation, even if the lookup frequency is small enough.

So, I plan to add a new member named attkind to pg_attribute that
holds a copy of the owning relation's relkind. It allows identifying
the attribute without a syscache lookup, so it is better than the
current approach. pg_class.relkind never changes, so keeping a copy
does not introduce any problems.

Please comment if you have any alternative ideas or opinions.
If a reviewer raises this point, it can be fixed as described above.
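The decision the attkind column would enable could be sketched like this (a hypothetical illustration, not actual SE-PostgreSQL code):

```c
/* With a copied relkind stored in each pg_attribute row as "attkind",
 * choosing the SELinux object class needs no RELOID syscache lookup. */
#define RELKIND_RELATION 'r'   /* same letter PostgreSQL uses for tables */

typedef enum { DB_COLUMN, DB_TUPLE } ObjectClass;

/* If the owning relation is a plain table, the pg_attribute tuple is a
 * real column (db_column); otherwise (e.g. an attribute of a composite
 * type) it is treated as an ordinary tuple (db_tuple). */
ObjectClass classify_attribute(char attkind) {
    return (attkind == RELKIND_RELATION) ? DB_COLUMN : DB_TUPLE;
}
```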

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote:
 On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
 
  Greg and Heikki have highlighted in this thread some aspects of btree
  garbage collection that will increase the chance of queries being
  cancelled in various circumstances
 
 Even HOT-prune may lead to frequent query cancellations, and unlike
 VACUUM there is no way the user can control the frequency of prune
 operations.

The patch does go to some trouble to handle that case, as I'm sure
you've seen. Are you saying that part of the patch is ineffective and
should be removed, or?

Should/could there be a way to control frequency of prune operations? We
could maintain cleanupxmin as a constant minimum distance from xmax, for
example.

Are we saying we should take further measures, as I asked upthread? If
it is a consensus that I take some action, then I will.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs si...@2ndquadrant.com wrote:



 The patch does go to some trouble to handle that case, as I'm sure
 you've seen. Are you saying that part of the patch is ineffective and
 should be removed, or?


Umm.. are you talking about the wait mechanism? That's the only thing
I remember. Otherwise, a prune record is pretty much the same as any
vacuum cleanup record.

 Should/could there be a way to control frequency of prune operations? We
 could maintain cleanupxmin as a constant minimum distance from xmax, for
 example.


Well, there can be. But tuning any such thing might be difficult and
would have implications on the primary. I am not saying we can't do
it, but we would need additional tests to see its impact.

 Are we saying we should take further measures, as I asked upthread? If
 it is a consensus that I take some action, then I will.


Again, I haven't seen how frequently queries may get cancelled, or, if
the delay is set to a large value, how far behind the standby may get
during replication, so I can't really comment. Have you done any tests
on reasonable hardware to check whether moderately long read queries
can be run on the standby without the standby lagging behind a lot?

I would prefer a solution smarter than cancelling all queries as soon
as a cleanup record arrives and the timeout occurs. For example, if the
queries are running on a completely different set of tables than the
one receiving the updates/deletes, there is no reason why those queries
should be cancelled. I think it will be very common to have large
history tables which receive long read-only queries but no
updates/deletes, alongside frequently updated tables which receive
very few queries on the standby.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
 On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
 
 
 
  The patch does go to some trouble to handle that case, as I'm sure
  you've seen. Are you saying that part of the patch is ineffective and
  should be removed, or?
 
 
 Umm.. are you talking about the wait mechanism? That's the only thing
 I remember. Otherwise, a prune record is pretty much the same as any
 vacuum cleanup record.

With respect, I was hoping you might look in the patch and see if you
agree with the way it is handled. No need to remember. The whole
latestRemovedXid concept is designed to help.

  Should/could there be a way to control frequency of prune operations? We
  could maintain cleanupxmin as a constant minimum distance from xmax, for
  example.
 
 
 Well, there can be. But tuning any such thing might be difficult and
 would have implications on the primary. I am not saying we can do
 that, but we will need additional tests to see its impact.
 
  Are we saying we should take further measures, as I asked upthread? If
  it is a consensus that I take some action, then I will.
 
 
 Again, I haven't seen how frequently queries may get cancelled, or, if
 the delay is set to a large value, how far behind the standby may get
 during replication, so I can't really comment. Have you done any tests
 on reasonable hardware to check whether moderately long read queries
 can be run on the standby without the standby lagging behind a lot?

Queries get cancelled if data they need to see is removed and
max_standby_delay expires. So the lag will be max_standby_delay, by
definition.

Not sure what further tests would show. Queries that run for longer
than max_standby_delay plus the mean time between cleanup records will
currently end up being cancelled.

 I would prefer a solution smarter than cancelling all queries as soon
 as a cleanup record arrives and the timeout occurs. 

Currently, it was the consensus view that queries should be cancelled,
though there are other options still on the table.

It's discussed in the Design Notes on the wiki. Simply ignoring WAL
removal has been discussed and rejected (so far):
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also
been discussed and rejected (so far):
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

 For
 example, if the queries are running on a completely different set of
 tables than the one receiving the updates/deletes, there is no reason
 why those queries should be cancelled. I think it will be very common
 to have large history tables which receive long read-only queries but
 no updates/deletes, alongside frequently updated tables which receive
 very few queries on the standby.

There is currently no way to tell which tables a query will touch during
the course of its execution. Nor is there likely to be because of
user-defined volatile functions.

I attempted to find ways to explicitly limit the set of tables over
which a query might venture, but that came to nothing also.

We've also discussed storing lastCleanedLSN for each buffer, so queries
can cancel themselves if they need to read a buffer that has had data
removed from it that they would have needed to see. I'll write that up
also.
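A rough sketch of the per-buffer lastCleanedLSN idea mentioned above (Python, purely illustrative; the names, types, and the idea of raising an exception are assumptions of this sketch, not the eventual implementation):

```python
class QueryCancelled(Exception):
    """Raised when a standby query must be cancelled due to a conflict."""

class Buffer:
    def __init__(self):
        self.last_cleaned_lsn = 0   # LSN of the newest cleanup applied to this page

class StandbyQuery:
    def __init__(self, start_lsn):
        self.start_lsn = start_lsn  # WAL replay position when the snapshot was taken

def read_buffer(query, buf):
    # Cancel only if a cleanup newer than the query's snapshot touched
    # this particular page; queries that never visit cleaned pages keep
    # running even after cleanup records have been replayed.
    if buf.last_cleaned_lsn > query.start_lsn:
        raise QueryCancelled("page lost rows this snapshot needed to see")
    return "ok"
```

The attraction is that cancellation becomes lazy and targeted: replay never blocks, and a query only dies when it actually touches a page that lost rows it would have needed.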

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Hi,

 On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, OK. So I think it would only work when full_page_writes = on, and
 has been on since last checkpoint. So two changes:

 * We just need a boolean that starts at true every checkpoint and gets
 set to false anytime someone resets full_page_writes or archive_command.
 If the flag is set && full_page_writes = on then we skip the checkpoint
 entirely and use the value from the last checkpoint.

 Sounds good.

I've attached a self-contained patch to skip the checkpoint at pg_start_backup.
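The boolean-flag scheme quoted above can be sketched roughly like this (Python pseudocode with invented names; it is not the attached patch, just the decision rule it implements):

```python
class BackupState:
    """Toy model of the skip-checkpoint rule: a flag set true at every
    checkpoint and cleared whenever full_page_writes is reset."""

    def __init__(self):
        self.full_page_writes = True
        self.ckpt_usable = True      # has the flag held since the last checkpoint?
        self.last_ckpt_redo = 0

    def checkpoint(self, redo_lsn):
        self.last_ckpt_redo = redo_lsn
        self.ckpt_usable = True      # starts at true again every checkpoint

    def set_full_page_writes(self, newval):
        if self.full_page_writes != newval:
            self.ckpt_usable = False # last checkpoint no longer trustworthy
        self.full_page_writes = newval

    def pg_start_backup(self, current_lsn):
        # Skip the (potentially slow) checkpoint when it is provably safe:
        # full_page_writes is on and has been on since the last checkpoint.
        if self.ckpt_usable and self.full_page_writes:
            return self.last_ckpt_redo   # reuse the previous checkpoint
        self.checkpoint(current_lsn)     # otherwise force a fresh one
        return self.last_ckpt_redo
```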


 Does pg_start_backup on the standby (which you are probably planning?)
 also need this logic? If so, resetting full_page_writes or archive_command
 should generate its own xlog record.

For now, the patch doesn't handle this.


 I have another thought: should we forbid the reset of archive_command
 during online backup? Currently we can do. If we don't need to do so,
 we also don't need to track the reset of archiving for fast pg_start_backup.

For now, it doesn't handle this either.

Happy Holidays!

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 14:57:27 -
***
*** 295,300 
--- 295,302 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID  epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***
*** 6025,6036 
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(xlogctl->info_lck);
--- 6027,6043 
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(xlogctl->info_lck);
+ 		xlogctl->bkpCount = 0;
+ 		xlogctl->bkpForceCkpt = true;
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(xlogctl->info_lck);
***
*** 6502,6507 
--- 6509,6535 
  	}
  }
  
+ bool
+ assign_full_page_writes(bool newval, bool doit, GucSource source)
+ {
+ 	/*
+ 	 * If full_page_writes is reset, since all indispensable full pages
+ 	 * might not be written since last checkpoint, we force a checkpoint
+ 	 * at pg_start_backup.
+ 	 */
+ 	if (doit && fullPageWrites != newval)
+ 	{
+ 		/* use volatile 

Re: [HACKERS] Synchronous replication, reading WAL for sending

2008-12-24 Thread Mark Mielke

Fujii Masao wrote:

- WALSender reads from WAL buffers and/or WAL files and sends the
buffers to WALReceiver. In phase one, we may assume that WALSender can
only read from WAL buffers and WAL files in pg_xlog directory. Later
on, this can be improved so that WALSender can temporarily restore
archived files and read from that too.


You mean that only walsender performs xlog streaming and copying
from pg_xlog serially? I think that this would degrade the performance.
And, I'm worried about the situation that the speed to generate xlog
on the primary is higher than that to copy them to the standby. We
might not be able to start xlog streaming forever.
  


I've seen a few references to this. Somebody else mentioned how a single 
TCP/IP stream might not have the bandwidth to match changes to the database.


TCP/IP streams do have a window size that adjusts with the load, and
unless one gets into aggressive networking such as BitTorrent, which
arguably reduces performance of the entire network, why shouldn't one
TCP/IP stream be enough? And if one TCP/IP stream isn't enough, doesn't
this point to much larger problems that won't be solved by streaming it
some other way over the network? As in, it doesn't matter what you do -
your network pipe isn't big enough?


Over the Internet from my house to a co-located box, I can reliably get 
1.1+ Mbyte/s using a single TCP/IP connection.  The network connection 
at the co-lo is 10Mbit/s and my Internet connection to my house is also 
10Mbit/s. One TCP/IP connection seems perfectly capable of streaming
data to the full potential of the network...


Also, I assume that most database loads have peaks and lows. Especially
for very large updates, perhaps end-of-day processing, I see it as a
guarantee that all of the standbys will fall further behind for a period
(a few seconds to a minute?), but they will catch up shortly after the
peak is over.
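To put rough numbers on the peak-and-catch-up claim (all figures are illustrative assumptions except the ~1.1 MB/s single-stream throughput measured above):

```python
# Back-of-envelope: how far does a standby fall behind during a WAL burst,
# and how long does it take to catch up once the burst ends?
link_mb_s = 1.1          # sustained throughput of one TCP stream (measured above)
peak_wal_mb_s = 5.0      # hypothetical end-of-day WAL generation burst
peak_secs = 60           # how long the burst lasts
normal_wal_mb_s = 0.3    # hypothetical WAL rate outside the burst

backlog_mb = (peak_wal_mb_s - link_mb_s) * peak_secs        # lag built up during the burst
catch_up_secs = backlog_mb / (link_mb_s - normal_wal_mb_s)  # drain time afterwards
print(round(backlog_mb), "MB behind; caught up in about", round(catch_up_secs), "s")
```

With these made-up numbers the standby ends the burst a couple of hundred megabytes behind and drains the backlog within a few minutes, consistent with the "catch up shortly after the peak" expectation; of course, if average WAL generation exceeds link throughput, the backlog never drains.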


Cheers,
mark

--
Mark Mielke m...@mielke.cc


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Simon Riggs

On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote:
 Hi,
 
 On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
  Hi,
 
  On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Yes, OK. So I think it would only work when full_page_writes = on, and
  has been on since last checkpoint. So two changes:
 
  * We just need a boolean that starts at true every checkpoint and gets
  set to false anytime someone resets full_page_writes or archive_command.
  If the flag is set && full_page_writes = on then we skip the checkpoint
  entirely and use the value from the last checkpoint.
 
  Sounds good.
 
 I attached the self-contained patch to skip checkpoint at pg_start_backup.

Good.

Can we change to IMMEDIATE when we need the checkpoint?

What is bkpCount for? I think we should discuss whatever that is for
separately. It isn't used in any if test, AFAICS.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep: Second thoughts

2008-12-24 Thread Emmanuel Cecchet

Hi Markus,


I didn't have much reliability issues with ensemble, appia or spread, so
far. Although, I admit I didn't ever run any of these in production.
Performance is certainly an issue, yes.
  
I may suggest another reading; even though it is a bit dated, most of the
results still apply:
http://jmob.objectweb.org/jgroups/JGroups-middleware-2004.pdf
The baseline is that if you use UDP multicast, you need a dedicated
switch, and the tuning is a nightmare. I discussed these issues with the
developers of Spread and they have no real magic. TCP seems a more
reliable alternative (especially for predictable performance), but the TCP
timeouts are also tricky to tune depending on the platform. We worked
quite a bit with Nuno around Appia in the context of Sequoia, and
performance can be outstanding when properly tuned, or absolutely awful
if some default values are wrong. The chaotic behavior of GCS under
stress quickly compromises the reliability of the replication system,
and admission control on UDP multicast has no good solution so far.
It's just a heads up on what is awaiting you in production when the 
system is stressed. There is no good solution so far besides a good 
admission control on top of the GCS (in the application).


I am now off for the holidays.

Cheers,
Emmanuel

--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Fwd: Re: [HACKERS] Transactions and temp tables]

2008-12-24 Thread Emmanuel Cecchet

Hi Heikki,

The point of using temp tables was performance. Using regular tables in 
our case would hurt performance too much. Well if we cannot get a 
temporary fix in 8.4, we will maintain a separate patch to get that 
functionality just for temp tables that are created and dropped in the 
same transaction. Hopefully we will be able to come up with a working 
solution in 8.5.


Thanks for your help,
Emmanuel


Emmanuel Cecchet wrote:
  

I just saw that this new patch was not considered because the previous
version ended being rejected.
Note that this version of the patch aims at supporting ONLY temp tables
that are created AND dropped in the same transaction. We need to be able
to use temp tables in transactions that are doing 2PC, but the temp
table lifespan does not need to cross transaction boundaries.

Please let me know if this patch could be integrated in 8.4.



IMHO, this is just getting too kludgey. We came up with pretty good
ideas on how to handle temp tables properly, by treating them the same
as non-temp tables. That should eliminate all the problems the latest
patch had, and also the issues with sequences, and allow all access to
temp tables, not just a limited subset. I don't think it's worthwhile to
apply the kludge as a stopgap measure; let's do it properly in 8.5.

As a workaround, you can use a regular table instead of a temporary one.
If you create and drop the regular table in the same transaction (that's
the same limitation that latest patch has), you won't end up with a
bogus table in your database if the connection is dropped unexpectedly.
If your application uses multiple connections simultaneously, you'll
need a little bit of code in the application so that you don't try to
create a table with the same name in all backends. You could also create
a different schema for each connection, and do set
search_path='semitempschemaX, public', so that you can use the same
table name and still have separate tables for each connection.
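The per-connection-schema workaround can be scripted along these lines. This is only a sketch: the schema and table names are placeholders, and the application would run these statements over its own database connection:

```python
def session_setup_sql(conn_id):
    """Generate the per-connection statements for the workaround described
    above. "semitempschemaN" and "scratch" are illustrative names."""
    schema = "semitempschema%d" % conn_id
    return [
        "CREATE SCHEMA %s;" % schema,
        # Unqualified table names now resolve to this connection's schema.
        "SET search_path = %s, public;" % schema,
        "BEGIN;",
        # Create and drop within one transaction, per the workaround, so a
        # dropped connection cannot leave a bogus table behind.
        "CREATE TABLE scratch (id int, payload text);",
        "DROP TABLE scratch;",
        "COMMIT;",
    ]
```

Each connection gets its own schema, so every backend can use the same unqualified table name without collisions.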

(sorry for the late reply)

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
  



--
Emmanuel Cecchet
Aster Data Systems
Web: http://www.asterdata.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Simon Riggs

On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote:

 I think the uncertainty comes from people's experience with typical
 replication use cases vs a lack of experience with this current
 implementation.

Quite possibly.

Publishing user feedback on this will be very important in making this a
usable feature.

I'd be very happy if you were to direct the search for optimal
usability.

 One such example is that it is pretty common to use read-only slaves to do
 horizontal scaling of read queries across a bunch of slaves. This is not the
 scenario of running reporting queries on a second machine to lower load; you
 would be running a large number of read-only, relatively short, oltp-ish
 queries (think pg_bench's select-only test, I suppose), but you also have a
 fairly regular stream of inserts/updates going on with these same tables; it's
 just that you have a 95/5 split of read/write (or similar).

One thing to consider also is latency of information. Sending queries to
master or slave may return different answers if querying very recent
data.

 This is standard practice in things like mysql or using slony or what have 
 you. I suspect it's one of the first things people are going to want to do 
 with hot standby. But it's unclear how well this will work because we don't 
 have any experience with it yet, coupled with the two downsides being 
 mentioned as canceled queries and replay lag, which happen to be probably the 
 two worst downsides you would have in the above scenario. :-)
 
 Hmm I'm not sure why I didn't think of running this test before, but  
 read/write pg_bench on a master with pg_bench select test on slave isn't that 
 bad of a scenario to match the above; it might be a little too much activity 
 on the master, but has anyone else run such a test? 
 
-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [idea] a copied relkind in pg_attribute

2008-12-24 Thread Jaime Casanova
On Wed, Dec 24, 2008 at 6:50 AM, KaiGai Kohei kai...@kaigai.gr.jp wrote:

 The current implementation needs to look up the RELOID system cache to
 identify the relkind of the relation, because pg_attribute does
 not have any information about relkind. However, I also think
 it is not an ideal implementation, even if its frequency is small
 enough.


but you still can do it, right?
are there any other advantages, like something you can't do now?
if not, i think you need to prove that there will be a benefit in
speed and that it is significant; otherwise this seems like premature
optimization, and that AFAIR is the root of all evil :)

-- 
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] generic reloptions improvement

2008-12-24 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Peter Eisentraut wrote:
 I'm not sure how important this is, but if you are enumerating the access 
 methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined 
 access methods?

 It is important.

 I'm intending to have a new routine which would reserve a value at
 runtime.  This value would later be passed by the AM to create new
 options on the table.

What do you mean by at runtime?  Surely the value would have to remain
stable across database restarts, since it's going to relate to stuff
that is in catalog entries.

I'd feel more comfortable with a scheme that used the AM's pg_am OID.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window-functions patch handling of aggregates

2008-12-24 Thread Tom Lane
Gregory Stark st...@enterprisedb.com writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 Unless we want to move the goalposts on what an aggregate is allowed
 to do internally, we're going to have to change this to re-aggregate
 repeatedly.  Neither prospect is appetizing in the least.

 Does it currently copy the state datum before calling the final function?
 Would that help?

No.  The entire point of what we have now formalized as aggregates with
internal-type transition values is that the transition value isn't
necessarily a single palloc chunk.  For stuff like array_agg, the
performance costs of requiring that are enormous.

On looking at what array_agg does, it seems the issue there is that
the final-function actually deletes the working state when it thinks
it's done with it (see construct_md_array).   It would probably be
possible to fix things so that it doesn't do that when it's called by
a WindowAgg instead of a regular Agg.  What I'm more concerned about
is what third-party code will break.  contrib/intagg has done this
type of thing since forever, and I'm sure people have copied that...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] uuids on freebsd

2008-12-24 Thread David Lee Lambert
On Dec 17, 2:30 pm, Andrew Gierth and...@tao11.riddles.org.uk wrote:
 Has anyone ever managed to get uuid generation working on FreeBSD? [...]

 ([...] The only solution I could come up with was to knock
 off a quick uuid-freebsd module that uses the base system uuid
 functions rather than the ossp ones. I could put this on pgfoundry if
 there isn't likely to be a real fix in the near future.)

+1 for putting it on pgFoundry.

I managed to hack the old contrib/uuid so it would use the functions
from Linux e2fs uuid on 8.0/8.1/8.2.  I might have time soon to see if
I can get that to work on 8.3.

--
DLL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep: First Thoughts on Code

2008-12-24 Thread Fujii Masao
Hi,

I fixed some bugs.

On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Can we change to IMMEDIATE when it we need the checkpoint?

Perhaps yes, though the current patch doesn't care about it.
I'm not sure if we really need the feature. Yes, as you say,
I'd also like to hear from everybody else.


 What is bkpCount for?

So far, the name of a backup history file consists only of the
checkpoint redo location. But with this patch, since several
backups can use the same checkpoint, a backup history file
could unfortunately be overwritten. So, I introduced
bkpCount as an ID for backups which use the same checkpoint.
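The naming scheme might look roughly like this (a sketch of the idea only; the extra bkpCount field is the hypothetical disambiguator, not PostgreSQL's actual history-file format, which is defined in xlog.c):

```python
def history_file_name(tli, log_id, seg, offset, bkp_count):
    """Name a backup history file. The stock name is built only from the
    checkpoint redo location (timeline, log, segment, offset); the final
    %08X field is the proposed bkpCount suffix that keeps backups sharing
    one checkpoint from overwriting each other's history files."""
    return "%08X%08X%08X.%08X.%08X.backup" % (tli, log_id, seg, offset, bkp_count)
```

Two backups taken against the same (skipped) checkpoint then get distinct file names instead of clobbering one another.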

 I think we should discuss whatever that is for
 separately. It isn't used in any if test, AFAICS.

Yes, this patch is a testbed. We need to discuss it more.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 18:13:45 -
***
*** 295,300 
--- 295,302 
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID  epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***
*** 318,323 
--- 320,332 
  static XLogCtlData *XLogCtl = NULL;
  
  /*
+  * We don't allow more than MAX_BKP_COUNT backups to use the same checkpoint.
+  * If XLogCtl->bkpCount > MAX_BKP_COUNT, we force a new checkpoint at
+  * pg_start_backup even if all indispensable full pages exist since the
+  * last checkpoint.
+  */
+ #define MAX_BKP_COUNT 256
+ 
+ /*
   * We maintain an image of pg_control in shared memory.
   */
  static ControlFileData *ControlFile = NULL;
***
*** 6025,6036 
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		SpinLockAcquire(xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(xlogctl->info_lck);
--- 6034,6050 
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
! 	/* 
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use 

Re: [HACKERS] Hot standby and b-tree killed items

2008-12-24 Thread Pavan Deolasee
On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:



 With respect, I was hoping you might look in the patch and see if you
 agree with the way it is handled. No need to remember. The whole
 latestRemovedXid concept is designed to do help.


Well, that's common for all cleanup records including vacuum. But
reading your comment, it seemed as if there is something special to
handle in the HOT prune case which I did not see. Anyway, the trouble with
HOT prune is that tuples may be cleaned up very frequently and that can
lead to query cancellation at the standby. That's what I wanted to
emphasize.



  Queries get cancelled if data they need to see is removed and the
  max_standby_delay expires. So lag will be max_standby_delay, by
  definition.

That's per cleanup record, isn't it ?


 We've also discussed storing lastCleanedLSN for each buffer, so queries
 can cancel themselves if they need to read a buffer that has had data
 removed from it that they would have needed to see. I'll write that up
 also.


What if we do that at the table level? So if a query touches a table
which had cleanup activity since the query started, it cancels
itself automatically.

Happy X'mas to all of you!

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] incoherent view of serializable transactions

2008-12-24 Thread Kevin Grittner
 Emmanuel Cecchet m...@frogthinker.org 12/22/08 11:42 PM 
 If you want to know how to build SERIALIZABLE with a database that 
 provides SI (Snapshot Isolation), read 
 http://portal.acm.org/citation.cfm?doid=1376616.137669
 
The link didn't seem to work for me, but I think I found the article
you meant:  Serializable Isolation for Snapshot Databases
by Michael J. Cahill, et al

An interesting work.  If nothing else, it will help flesh out the
documentation of anomalies.  If the PostgreSQL community ever
does want to better approach true serializable behavior, this
should provide a good theoretical basis.
 
Thanks very much for the cite.
 
 I still don't 
 get why people would use SERIALIZABLE since there is no efficient 
 implementation of it.
 
Read the last paragraph of section 1 (Introduction)
of the article you just cited.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] merry christmas folks

2008-12-24 Thread Grzegorz Jaskiewicz

WITH RECURSIVE tree(b, l, lv) AS
 (
  (
   WITH RECURSIVE t(b, l) AS
(
  select b/11.6, 0 AS l from (select  
generate_series(0,30)::float8 b union all select generate_series(30,0,  
-1)::float8 b) ziew

UNION ALL
  select (b*1.06), l+1 FROM t WHERE l < 3
)
select b, l, 1 AS lv from t order by l asc
  )
  UNION ALL
   select b*1.1, l+(lv*62) AS l , lv+1 FROM tree WHERE lv < 4
 )
select array_to_string(array_agg(SUBSTRING('  ',  
LEAST(GREATEST(b::integer,1),6), 1)),'') from tree group by l order by  
l asc;



;]



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DTrace probes patch

2008-12-24 Thread Bruce Momjian

Thanks, applied.

---

Robert Lor wrote:
 Tom Lane wrote:
  Robert Lor robert@sun.com writes:

  Tom Lane wrote:
  
  I agree.  If the probe is meant to track only *some* WAL writes
  then it needs to be named something less generic than
  TRACE_POSTGRESQL_WAL_BUFFER_WRITE.
 

  How about change it to TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY similar to 
  TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY for shared buffers?
  
 
  Works for me...
 
  
 Attached is the patch for the above name change.
 
 
 -Robert

 Index: src/backend/access/transam/xlog.c
 ===
 RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
 retrieving revision 1.324
 diff -u -3 -p -r1.324 xlog.c
 --- src/backend/access/transam/xlog.c 17 Dec 2008 01:39:03 -  1.324
 +++ src/backend/access/transam/xlog.c 22 Dec 2008 16:28:00 -
 @@ -1318,14 +1318,14 @@ AdvanceXLInsertBuffer(bool new_segment)
* Have to write buffers while holding insert 
 lock. This is
* not good, so only write as much as we 
 absolutely must.
*/
 - TRACE_POSTGRESQL_WAL_BUFFER_WRITE_START();
 + TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
   WriteRqst.Write = OldPageRqstPtr;
   WriteRqst.Flush.xlogid = 0;
   WriteRqst.Flush.xrecoff = 0;
   XLogWrite(WriteRqst, false, false);
   LWLockRelease(WALWriteLock);
   Insert->LogwrtResult = LogwrtResult;
 - TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DONE();
 + TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
   }
   }
   }
 Index: src/backend/utils/probes.d
 ===
 RCS file: /projects/cvsroot/pgsql/src/backend/utils/probes.d,v
 retrieving revision 1.4
 diff -u -3 -p -r1.4 probes.d
 --- src/backend/utils/probes.d17 Dec 2008 01:39:04 -  1.4
 +++ src/backend/utils/probes.d22 Dec 2008 16:28:01 -
 @@ -89,6 +89,6 @@ provider postgresql {
  
   probe xlog__insert(unsigned char, unsigned char);
   probe xlog__switch();
 - probe wal__buffer__write__start();
 - probe wal__buffer__write__done();
 + probe wal__buffer__write__dirty__start();
 + probe wal__buffer__write__dirty__done();
  };

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] incoherent view of serializable transactions

2008-12-24 Thread Emmanuel Cecchet

Hi Kevin,


The link didn't seem to work for me, but I think I found the article
you meant:  Serializable Isolation for Snapshot Databases
by Michael J. Cahill, et al

An interesting work.  If nothing else, it will help flesh out the
documentation of anomalies.  If the PostgreSQL community ever
does want to better approach true serializable behavior, this
should provide a good theoretical basis.
  

Sorry for the broken link. Yes this is the paper.
Note that the paper was not necessarily enthusiastically received by the
community when presented at the conference. While this is an interesting
academic paper, its practicality left a majority of the audience
perplexed. There was an interesting comment by Oracle folks: Oracle does
not provide serializability but only snapshot isolation, and still users
prefer to 'downgrade' to read committed for better performance. The
Oracle guys' experience seemed to indicate that there was no need for
serializability (well, that's also less work for them ;-)) in their
customer base.
Having a foot in both academia and industry, I understand the
intellectual interest in serializability on the academic side, but I
still have not seen a single use case in industry where this was a
requirement (but my db experience is probably narrow).


Have nice serializable holidays ;-)
manu

--
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting

--
Web: http://www.frogthinker.org
email: m...@frogthinker.org
Skype: emmanuel_cecchet


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallel restore vs. windows

2008-12-24 Thread Jaime Casanova
On Sun, Dec 14, 2008 at 12:13 PM, Andrew Dunstan and...@dunslane.net wrote:

 This version completes properly on Windows with the regression database.


actually, this one doesn't apply cleanly on head.

-- 
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157



Re: [HACKERS] parallel restore vs. windows

2008-12-24 Thread Andrew Dunstan



Jaime Casanova wrote:

On Sun, Dec 14, 2008 at 12:13 PM, Andrew Dunstan and...@dunslane.net wrote:
  

This version completes properly on Windows with the regression database.




actually, this one doesn't apply cleanly on head
  


I will have a new patch a day or two after Christmas, which I hope will 
be very close to being fully done.


cheers

andrew



Re: [HACKERS] incoherent view of serializable transactions

2008-12-24 Thread Ron Mayer

Robert Haas wrote:

... serializable transaction ...

If we were to construct a database that had one giant lock for the
entire database so that only a single query could execute at one time,
transactions would be serializable (because they'd in fact be
serialized).  However, performance would suck.  


I wonder if this giant-lock-for-isolation-level-serializable
is a mode postgres should support.  ISTM it would meet the
letter of the spec, and at least some of the people using
transaction isolation level serializable are doing so precisely
because they *want* the database to deal with all possible
serialization issues, and are willing to accept the performance penalty.
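The giant-lock idea can be sketched in a few lines (a hypothetical Python illustration, not a proposal for the backend): with a single database-wide lock, transactions execute one at a time, so every history is trivially serializable; the cost is that concurrency drops to zero.

```python
import threading

giant_lock = threading.Lock()   # hypothetical database-wide lock
counter = {"n": 0}

def transaction():
    # Under the giant lock, this read-modify-write is trivially
    # serializable: only one "transaction" runs at a time, so the
    # history is literally serial.
    with giant_lock:
        v = counter["n"]
        counter["n"] = v + 1

threads = [threading.Thread(target=transaction) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["n"])  # 100: the same result as running them one by one
```

Without the lock, the interleaved read-modify-write could lose updates; with it, correctness is guaranteed but so is the "performance would suck" part of the argument.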




Re: [HACKERS] incoherent view of serializable transactions

2008-12-24 Thread Robert Haas
On Dec 24, 2008, at 6:46 PM, Ron Mayer rm...@cheapcomplexdevices.com  
wrote:



Robert Haas wrote:

... serializable transaction ...

If we were to construct a database that had one giant lock for the
entire database so that only a single query could execute at one  
time,

transactions would be serializable (because they'd in fact be
serialized).  However, performance would suck.


I wonder if this giant-lock-for-isolation-level-serializable
is a mode postgres should support.  ISTM it would meet the
letter of the spec, and at least some of the people using
transaction isolation level serializable are doing so precisely
because they *want* the database to deal with all possible
serialization issues, and are willing to accept the performance penalty.


No. :-)

Cheers,

...Robert



Re: [HACKERS] [idea] a copied relkind in pg_attribute

2008-12-24 Thread KaiGai Kohei

Jaime Casanova wrote:

On Wed, Dec 24, 2008 at 6:50 AM, KaiGai Kohei kai...@kaigai.gr.jp wrote:

The current implementation need to lookup RELOID system cache to
identify the relkind of the relation, because pg_attribtue does
not have any information about relkind. However, I also think
it is not an ideal implementation, even if its frequency is enough
small.



but you still can do it, right?
there are any other advantages, like something you can't do now?


Yes, we can still see relkind via the system cache.
When we fetch a tuple from pg_attribute, we need to look up the
RELOID system cache to get the relkind of its relation.

The reason I consider the change preferable is its advantages
in performance and in the robustness of the security design.

The first one is obvious. If we have an attkind member, it is not
necessary to look up the RELOID system cache for each tuple.

The other needs an explanation. A database superuser is allowed
to update the system catalogs by hand, if the security policy
allows it. For example, he could update relkind in some pg_class
rows, even though that never happens via normal DDL.
If a relkind is changed from 'r' to 'c', we have to treat the
pg_attribute entries pointing at that tuple as db_tuple class, not
db_column class, because they are no longer columns.
It means we fundamentally have to check permissions on pg_attribute
whenever pg_class is updated, or pg_attribute should carry its own
identifying information. I think the latter approach is simpler.

Please consider another example. In a filesystem, the 'x' permission
bit has a different meaning for files and directories. If a directory
with no child files were suddenly handled as a regular file, it would
cause confusion. This is a similar situation.

So, I want to add the attribute a tuple needs to identify itself.
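The trade-off can be modeled in a few lines (a hypothetical Python sketch; pg_class and pg_attribute here are dict stand-ins for the catalogs, not PostgreSQL's actual API):

```python
# Hypothetical model of the two designs discussed above.
pg_class = {16384: {"relname": "t1", "relkind": "r"}}

# Current design: a pg_attribute row carries only attrelid, so
# classifying the column requires a lookup into pg_class (the
# RELOID syscache) for every tuple examined.
attr_current = {"attrelid": 16384, "attname": "a"}
kind_current = pg_class[attr_current["attrelid"]]["relkind"]

# Proposed design: a copied "attkind" makes the row self-identifying.
# No per-tuple lookup, and the row keeps its meaning even if relkind
# in pg_class is later changed out from under it by a superuser.
attr_proposed = {"attrelid": 16384, "attname": "a", "attkind": "r"}
kind_proposed = attr_proposed["attkind"]

print(kind_current, kind_proposed)  # r r
```

The cost of the denormalization, of course, is that general DDL must now keep the copied value in sync.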

Thanks,


if not, I think you need to prove that there will be a benefit in
speed and that it is significant; otherwise this seems like premature
optimization, and that AFAIR is the root of all evil :)


--
OSS Platform Development Division, NEC
KaiGai Kohei kai...@ak.jp.nec.com



Re: [HACKERS] Window-functions patch handling of aggregates

2008-12-24 Thread Hitoshi Harada
2008/12/25 Tom Lane t...@sss.pgh.pa.us:
 Gregory Stark st...@enterprisedb.com writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 Unless we want to move the goalposts on what an aggregate is allowed
 to do internally, we're going to have to change this to re-aggregate
 repeatedly.  Neither prospect is appetizing in the least.

 Does it currently copy the state datum before calling the final function?
 Would that help?

 No.  The entire point of what we have now formalized as aggregates with
 internal-type transition values is that the transition value isn't
 necessarily a single palloc chunk.  For stuff like array_agg, the
 performance costs of requiring that are enormous.

 On looking at what array_agg does, it seems the issue there is that
 the final-function actually deletes the working state when it thinks
 it's done with it (see construct_md_array).   It would probably be
 possible to fix things so that it doesn't do that when it's called by
 a WindowAgg instead of a regular Agg.  What I'm more concerned about
 is what third-party code will break.  contrib/intagg has done this
 type of thing since forever, and I'm sure people have copied that...

I raised this concern once before, during the first design of the window
functions. I don't know the details of every aggregate function, but at
least count(*) makes some assumptions about AggState in its context. So
I concluded that when window functions are introduced, it must be
announced that if you use an aggregate in a window context, you must be
sure it supports windows as well as plain aggregation. Currently
(<= 8.3), aggregates are assumed to be called within an AggState only,
but that assumption would now be broken. This is by design, and
announcing it would help third-party code decide whether to support
window functions (i.e. a function can stop with an error if it does not
support windows).
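The hazard Tom describes, a final-function that destroys its transition state and therefore cannot be called repeatedly, as window aggregation over moving frames would require, can be sketched like this (a hypothetical Python illustration, not the backend's C API):

```python
# Hypothetical sketch: an "aggregate" whose final function consumes
# its transition state, analogous to what array_agg's final function
# does via construct_md_array.
def transfn(state, value):
    # Transition function: fold one input value into the state.
    state.append(value)
    return state

def destructive_finalfn(state):
    # Final function that clobbers the working state when it thinks
    # it is done with it.
    result = list(state)
    state.clear()
    return result

state = []
for v in (1, 2, 3):
    state = transfn(state, v)

first = destructive_finalfn(state)   # fine under a plain Agg node:
                                     # the final function runs once
second = destructive_finalfn(state)  # wrong if a WindowAgg calls it
                                     # again on the same state
print(first, second)                 # [1, 2, 3] []
```

Under a plain Agg the final function runs exactly once per group, so destroying the state is harmless; a WindowAgg that re-finalizes the same accumulated state per row gets garbage on the second call, which is exactly why such aggregates would need to advertise window support.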

Regards,

-- 
Hitoshi Harada



[HACKERS] hot standby on mingw

2008-12-24 Thread Jaime Casanova
Hi,

I'm getting this error when trying to build hot standby (v5 from
2008-12-18, obtained from the wiki) on mingw:

make[4]: Entering directory `/home/Administrador/pgsql/src/backend/access/heap'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
-fwrapv -g -I../../../../src/include -I./src/include/port/win32
-DEXEC_BACKEND  -I../../../../src/include/port/win32 -DBUILDING_DLL
-c -o heapam.o heapam.c -MMD -MP -MF .deps/heapam.Po
In file included from heapam.c:51:
../../../../src/include/access/xact.h:109: error: syntax error before uint
../../../../src/include/access/xact.h:109: warning: no semicolon at
end of struct or union
../../../../src/include/access/xact.h:117: error: syntax error before '}' token
../../../../src/include/access/xact.h:117: warning: type defaults to
`int' in declaration of `xl_xact_commit'
../../../../src/include/access/xact.h:117: warning: data definition
has no type or storage class
../../../../src/include/access/xact.h:152: error: syntax error before uint
../../../../src/include/access/xact.h:152: warning: no semicolon at
end of struct or union
../../../../src/include/access/xact.h:158: error: syntax error before '}' token
../../../../src/include/access/xact.h:158: warning: type defaults to
`int' in declaration of `xl_xact_abort'
../../../../src/include/access/xact.h:158: warning: data definition
has no type or storage class
../../../../src/include/access/xact.h:173: error: syntax error before
xl_xact_commit
../../../../src/include/access/xact.h:173: warning: no semicolon at
end of struct or union
../../../../src/include/access/xact.h:175: warning: type defaults to
`int' in declaration of `xl_xact_commit_prepared'
../../../../src/include/access/xact.h:175: warning: data definition
has no type or storage class
../../../../src/include/access/xact.h:182: error: syntax error before
xl_xact_abort
../../../../src/include/access/xact.h:182: warning: no semicolon at
end of struct or union
../../../../src/include/access/xact.h:184: warning: type defaults to
`int' in declaration of `xl_xact_abort_prepared'
../../../../src/include/access/xact.h:184: warning: data definition
has no type or storage class
make[4]: *** [heapam.o] Error 1

I guess the problem is the uint type; when I change it to int, the
error vanishes.

-- 
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157
