Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog

2012-02-06 Thread Fujii Masao
On Tue, Feb 7, 2012 at 2:58 PM, Fujii Masao  wrote:
> Hi,
>
> When I compiled HEAD with --disable-integer-datetimes and tested
> pg_receivexlog, I encountered unexpected replication timeout. As
> far as I read the pg_receivexlog code, the cause of this problem is
> that pg_receivexlog handles the standby message timeout incorrectly
> in --disable-integer-datetimes. The attached patch fixes this problem.
> Comments?

receivelog.c
---
timeout.tv_sec = last_status + standby_message_timeout - now - 1;
if (timeout.tv_sec <= 0)
---

Umm... the above code also handles the timestamp incorrectly. ISTM that the
root cause of these problems is that receivelog.c uses TimestampTz. What
about changing receivelog.c so that it uses time_t instead of TimestampTz?
That would make the code simpler, I think.
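
For illustration, with time_t the computation quoted above could reduce to
something like this (an untested sketch, assuming last_status becomes a
time_t and standby_message_timeout stays in seconds):

    time_t      now = time(NULL);

    timeout.tv_sec = last_status + standby_message_timeout - now;
    if (timeout.tv_sec <= 0)
        timeout.tv_sec = 1;     /* avoid a non-positive select() timeout */
    timeout.tv_usec = 0;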

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for implementing SPI_gettypemod()

2012-02-06 Thread Chetan Suttraway
On Thu, Feb 2, 2012 at 8:11 PM, Robert Haas  wrote:

> On Wed, Feb 1, 2012 at 5:23 AM, Chetan Suttraway
>  wrote:
> > Hi All,
> >
> > This is regarding the TODO item :
> > "Add SPI_gettypmod() to return a field's typemod from a TupleDesc"
> >
> > The related message is:
> > http://archives.postgresql.org/pgsql-hackers/2005-11/msg00250.php
> >
> > This basically talks about having an SPI_gettypemod() which returns the
> > typmod of a field of a TupleDesc.
> >
> > Please refer the attached patch based on the suggested implementation.
>
> Please add this to the next CommitFest:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

At the given link, I am able to choose only "System administration" as the
commitfest topic.
I think there should also be a "Server Features" or "Miscellaneous" topic.
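
For reference, an implementation modeled on the existing SPI_gettypeid()
might look like this (a sketch only; the attached patch is authoritative):

    int
    SPI_gettypmod(TupleDesc tupdesc, int fnumber)
    {
        SPI_result = 0;

        if (fnumber > tupdesc->natts || fnumber == 0 ||
            fnumber <= FirstLowInvalidHeapAttributeNumber)
        {
            SPI_result = SPI_ERROR_NOATTRIBUTE;
            return -1;
        }

        if (fnumber > 0)
            return tupdesc->attrs[fnumber - 1]->atttypmod;
        else
            return -1;          /* system columns have no typmod */
    }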


Regards,
Chetan

-- 
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 Website: www.enterprisedb.com
EnterpriseDB Blog : http://blogs.enterprisedb.com
Follow us on Twitter : http://www.twitter.com/enterprisedb


[HACKERS] incorrect handling of the timeout in pg_receivexlog

2012-02-06 Thread Fujii Masao
Hi,

When I compiled HEAD with --disable-integer-datetimes and tested
pg_receivexlog, I encountered unexpected replication timeout. As
far as I read the pg_receivexlog code, the cause of this problem is
that pg_receivexlog handles the standby message timeout incorrectly
in --disable-integer-datetimes. The attached patch fixes this problem.
Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8ca3882..af4add0 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -62,7 +62,7 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir, char *namebu
 	f = open(fn, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (f == -1)
 	{
-		fprintf(stderr, _("%s: Could not open WAL segment %s: %s\n"),
+		fprintf(stderr, _("%s: could not open WAL segment %s: %s\n"),
 progname, fn, strerror(errno));
 		return -1;
 	}
@@ -190,6 +190,24 @@ localGetCurrentTimestamp(void)
 }
 
 /*
+ * Local version of TimestampDifferenceExceeds(), since we are not
+ * linked with backend code.
+ */
+static bool
+localTimestampDifferenceExceeds(TimestampTz start_time,
+		   TimestampTz stop_time,
+		   int msec)
+{
+	TimestampTz diff = stop_time - start_time;
+
+#ifdef HAVE_INT64_TIMESTAMP
+	return (diff >= msec * INT64CONST(1000));
+#else
+	return (diff * 1000.0 >= msec);
+#endif
+}
+
+/*
  * Receive a log stream starting at the specified position.
  *
  * If sysidentifier is specified, validate that both the system
@@ -301,7 +319,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi
 		 */
 		now = localGetCurrentTimestamp();
 		if (standby_message_timeout > 0 &&
-			last_status < now - standby_message_timeout * 100)
+			localTimestampDifferenceExceeds(last_status, now,
+			standby_message_timeout * 1000))
 		{
 			/* Time to send feedback! */
 			char		replybuf[sizeof(StandbyReplyMessage) + 1];

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-06 Thread Dan Scales
I don't know a lot about base backup, but it sounds like full_page_writes must 
be turned on for base backup, in order to deal with the inconsistent reads of 
pages (which you might call torn pages) that can happen when you back up the 
data files while the database is running.  The relevant parts of the WAL log 
are then copied separately (and consistently) once the backup of the data files 
is done, and used to "recover" the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of 
full_page_writes for base backup.  If double writes were in use, they might be 
automatically switched over to full page writes for the duration of the base 
backup.  And the double write file should not be part of the base backup.

Dan

- Original Message -
From: "Fujii Masao" 
To: "Dan Scales" 
Cc: "PG Hackers" 
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales  wrote:
> Let me know if you have any thoughts/comments, etc.  The patch is
> enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write cannot prevent torn pages in either the double-write
file or the data file during *base backup*, because both the double-write file
and the data file can be backed up while being written. Is this right? To avoid
the torn-page problem, should we write FPI to WAL during online backup even if
the double-write patch has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-06 Thread Andrew Dunstan



On 11/16/2011 10:38 PM, Tom Lane wrote:

I wrote:

PFC, a patch that does this.

Upon further review, this patch would need some more work even for the
RowExpr case, because there are several places that build RowExprs
without bothering to build a valid colnames list.  It's clearly soluble
if anyone cares to put in the work, but I'm not personally excited
enough to pursue it ...



The patch itself causes a core dump with the current regression tests. 
This test:


   SELECT array_to_json(array_agg(q), false)
     FROM (SELECT $$a$$ || x AS b, y AS c,
                  ARRAY[ROW(x.*, ARRAY[1,2,3]),
                        ROW(y.*, ARRAY[4,5,6])] AS z
             FROM generate_series(1,2) x,
                  generate_series(4,5) y) q;


causes a failure at line 967 of execTuples.c:

   Assert(list_length(exprList) == list_length(namesList));

I'm investigating further.

I've been looking at the other places that build RowExprs. Most of them 
look OK and none seem clearly in need of fixing at first glance. Which 
do you think need attention?


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote:
> On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
> > On 6 February 2012 21:19, Bruce Momjian  wrote:
> > > Peter Geoghegan obviously has done some serious work in improving
> > > sorting, and worked well with the community process.
> > 
> > Thank you for acknowledging that.
> > 
> > It's unfortunate that C does not support expressing these kinds of
> > ideas in a more natural way.
> 
> Yes, it is a problem, and a benefit.  We have avoided C++ because these
> types of trade-offs that we are discussing are often done invisibly, so
> we can't make the decision ourselves.

Let me add that while it is fine for languages like C++ to make these
decisions automatically for application code, developers of operating
systems and high-performance databases prefer to make such decisions
explicitly, which is what we are discussing now.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 10:49:10PM +, Peter Geoghegan wrote:
> On 6 February 2012 21:19, Bruce Momjian  wrote:
> > Peter Geoghegan obviously has done some serious work in improving
> > sorting, and worked well with the community process.
> 
> Thank you for acknowledging that.
> 
> It's unfortunate that C does not support expressing these kinds of
> ideas in a more natural way.

Yes, it is a problem, and a benefit.  We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] semi-PoC: kNN-gist for cubes

2012-02-06 Thread Jay Levitt
I have a rough proof-of-concept for getting nearest-neighbor searches 
working with cubes.  When I say "rough", I mean "I have no idea what I'm 
doing and I haven't written C for 15 years but I hear it got standardized 
please don't hurt me".  It seems to be about 400x faster for a 3D cube with 
1 million rows, more like 10-30x for a 6D cube with 10 million rows.


The patch adds operator <-> (which is just the existing cube_distance 
function) and support function 8, distance (which is just g_cube_distance, a 
wrapper around cube_distance).
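
In SQL terms, the additions look roughly like this (a sketch; the points
table and the exact operator definition are illustrative -- see the commits
linked below for the real thing):

  -- expose the existing cube_distance function as an ordering operator
  CREATE OPERATOR <-> (
      LEFTARG = cube, RIGHTARG = cube,
      PROCEDURE = cube_distance, COMMUTATOR = '<->'
  );

  -- the kind of kNN query this enables
  SELECT * FROM points ORDER BY c <-> '(0,0,0)'::cube LIMIT 10;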


The code is in no way production-quality; it is in fact right around "look! 
it compiles!", complete with pasted-in, commented-out code from something I 
was mimicking.  I thought I'd share at this early stage in the hopes I might 
get some pointers, such as:


- What unintended consequences should I be looking for?
- What benchmarks should I do?
- What kind of edge cases might I consider?
- I'm just wrapping cube_distance and calling it through DirectFunctionCall; 
it's probably more proper to extract out the "real" function and call it 
from both cube_distance and g_cube_distance. Right?

- What else don't I know?  (Besides C, funny man.)

The patch, such as it is, is at:

https://github.com/jaylevitt/postgres/commit/9cae4ea6bd4b2e582b95d7e1452de0a7aec12857

with an even-messier test at

https://github.com/jaylevitt/postgres/commit/daa33e30acaa2c99fe554d88a99dd7d78ff6c784

I initially thought this patch made inserting and indexing slower, but then 
I realized the fast version was doing 1 million rows, and the slow one did 
10 million rows.  Which means: dinnertime.


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Peter Geoghegan
On 6 February 2012 21:19, Bruce Momjian  wrote:
> Peter Geoghegan obviously has done some serious work in improving
> sorting, and worked well with the community process.

Thank you for acknowledging that.

It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.
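
(To make that concrete: qsort(3) reaches its comparator through a function
pointer, which generally defeats inlining, and C has no template mechanism
to generate a type-specialized inner loop automatically -- hence one copy
per type. Toy sketch of the generic form being specialized away:)

    /* generic interface: one indirect call per comparison */
    static int
    cmp_int32(const void *a, const void *b)
    {
        int32       x = *(const int32 *) a;
        int32       y = *(const int32 *) b;

        return (x < y) ? -1 : (x > y) ? 1 : 0;
    }

    /* qsort(array, n, sizeof(int32), cmp_int32);
     * a specialized variant hard-codes this comparison in its loop */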

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote:
> Peter Geoghegan obviously has done some serious work in improving
> sorting, and worked well with the community process.  He has done enough
> analysis that I am hard-pressed to see how we would get similar
> improvement using a different method, so I think it comes down to
> whether we want the 28% speedup by adding 55k (1%) to the binary.
> 
> I think Peter has shown us how to get that, and what it will cost --- we
> just need to decide now whether it is worth it.  What I am saying is
> there probably isn't a cheaper way to get that speedup, either now or in
> the next few years.  (COPY might need similar help for speedups.)
> 
> I believe this is a big win and well worth the increased binary size
> because the speed up is significant, and because it is of general
> usefulness for a wide range of queries.  Either of these would be enough
> to justify the additional 1% size, but both make it an easy decision for
> me.  
> 
> FYI, I believe COPY needs similar optimizations; we have gotten repeated
> complaints about its performance and this method of optimization might
> also be our only option.

Sorry, that was wordy.  What I am saying is that years ago we did
hot-spot optimization for storage and tuple access using macros.  We
have looked for cheap optimizations for sort (and COPY) for years, but
haven't found any.  If we want optimizations in these areas, we are
probably going to need to do the same sort of hot-spot optimization we
did earlier, and Peter's work is an excellent candidate for that.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Assertion failure in AtCleanup_Portals

2012-02-06 Thread Tom Lane
Pavan Deolasee  writes:
> If I run the following sequence of commands, I get an assertion
> failure in current HEAD.

> postgres=# BEGIN;
> BEGIN
> postgres=# SELECT 1/0;
> ERROR:  division by zero
> postgres=# ROLLBACK TO A;
> ERROR:  no such savepoint
> postgres=# \q

> The process fails when the session is closed and aborted transaction
> gets cleaned at the proc_exit time. The stack trace is as below.

Hmm.  It works fine if you issue an actual ROLLBACK command there,
so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently
duplicating the full-fledged ROLLBACK code path.  No time to dig further
right now though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-06 Thread Bruce Momjian
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote:
> On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan  
> wrote:
> > Well, I don't think it's all that subjective - it's more the case that
> > it is just difficult, or it gets that way as you consider more
> > specialisations.
> 
> Sure it's subjective.  Two well-meaning people could have different
> opinions without either of them being "wrong".  If you do a lot of
> small, in-memory sorts, more of this stuff is going to seem worthwhile
> than if you don't.
> 
> > As for what types/specialisations may not make the cut, I'm
> > increasingly convinced that floats (in the following order: float4,
> > float8) should be the first to go. Aside from the fact that we cannot
> > use their specialisations for anything like dates and timestamps,
> > floats are just way less useful than integers in the context of
> > database applications, or at least those that I've been involved with.
> > As important as floats are in the broad context of computing, it's
> > usually only acceptable to store data in a database as floats within
> > scientific applications, and only then when their limitations are
> > well-understood and acceptable. I think we've all heard anecdotes at
> > one time or another, involving their limitations not being well
> > understood.
> 
> While we're waiting for anyone else to weigh in with an opinion on the
> right place to draw the line here, do you want to post an updated
> patch with the changes previously discussed?

Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them. 
I think it would be few, and perhaps they should be eliminated.

Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.  He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.

I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it.  What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years.  (COPY might need similar help for speedups.)

I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries.  Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.  

FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] freezing multixacts

2012-02-06 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Mon Feb 06 13:19:14 -0300 2012:
> 
> On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera
>  wrote:
> >> Suppose you have a tuple A which is locked by a series of transactions
> >> T0, T1, T2, ...; AIUI, each new locker is going to have to create a
> >> new mxid with all the existing entries plus a new one for itself.
> >> But, unless I'm confused, as it's doing so, it can discard any entries
> >> for locks taken by transactions which are no longer running.
> >
> > That's correct.  But the problem is a tuple that is locked or updated by
> > a very old transaction that doesn't commit or rollback, and the tuple is
> > never locked again.  Eventually the Xid could remain live while the mxid
> > is in wraparound danger.
> 
> Ah, I see.  I think we should probably handle that the same way we do
> for XIDs: try to force autovac when things get tight, then start
> issuing warnings, and finally just refuse to assign any more MXIDs.

Agreed.

> Another thing that might make sense, for both XIDs and MXIDs, is to
> start killing transactions that are preventing vacuum/autovacuum from
> doing their thing.  This could mean either killing the people who are
> holding back RecentGlobalXmin, so that we can actually freeze the old
> stuff; or killing people who are holding a conflicting lock, using the
> recovery-conflict stuff or some adaptation of it.

Yeah -- right now we only emit some innocuous-looking messages, which
I've seen most people ignore until they get bitten by a forced
anti-wraparound vacuum.  It'd be nice to get more aggressive about that
as the situation gets more critical.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-06 Thread Michael Meskes
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote:
> Shouldn't these be [7]? You can have up to 6 items, plus the terminator.

I take it only keywords have to be [7], right? Committed, thanks for spotting 
this.

There seems to be one more problem that I haven't had time to tackle yet. With
this patch the connection regression test (make checktcp) doesn't run through
anymore because one test connection cannot be made. It spits out the error
message:

FATAL:  invalid command-line arguments for server process 
HINT:  Try "postgres --help" for more information.

This connection used to work without the patch, and besides, the error message
is not very telling in this context.

So if anyone is interested in looking at this, fine; if not, I will as soon as I
find some spare time.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 6:32 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> The existing errmsg of "tablespace is not empty" doesn't cover all
>> reasons why tablespace was not removed.
>
> Yeah, in fact that particular statement is really pretty bogus for the
> replay case, because as the comment says we know that the tablespace
> *is* empty so far as full-fledged database objects are concerned.
>
>> The final message should have
>> errmsg "tablespace not fully removed"
>> errhint "you should resolve this manually if it causes further problems"
>
> Planning to go with this:
>
>                         errmsg("directories for tablespace %u could not be removed",
>                                xlrec->ts_id),
>                         errhint("You can remove the directories manually if necessary.")));
>
> I thought about an errdetail, but the preceding LOG entries from
> destroy_tablespace_directories should provide the details reasonably
> well.

Looks good.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
> > > I'm also not very comfortable with the idea of having flags on the page
> > > indicating whether it has a checksum or not. It's not hard to imagine a
> > > software or firmware bug or hardware failure that would cause pd_flags 
> > > field
> > > to be zeroed out altogether. It would be more robust if the expected
> > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with 
> > > that
> > > at upgrade somehow. And it still feels a bit whacky anyway.
> 
> Good idea. Let's use
> > 
> > 0-0-0 to represent upgraded from previous version, needs a bit set
> > 0-0-1 to represent new version number of page, no checksum
> > 1-1-1 to represent new version number of page, with checksum
> > 
> > So we have 1 bit dedicated to the page version, 2 bits to the checksum 
> > indicator
> 
> Interesting point that we would not be guarding against a bit flip from
> 1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
> don't see how upgrade figures into this.
> 
> However, I am unclear how Simon's idea above actually works.  We need
> two bits for redundancy, both 1, to mark a page as having a checksum.  I
> don't think mixing the idea of a new page version and checksum enabled
> really makes sense, especially since we have to plan for future page
> version changes.
> 
> I think we dedicate 2 bits to say we have computed a checksum, and 3
> bits to mark up to 8 possible page versions, so the logic is, in
> pd_flags, we use bits 0x8 and 0x10 to indicate that a checksum is stored
> on the page, and we use 0x20 and later for the page version number.  We
> can assume all the remaining bits are for the page version number until
> we need to define new bits, and we can start storing them at the end
> first, and work forward.  If all the version bits are zero, it means the
> page version number is still stored in pd_pagesize_version.

A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.
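
In code form, that simpler layout might look like this (hypothetical
defines, for illustration only):

  #define PD_HAS_FREE_LINES   0x0001  /* existing flag bits */
  #define PD_PAGE_FULL        0x0002
  #define PD_ALL_VISIBLE      0x0004
  #define PD_CHECKSUM1        0x0008  /* both set => page has a checksum */
  #define PD_CHECKSUM2        0x0010
  #define PD_VERSION_MASK     0xF000  /* page version in the last four bits */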

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Tom Lane
Simon Riggs  writes:
> The existing errmsg of "tablespace is not empty" doesn't cover all
> reasons why tablespace was not removed.

Yeah, in fact that particular statement is really pretty bogus for the
replay case, because as the comment says we know that the tablespace
*is* empty so far as full-fledged database objects are concerned.

> The final message should have
> errmsg "tablespace not fully removed"
> errhint "you should resolve this manually if it causes further problems"

Planning to go with this:

 errmsg("directories for tablespace %u could not be 
removed",
xlrec->ts_id),
 errhint("You can remove the directories manually if 
necessary.")));

I thought about an errdetail, but the preceding LOG entries from
destroy_tablespace_directories should provide the details reasonably
well.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
> >I wonder if we should just dedicate 3 page header bits, call that the
> >page version number, and set this new version number to 1, and assume
> >all previous versions were zero, and have them look in the old page
> >version location if the new version number is zero.  I am basically
> >thinking of how we can plan ahead to move the version number to a new
> >location and have a defined way of finding the page version number using
> >old and new schemes.
> 
> Three bits seems short-sighted, but yeah, something like 6-8 bits
> should be enough. On the whole, though, I think we should bite the
> bullet and invent a way to extend the page header at upgrade.

I just emailed a possible layout that allows for future page version
number expansion.  

I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade.  If it is going to be done, it would have to
happen in the backend while the system is running.  Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-02-06 Thread Marco Nenciarini
Hi guys,

Please find attached version 3 of our patch. We thoroughly followed your
suggestions and were able to implement "EACH foreign key constraints" 
with multi-column syntax as well.

"EACH foreign key constraints" represent PostgreSQL implementation of 
what are also known as Foreign Key Arrays.

Some limitations occur in this release, but as previously agreed these 
can be refined and defined in future release implementations.

This patch adds:

* support for EACH REFERENCES column constraint on array types 
  - e.g. c1 INT[] EACH REFERENCES t1
* support for EACH foreign key table constraints
  - e.g. FOREIGN KEY (EACH c1) REFERENCES t1
* support for EACH foreign keys in multi-column foreign key table 
  constraints
  - e.g. FOREIGN KEY (c1, EACH c2) REFERENCES t1 (u1, u2)
* support for two new referential actions on update/delete operations
  for single-column only EACH foreign keys:
** EACH CASCADE (deletes or updates elements inside the referencing
   array)
** EACH SET NULL (sets to NULL referencing element inside the foreign
   array)
* support for array_replace and array_remove functions as required by
the above actions
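
Putting the syntax above together, a minimal example looks like this (a
sketch; see the patch and its documentation for the authoritative
behaviour):

  CREATE TABLE t1 (u1 INT PRIMARY KEY);
  CREATE TABLE t2 (
      c1 INT[] EACH REFERENCES t1 ON DELETE EACH CASCADE
  );
  -- deleting a row from t1 now removes the matching elements
  -- from every referencing array in t2.c1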

As previously said, we preferred to keep this patch simple for 9.2 and 
forbid EACH CASCADE and EACH SET NULL on multi-column foreign keys.
After all, the majority of use cases is represented by EACH foreign key
column constraints (single-column foreign key arrays), and more
complicated use cases can be discussed for 9.3 - should this patch make
it. :)
Multi-dimensional arrays can also be used as referencing columns. In 
that case, though, ON DELETE EACH CASCADE will behave like ON DELETE EACH
SET NULL. This is a safe way of implementing the action. 
We have some ideas on how to implement this fully, but we feel it is
better to limit the behaviour for now.

As far as documentation is concerned, we:
* added actions and constraint info in the catalog
* added an entire section on "EACH foreign key constraints" in the data 
definition language chapter (we've simplified the second example, 
greatly following Noah's advice - let us know if this is ok with you)
* added array_remove (currently limited to single-dimensional arrays) 
and array_replace in the array functions chapter
* modified REFERENCES/FOREIGN KEY section in the CREATE TABLE command's 
documentation and added a special section on the EACH REFERENCES clause 
(using square braces as suggested)

Here follows a short list of notes for Noah:

* You proposed these changes: ARRAY CASCADE -> EACH CASCADE and ARRAY 
SET NULL -> EACH SET NULL. We stuck with EACH CASCADE and decided to
prepend the "EACH" keyword to standard's CASCADE and SET NULL. Grammar
is simpler and the emphasis is on the EACH keyword.
* Multi-dimensional arrays: ON DELETE EACH CASCADE -> ON DELETE EACH SET
NULL. We cannot determine the array's number of dimensions at definition
time as it depends on the actual values. As anticipated above, we have 
some ideas on multi-dimensional element removal, but not for this patch 
for the aforementioned reasons.
* Support of EACH CASCADE/SET NULL in ConvertTriggerToFK(): we decided 
to leave it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it 



EACH-foreign-key-constraints-aka-foreign-key-arrays.v3.patch.bz2
Description: application/bzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA

2012-02-06 Thread Merlin Moncure
On Sat, Feb 4, 2012 at 5:38 AM, Thomas Munro  wrote:
> On 16 January 2012 21:30, Josh Berkus  wrote:
>
>> Useful, yes.  Harder than it looks, probably.  I tried to mock up a
>> version of this years ago for a project where I needed it, and ran into
>> all kinds of race conditions.
>
> Can you remember any details about those race conditions?
>
>> Anyway, if it could be made to work, this is extremely useful for any
>> application which needs queueing behavior (which is a large plurality, if
>> not a majority, of applications).
>
> Ok, based on this feedback I decided to push further and try
> implementing this.  See POC/WIP patch attached.  It seems to work
> for simple examples but I haven't yet tried to break it or see how it
> interacts with more complicated queries or high concurrency levels.
> It probably contains at least a few rookie mistakes!  Any feedback
> gratefully received.
>
> The approach is described in my original email.  Short version:
> heap_lock_tuple now takes an enum called wait_policy instead of a
> boolean called nowait, with the following values:
>
>  LockWaitBlock: wait for lock (like nowait = false before),
>
>  LockWaitError: error if not immediately lockable (like nowait = true
>                 before)
>
>  LockWaitSkip:  give up and return HeapTupleWouldBlock if not
>                 immediately lockable (this is a new policy)
>
> The rest of the patch is about getting the appropriate value down to
> that function call, following the example of the existing nowait
> support, and skipping rows if you said SKIP LOCKED DATA and you got
> HeapTupleWouldBlock.
>
> Compared to one very popular commercial database's implementation, I
> think this is a little bit friendlier for the user who wants to
> distribute work.  Let's say you want to lock one row without lock
> contention, which this patch allows with FETCH FIRST 1 ROW ONLY FOR
> UPDATE SKIP LOCKED DATA in an SQL query.  In that other system, the
> mechanism for limiting the number of rows fetched is done in the WHERE
> clause, and therefore the N rows are counted *before* checking if the
> lock can be obtained, so users sometimes have to resort to stored
> procedures so they can control the FETCH from a cursor imperatively.
> In another popular commercial database from Redmond, you can ask for
> the top (first) N rows while using the equivalent of SKIP LOCKED DATA
> and it has the same effect as this patch as far as I can tell, and
> another large blue system is the same.
>
> As discussed in another branch of this thread, you can probably get
> the same effect with transactional advisory locks.  But I personally
> like row skipping better as an explicit feature because:
>
> (1) I think there might be an order-of-evaluation problem with a WHERE
> clause containing both lock testing and row filtering expressions (ie
> it is undefined right?) which you might need subselects to work around
> (ie to be sure to avoid false positive locks, not sure about this).
>
> (2) The advisory technique requires you to introduce an integer
> identifier if you didn't already have one and then lock that despite
> having already said which rows you want to try to lock in standard row
> filtering expressions.
>
> (3) The advisory technique requires all users of the table to
> participate in the advisory lock protocol even if they don't want to
> use the option to skip lock.
>
> (4) It complements NOWAIT (which could also have been done with
> transactional advisory locks, with a function like try_lock_or_fail).
>
> (5) I like the idea of directly porting applications from other
> databases with this feature (that is what led me here).

Yeah -- your stuff is also going to have much better interactions
with LIMIT.  Your enhancements will beat an advisory lock
implementation all day long.  The real competition is not advisory
locks, but the MVCC tricks played with PGQ.  It's all about
implementing producer-consumer queues (arguments that databases should
not do that are 100% bogus) and I can't help but wonder if that's a
better system.
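
For reference, the enum Thomas describes would read roughly as follows (the
type name is hypothetical; the values are per his description):

  typedef enum
  {
      LockWaitBlock,            /* wait for the lock (old nowait = false) */
      LockWaitError,            /* error out at once (old nowait = true) */
      LockWaitSkip              /* return HeapTupleWouldBlock instead */
  } LockWaitPolicy;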

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Bruce Momjian
On Mon, Feb 06, 2012 at 09:05:19AM +, Simon Riggs wrote:
> > In any case, I think it's a very bad idea to remove the page version field.
> > How are we supposed to ever be able to change the page format again if we
> > don't even have a version number on the page? I strongly oppose removing it.
> 
> Nobody is removing the version field, nor is anybody suggesting not
> being able to tell which page version we are looking at.

Agreed.  I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero.  The
idea was to move the version number from 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum.  I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.

Specifically, this becomes the checksum:

uint16  pd_pagesize_version;

and this holds the page version, if we have updated the page to the new
format:

uint16  pd_flags;   /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
#define PD_PAGE_FULL        0x0002  /* not enough free space for new tuple? */
#define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to
                                     * everyone */

#define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */


> > I'm also not very comfortable with the idea of having flags on the page
> > indicating whether it has a checksum or not. It's not hard to imagine a
> > software or firmware bug or hardware failure that would cause pd_flags field
> > to be zeroed out altogether. It would be more robust if the expected
> > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> > at upgrade somehow. And it still feels a bit whacky anyway.

> Good idea. Let's use
> 
> 0-0-0 to represent upgraded from previous version, needs a bit set
> 0-0-1 to represent new version number of page, no checksum
> 1-1-1 to represent new version number of page, with checksum
> 
> So we have 1 bit dedicated to the page version, 2 bits to the checksum 
> indicator

Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
don't see how upgrade figures into this.

However, I am unclear how Simon's idea above actually works.  We need
two bits for redundancy, both 1, to mark a page as having a checksum.  I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.

I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x8 and 0x10 to indicate that a checksum is stored
on the page, and we use 0x20 and later for the page version number.  We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward.  If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.

> >> I wonder if we should just dedicate 3 page header bits, call that the
> >> page version number, and set this new version number to 1, and assume
> >> all previous versions were zero, and have them look in the old page
> >> version location if the new version number is zero.  I am basically
> >> thinking of how we can plan ahead to move the version number to a new
> >> location and have a defined way of finding the page version number using
> >> old and new schemes.
> >
> >
> > Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> > enough. On the whole, though, I think we should bite the bullet and invent a
> > way to extend the page header at upgrade.
> 
> There are currently many spare bits. I don't see any need to allocate
> them to this specific use ahead of time - especially since that is the
> exact decision we took last time when we reserved 16 bits for the
> version.

Right, but I am thinking we should set things up so we can grow the page
version number into the unused bits, rather than box it between bits we
are already using.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-02-06 Thread Andrew Dunstan



On 01/31/2012 11:10 PM, Andrew Dunstan wrote:



Here's a possible patch for the exclude-table-data problem along the 
lines you suggest.



Should I apply this?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about type cast

2012-02-06 Thread Tom Lane
Robert Haas  writes:
> 2012/2/6 zoulx1982 :
>> my question is why int can cast to bit, but not to bit varying?

> Off the top of my head, I'm guessing that it's just a case of nobody
> having implemented it.

I have some vague recollection that we didn't want to be squishy about
the length of the resulting bitstring.  When you cast to bit you're more
or less forced to specify what you want, since the spec-mandated default
length is only 1.  varbit would leave us having to make a decision in
the code.
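
To illustrate the default-length point (sketch):

    SELECT 10::bit;         -- '0': bare "bit" means bit(1), the low-order bit
    SELECT 10::bit(3);      -- '010'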

Anyway, try searching the pgsql-hackers archives if you want some
history.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about type cast

2012-02-06 Thread Robert Haas
2012/2/6 zoulx1982 :
> hi,
> there is a problem about type cast that i don't understand; following is my
> test.
>
> postgres=# select 10::bit(3);
>  bit
> -
>  010
> (1 row)
> postgres=# select 10::bit varying(3);
> ERROR:  cannot cast type integer to bit varying
> LINE 1: select 10::bit varying(3);
>  ^
> postgres=#
>
> my question is why int can cast to bit, but not to bit varying?
> i want to know the reason.
> thank you for your time.

Off the top of my head, I'm guessing that it's just a case of nobody
having implemented it.  You can do this:

rhaas=# select 10::bit(3)::varbit;
 varbit

 010
(1 row)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-06 Thread Marko Kreen
On Fri, Feb 03, 2012 at 01:15:30PM +0100, Christian Ullrich wrote:
> * Peter Eisentraut wrote:
> >I noticed ecpglib still uses PQconnectdb() with a craftily assembled
> >connection string.  Here is a patch to use PQconnectdbParams() instead.
> 
> + const char *conn_keywords[6];
> + const char *conn_values[6];
> 
> Shouldn't these be [7]? You can have up to 6 items, plus the terminator.

This bug is now in GIT.
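
For the record, the fix just leaves room for the NULL terminator that
PQconnectdbParams() scans for in the keyword array (sketch; the values
shown are illustrative):

  PGconn     *conn;
  const char *conn_keywords[7];   /* up to six parameters + terminator */
  const char *conn_values[6];
  int         i = 0;

  conn_keywords[i] = "dbname";
  conn_values[i++] = "mydb";
  /* ... up to five more keyword/value pairs ... */
  conn_keywords[i] = NULL;        /* PQconnectdbParams() stops scanning here */

  conn = PQconnectdbParams(conn_keywords, conn_values, 0);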

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] freezing multixacts

2012-02-06 Thread Robert Haas
On Mon, Feb 6, 2012 at 9:31 AM, Alvaro Herrera
 wrote:
>> Suppose you have a tuple A which is locked by a series of transactions
>> T0, T1, T2, ...; AIUI, each new locker is going to have to create a
>> new mxid with all the existing entries plus a new one for itself.
>> But, unless I'm confused, as it's doing so, it can discard any entries
>> for locks taken by transactions which are no longer running.
>
> That's correct.  But the problem is a tuple that is locked or updated by
> a very old transaction that doesn't commit or rollback, and the tuple is
> never locked again.  Eventually the Xid could remain live while the mxid
> is in wraparound danger.

Ah, I see.  I think we should probably handle that the same way we do
for XIDs: try to force autovac when things get tight, then start
issuing warnings, and finally just refuse to assign any more MXIDs.
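
A sketch of that guard as it might appear in GetNewMultiXactId(), modeled
on GetNewTransactionId()'s xidVacLimit logic (the multiVacLimit,
multiWarnLimit and multiStopLimit names are hypothetical analogues of the
XID limits; none of this exists yet):

    if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit))
    {
        /* try to force autovacuum into action */
        SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);

        if (!MultiXactIdPrecedes(result, MultiXactState->multiStopLimit))
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("database is not accepting commands that assign new MultiXactIds")));
        else if (!MultiXactIdPrecedes(result, MultiXactState->multiWarnLimit))
            ereport(WARNING,
                    (errmsg("MultiXactId wrap limit is approaching")));
    }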

Another thing that might make sense, for both XIDs and MXIDs, is to
start killing transactions that are preventing vacuum/autovacuum from
doing their thing.  This could mean either killing the people who are
holding back RecentGlobalXmin, so that we can actually freeze the old
stuff; or killing people who are holding a conflicting lock, using the
recovery-conflict stuff or some adaptation of it.  We've made it
fairly difficult to avoid having autovacuum run at all with the
xidVacLimit/xidStopLimit stuff, but there's still no real defense
against autovacuum running but failing to mitigate the problem, either
because there's a long-running transaction holding a snapshot open, or
because someone is sitting on a relation or buffer lock.  This of
course is off-topic from your patch here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] freezing multixacts

2012-02-06 Thread Simon Riggs
On Thu, Feb 2, 2012 at 4:33 AM, Alvaro Herrera  wrote:

> However, there are cases where not even that is possible -- consider
> tuple freezing during WAL recovery.  Recovery is going to need to
> replace those multis with other multis, but it cannot create new multis
> itself.  The only solution here appears to be that when multis are
> frozen in the master, replacement multis have to be logged too.  So the
> heap_freeze_tuple Xlog record will have a map of old multi to new.  That
> way, recovery can just determine the new multi to use for any particular
> old multi; since multixact creation is also logged, we're certain that
> the replacement value has already been defined.

Multixacts are ignored during recovery. Why do anything at all?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-02-06 Thread Gabriele Bartolini

Ciao Alvaro,

On 06/02/12 16:02, Alvaro Herrera wrote:

FWIW I committed this last week, though I changed the debug message
wording slightly -- hope that's OK; it can be adjusted of course.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9



Beautiful! :)

Thank you very much.

Ciao,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-02-06 Thread Alvaro Herrera

Excerpts from Gabriele Bartolini's message of Tue Jan 31 12:29:51 -0300 2012:
> Hi Robert,
> 
> sorry for the delay.
> 
> On 27/01/12 15:47, Robert Haas wrote:
> > This email thread seems to have trailed off without reaching a 
> > conclusion. The patch is marked as Waiting on Author in the CommitFest 
> > application, but I'm not sure that's accurate. Can we try to nail this 
> > down? 
> Here is my final version, which incorporates comments from Josh. I have also 
> added debug information to be printed in case '-d' is given.

FWIW I committed this last week, though I changed the debug message
wording slightly -- hope that's OK; it can be adjusted of course.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b2e431a4db81a735d1474c4d1565a20b835878c9

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-06 Thread Robert Haas
On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes  wrote:
> We really need to nail that down.  Could you post the scripts (on the
> wiki) you use for running the benchmark and making the graph?  I'd
> like to see how much work it would be for me to change it to detect
> checkpoints and do something like color the markers blue during
> checkpoints and red elsewhen.

They're pretty crude - I've attached them here.

> Also, I'm not sure how bad that graph really is.  The overall
> throughput is more variable, and there are a few latency spikes but
> they are few.  The dominant feature is simply that the long-term
> average is less than the initial burst.  Of course the goal is to have
> a high level of throughput with a smooth latency under sustained
> conditions.  But to expect that that long-sustained smooth level of
> throughput be identical to the "initial burst throughput" sounds like
> more of a fantasy than a goal.

That's probably true, but the drop-off is currently quite extreme.
The fact that disabling full_page_writes causes throughput to increase
by >4x is dismaying, at least to me.

> If we want to accept the lowered
> throughput and work on the what variability/spikes are there, I think
> a good approach would be to take the long term TPS average, and dial
> the number of clients back until the initial burst TPS matches that
> long term average.  Then see if the spikes still exist over the long
> term using that dialed back number of clients.

Hmm, I might be able to do that.

> I don't think the full-page-writes are leading to WALInsert
> contention, for example, because that would probably lead to smooth
> throughput decline, but not those latency spikes in which entire
> seconds go by without transactions.

Right.

> I doubt it is leading to general
> IO compaction, as the IO at that point should be pretty much
> sequential (the checkpoint has not yet reached the sync stage, the WAL
> is sequential).  So I bet that that is caused by fsyncs occurring at
> xlog segment switches, and the locking that that entails.

That's definitely possible.

> If I
> recall, we can have a segment which is completely written to OS and in
> the process of being fsynced, and we can have another segment which is
> in some state of partially in wal_buffers and partly written out to OS
> cache, but that we can't start reusing the wal_buffers that were
> already written to OS for that segment (and therefore are
> theoretically available for reuse by the upcoming 3rd segment)  until
> the previous segments fsync has completed.  So all WALInsert's freeze.
>  Or something like that.  This code has changed a bit since last time
> I studied it.

Yeah, I need to better-characterize where the pauses are coming from,
but I'm reluctant to invest too much effort in until Heikki's xlog
scaling patch goes in, because I think that's going to change things
enough that any work done now will mostly be wasted.

It might be worth trying a run with wal_buffers=32MB or something like
that, just to see whether that mitigates any of the locking pile-ups.
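
(For anyone reproducing this: wal_buffers is set in postgresql.conf and can
only be changed at server start; since 9.1 the default of -1 auto-tunes it
to 1/32 of shared_buffers, capped at 16MB. So this would just be:)

    wal_buffers = 32MB          # requires restart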

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


runtestl
Description: Binary data


makeplot
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] freezing multixacts

2012-02-06 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Thu Feb 02 11:24:08 -0300 2012:
> On Wed, Feb 1, 2012 at 11:33 PM, Alvaro Herrera  
> wrote:

> > If there's only one remaining member, the problem is easy: replace it
> > with that transaction's xid, and set the appropriate hint bits.  But if
> > there's more than one, the only way out is to create a new multi.  This
> > increases multixactid consumption, but I don't see any other option.
> 
> Why do we need to freeze anything if the transactions are still
> running?  We certainly don't freeze regular transaction IDs while the
> transactions are still running; it would give wrong answers.  It's
> probably possible to do it for mxids, but why would you need to?

Well, I was thinking that we could continue generating the mxids
continuously and if we didn't freeze the old running ones, we could
overflow.  So one way to deal with the problem would be rewriting the
old ones into new ones.  But it has occurred to me that instead of doing
that we could simply disallow creation of new ones until the oldest ones
have been closed and removed from tables -- which is more in line with
what we do for Xids anyway.

> Suppose you have a tuple A which is locked by a series of transactions
> T0, T1, T2, ...; AIUI, each new locker is going to have to create a
> new mxid with all the existing entries plus a new one for itself.
> But, unless I'm confused, as it's doing so, it can discard any entries
> for locks taken by transactions which are no longer running.

That's correct.  But the problem is a tuple that is locked or updated by
a very old transaction that doesn't commit or rollback, and the tuple is
never locked again.  Eventually the Xid could remain live while the mxid
is in wraparound danger.

> So given
> an mxid with living members, any dead member in that mxid must have
> been living at the time the newest member was added.  Surely we can't
> be consuming mxids anywhere near fast enough for that to be a problem.

Well, the problem is that while it should be rare to consume mxids as
fast as necessary for this problem to show up, it *is* possible --
unless we add some protection that they are not created until the old
ones are frozen (which now means "removed").

> > However, there are cases where not even that is possible -- consider
> > tuple freezing during WAL recovery.  Recovery is going to need to
> > replace those multis with other multis, but it cannot create new multis
> > itself.  The only solution here appears to be that when multis are
> > frozen in the master, replacement multis have to be logged too.  So the
> > heap_freeze_tuple Xlog record will have a map of old multi to new.  That
> > way, recovery can just determine the new multi to use for any particular
> > old multi; since multixact creation is also logged, we're certain that
> > the replacement value has already been defined.
> 
> This doesn't sound right.  Why would recovery need to create a multi
> that didn't exist on the master?  Any multi it applies to a record
> should be one that it was told to apply by the master; and the master
> should have already WAL-logged the creation of that multi.  I don't
> think that "replacement" mxids have to be logged; I think that *all*
> mxids have to be logged.  Am I all wet?

Well, yeah, all mxids are logged, in particular those that would have
been used for replacement.  However, I think I've discarded the idea of
replacement altogether now, because it makes things simpler on both
master and slave.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] libpq parallel build

2012-02-06 Thread Lionel Elie Mamane
src/interfaces/libpq/fe-misc.c #include-s pg_config_paths.h, but in
the makefile that dependency is not declared.

This breaks the build with 'make -jN' (with a "big" N) ... sometimes
(non-deterministically).

Patch attached. Made from 9.1.1, but from browsing the git master on
the web interface, the facts above seem to still be true.

-- 
Lionel
diff --recursive -u misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile
--- misc/build/postgresql-9.1.1/src/interfaces/libpq/Makefile	2012-02-06 15:11:19.0 +0100
+++ misc/build/postgresql-9.1.1.patch/src/interfaces/libpq/Makefile	2012-02-06 15:02:51.0 +0100
@@ -109,6 +109,7 @@
 libpq.rc: $(top_builddir)/src/Makefile.global
 
 fe-connect.o: fe-connect.c $(top_builddir)/src/port/pg_config_paths.h
+fe-misc.o: fe-misc.c $(top_builddir)/src/port/pg_config_paths.h
 
 $(top_builddir)/src/port/pg_config_paths.h:
 	$(MAKE) -C $(top_builddir)/src/port pg_config_paths.h



Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

2012-02-06 Thread Fujii Masao
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales  wrote:
> Let me know if you have any thoughts/comments, etc.  The patch is
> enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write approach can prevent torn pages in neither the
double-write file nor the data file during *base backup*, because both
files can be backed up while they are being written. Is this right? If so,
to avoid the torn-page problem, shouldn't we still write FPIs to WAL
during online backup even after the double-write patch has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 11:25, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
>  wrote:
>> Good idea. However, the followup idea to that discussion was to not only
>> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
>> altogether, by allowing clog to grow indefinitely. You do want to freeze at
>> some point of course, to truncate the clog, but it would be nice to not have
>> a hard limit. The way to do that is to store an xid "epoch" in the page
>> header, so that Xids are effectively 64-bits wide, even though the xid
>> fields on the tuple header are only 32-bits wide. That does require a new
>> field in the page header.
>
> We wouldn't need to do that, would we?

Huh? Do you mean that we wouldn't need to implement that feature?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
 wrote:
> On 06.02.2012 10:05, Simon Riggs wrote:
>>
>> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
>>   wrote:
>>>
>>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>>
>>>> Thanks.  Clearly we don't need 16 bits to represent our page version
>>>> number because we rarely change it. The other good thing is I don't
>>>> remember anyone wanting additional per-page storage in the past few
>>>> years except for a checksum.
>>>
>>>
>>> There's this idea that I'd like to see implemented:
>>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>>
>>
>> As you'll note, adding that field will change the page format and is
>> therefore ruled out for 9.2.
>>
>> ISTM there is a different way to handle that anyway. All we need to do
>> is to record the LSN of the last wraparound in shared memory/control
>> file. Any block with page LSN older than that has all-frozen rows.
>> That doesn't use any space nor does it require another field to be
>> set.
>
>
> Good idea. However, the followup idea to that discussion was to not only
> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
> altogether, by allowing clog to grow indefinitely. You do want to freeze at
> some point of course, to truncate the clog, but it would be nice to not have
> a hard limit. The way to do that is to store an xid "epoch" in the page
> header, so that Xids are effectively 64-bits wide, even though the xid
> fields on the tuple header are only 32-bits wide. That does require a new
> field in the page header.

We wouldn't need to do that, would we?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Heikki Linnakangas

On 06.02.2012 10:05, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
>  wrote:
>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>
>>> Thanks.  Clearly we don't need 16 bits to represent our page version
>>> number because we rarely change it. The other good thing is I don't
>>> remember anyone wanting additional per-page storage in the past few
>>> years except for a checksum.
>>
>> There's this idea that I'd like to see implemented:
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>
> As you'll note, adding that field will change the page format and is
> therefore ruled out for 9.2.
>
> ISTM there is a different way to handle that anyway. All we need to do
> is to record the LSN of the last wraparound in shared memory/control
> file. Any block with page LSN older than that has all-frozen rows.
> That doesn't use any space nor does it require another field to be
> set.


Good idea. However, the followup idea to that discussion was to not only 
avoid the I/O needed to mark tuples as frozen, but to avoid xid 
wraparound altogether, by allowing clog to grow indefinitely. You do 
want to freeze at some point of course, to truncate the clog, but it 
would be nice to not have a hard limit. The way to do that is to store 
an xid "epoch" in the page header, so that Xids are effectively 64-bits 
wide, even though the xid fields on the tuple header are only 32-bits 
wide. That does require a new field in the page header.
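
As a sketch of how that widening would work (pd_xid_epoch being
precisely the hypothetical new header field, nothing that exists today):

/*
 * Sketch: with an epoch in the page header, a tuple's 32-bit xid can
 * be widened to an effectively 64-bit value.
 */
static uint64
PageGetFullXid(PageHeader phdr, TransactionId xid)
{
    return ((uint64) phdr->pd_xid_epoch << 32) | (uint64) xid;
}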


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Sun, Feb 5, 2012 at 11:14 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> And here's a proposed patch for not throwing ERROR during replay of DROP
> TABLESPACE.  I had originally thought this would be a one-liner
> s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
> needs to be changed too, so that it doesn't throw error for unremovable
> directories.

Looks good.

The existing errmsg of "tablespace is not empty" doesn't cover all the
reasons why the tablespace was not removed.

The final message should have
errmsg "tablespace not fully removed"
errhint "you should resolve this manually if it causes further problems"

The errdetail is good.
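
Something like this, as a sketch -- LOG level per Tom's patch, with the
errdetail here being only a placeholder for what the patch already
reports, and tablespaceoid standing in for whatever variable is in scope:

ereport(LOG,
        (errmsg("tablespace %u was not fully removed", tablespaceoid),
         errdetail("Some directories could not be removed."),
         errhint("You should resolve this manually if it causes further problems.")));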


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Report: race conditions in WAL replay routines

2012-02-06 Thread Simon Riggs
On Sun, Feb 5, 2012 at 10:23 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> Here's a proposed patch for the issues around unlocked updates of
> shared-memory state.  After going through this I believe that there is
> little risk of any real problems in the current state of the code; this
> is more in the nature of future-proofing against foreseeable changes.
> (One such is that we'd discussed fixing the age() function to work
> during Hot Standby.)  So I suggest applying this to HEAD but not
> back-patching.

All looks very good to me. Agreed.

> Except for one thing.  I realized while looking at the NEXTOID replay
> code that it is completely broken: it only advances
> ShmemVariableCache->nextOid when that's less than the value in the WAL
> record.  So that comparison fails if the OID counter wraps around during
> replay.  I've fixed this in the attached patch by just forcibly
> assigning the new value instead of trying to be smart, and I think
> probably that aspect of it needs to be back-patched.

Ouch! Well spotted.

Suggest fixing that as a separate patch; looks like backpatch to 8.0.
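
For the archives, a sketch of the fix as described -- not the committed
patch itself; the oidCount reset mirrors what normally accompanies a
nextOid update:

/*
 * Old, broken logic: only ever move the counter forward, which fails
 * if the OID counter wrapped around during replay:
 *
 *     if (ShmemVariableCache->nextOid < nextOid)
 *         ShmemVariableCache->nextOid = nextOid;
 *
 * New logic: the WAL record is authoritative, so just assign it.
 */
memcpy(&nextOid, XLogRecGetData(record), sizeof(Oid));
ShmemVariableCache->nextOid = nextOid;
ShmemVariableCache->oidCount = 0;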

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
 wrote:
> On 06.02.2012 05:48, Bruce Momjian wrote:
>>
>> On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
>>>
>>> In the proposed scheme there are two flag bits set on the page to
>>> indicate whether the field is used as a checksum rather than a version
>>> number. So its possible the checksum could look like a valid page
>>> version number, but we'd still be able to tell the difference.
>>
>>
>> Thanks.  Clearly we don't need 16 bits to represent our page version
>> number because we rarely change it. The other good thing is I don't
>> remember anyone wanting additional per-page storage in the past few
>> years except for a checksum.
>
>
> There's this idea that I'd like to see implemented:
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.

That leaves the same problem as before: if someone writes to the page,
you need to freeze the rows first.
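
A minimal sketch of that check, assuming an invented lastWraparoundLSN
field kept in shared memory and pg_control:

static bool
PageHasOnlyFrozenRows(Page page)
{
    /* lastWraparoundLSN is the invented field described above */
    return XLByteLT(PageGetLSN(page),
                    ShmemVariableCache->lastWraparoundLSN);
}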

> In any case, I think it's a very bad idea to remove the page version field.
> How are we supposed to ever be able to change the page format again if we
> don't even have a version number on the page? I strongly oppose removing it.

Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.

> I'm also not very comfortable with the idea of having flags on the page
> indicating whether it has a checksum or not. It's not hard to imagine a
> software or firmware bug or hardware failure that would cause pd_flags field
> to be zeroed out altogether. It would be more robust if the expected
> bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> at upgrade somehow. And it still feels a bit whacky anyway.

Good idea. Let's use

0-0-0 to represent upgraded from a previous version, needs a bit set
0-0-1 to represent the new version number of the page, no checksum
1-1-1 to represent the new version number of the page, with checksum

So we have 1 bit dedicated to the page version, and 2 bits to the
checksum indicator.
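
As a sketch in bufpage.h terms -- the macro names and the particular
spare pd_flags bits chosen are invented for illustration only:

#define PD_NEW_VERSION    0x0100    /* invented: new page version bit */
#define PD_CHECKSUM1      0x0200    /* invented: checksum indicator, bit 1 */
#define PD_CHECKSUM2      0x0400    /* invented: checksum indicator, bit 2 */

/* 0-0-0: page upgraded from the previous format */
#define PageIsUpgraded(page) \
    ((((PageHeader) (page))->pd_flags & \
      (PD_NEW_VERSION | PD_CHECKSUM1 | PD_CHECKSUM2)) == 0)

/* 1-1-1: new format, and the version field holds a checksum */
#define PageHasChecksum(page) \
    ((((PageHeader) (page))->pd_flags & (PD_CHECKSUM1 | PD_CHECKSUM2)) == \
     (PD_CHECKSUM1 | PD_CHECKSUM2))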

>> I wonder if we should just dedicate 3 page header bits, call that the
>> page version number, and set this new version number to 1, and assume
>> all previous versions were zero, and have them look in the old page
>> version location if the new version number is zero.  I am basically
>> thinking of how we can plan ahead to move the version number to a new
>> location and have a defined way of finding the page version number using
>> old and new schemes.
>
>
> Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> enough. On the whole, though, I think we should bite the bullet and invent a
> way to extend the page header at upgrade.

There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-06 Thread Simon Riggs
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian  wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +, Simon Riggs wrote:
>> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian  wrote:
>> > On Sat, Dec 24, 2011 at 03:56:58PM +, Simon Riggs wrote:
>> >> > Also, as far as I can see this patch usurps the page version field,
>> >> > which I find unacceptably short-sighted.  Do you really think this is
>> >> > the last page layout change we'll ever make?
>> >>
>> >> No, I don't. I hope and expect the next page layout change to
>> >> reintroduce such a field.
>> >>
>> >> But since we're agreed now that upgrading is important, changing page
>> >> format isn't likely to be happening until we get an online upgrade
>> >> process. So future changes are much less likely. If they do happen, we
>> >> have some flag bits spare that can be used to indicate later versions.
>> >> It's not the prettiest thing in the world, but it's a small ugliness
>> >> in return for an important feature. If there was a way without that, I
>> >> would have chosen it.
>> >
>> > Have you considered the CRC might match a valid page version number?
>> > Is that safe?
>>
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So its possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks.  Clearly we don't need 16 bits to represent our page version
> number because we rarely change it.  The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.
>
> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero.  I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.
>
> I don't want to get to a point where we need a bit per page number
> version.

Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] about type cast

2012-02-06 Thread zoulx1982
Hi,
There is a problem with type casting that I don't understand; the
following is my test.
 
postgres=# select 10::bit(3);
 bit
-
 010
(1 row)
postgres=# select 10::bit varying(3);
ERROR:  cannot cast type integer to bit varying
LINE 1: select 10::bit varying(3);
 ^
postgres=#
 
My question is: why can int be cast to bit, but not to bit varying?
I want to know the reason.
Thank you for your time.

Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-06 Thread Shigeru Hanada
Thanks for the comments!

(2012/02/06 5:08), Kohei KaiGai wrote:
> Yes. In my opinion, one significant benefit of pgsql_fdw is to execute
> qualifiers on the distributed nodes; that enables utilizing multiple
> CPU resources efficiently.
> Duplicate checks are a reliable way to keep invisible tuples filtered
> out, indeed. But they would degrade one of the competitive
> characteristics of pgsql_fdw.
> 
> https://github.com/kaigai/pg_strom/blob/master/plan.c#L693
> In my module, qualifiers that are executable on the device side are
> detached from baserel->baserestrictinfo, and the remaining qualifiers
> are chained to the list.
> The is_device_executable_qual() is equivalent to is_foreign_expr() in
> the pgsql_fdw module.

Agreed; from a performance standpoint, I too think that pushed-down
qualifiers should not be evaluated again on the local side.

> Of course, it is your decision, and I might be missing something.
> 
> BTW, what was the undesirable behavior in your previous implementation?

In early development, maybe while testing PREPARE/EXECUTE or DECLARE, I
saw that iterated execution of a foreign scan produced wrong results,
including rows which did NOT match the pushed-down qualifiers.  And, at
last, I could recall what happened at that time: it was just a trivial
bug I made.  I had probably removed the pushed-down qualifiers in the
Path generation phase, so the generated plan node lost the qualifiers
permanently.

In short, I'll remove remote qualifiers from baserestrictinfo, like
pg_strom does.
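
Roughly like this, as a sketch: is_foreign_expr() is the pgsql_fdw
helper named in this thread (its exact signature is assumed here), the
rest is generic planner list handling:

List       *remote_quals = NIL;
List       *local_quals = NIL;
ListCell   *lc;

foreach(lc, baserel->baserestrictinfo)
{
    RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

    if (is_foreign_expr(root, baserel, rinfo->clause))
        remote_quals = lappend(remote_quals, rinfo);    /* shipped to remote */
    else
        local_quals = lappend(local_quals, rinfo);      /* checked locally */
}

/* only locally-checked quals remain attached to the relation */
baserel->baserestrictinfo = local_quals;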

>>> [Design comment]
>>> I'm not sure why store_result() uses MessageContext to save
>>> the Tuplestorestate within PgsqlFdwExecutionState.
>>> The source code comment says it is used to avoid memory leaks in error
>>> cases. I had a similar experience implementing my own fdw module,
>>> so I understand that the per-scan context is already cleared by the time
>>> the resource-release callback runs; thus, handles to external resources
>>> have to be saved in a separate memory context.
>>> In my personal opinion, the right design is to declare a memory context
>>> for pgsql_fdw itself, instead of reusing an existing memory context.
>>> (A wiser design would be to define a sub-memory-context for each foreign
>>> scan, then remove the sub-memory-context after releasing the handles.)
>>
>> I simply chose a built-in context which has enough lifespan, but now I
>> think that using MessageContext directly is not the recommended way.  As
>> you say, creating a new context as a child of MessageContext for each
>> scan in BeginForeignScan (or the first IterateForeignScan) would be
>> better.  Please see the attached patch.
>>
>> One other option is getting rid of the tuplestore by holding result rows
>> as a PGresult, and tracking it for error cases which might happen.
>> A ResourceOwner callback can be used to release the PGresult on error,
>> similar to PGconn.
>>
> If we could have the set of results in a per-query memory context (thus
> no need for an explicit release on error), it would be a more ideal
> design.  Is it possible to implement that based on the libpq APIs?

Currently no, so I used a tuplestore even though it requires copying the
results.  However, Kyotaro Horiguchi's patch might make it possible; I'm
reading his patch to determine whether it suits pgsql_fdw.

http://archives.postgresql.org/message-id/20120202143057.ga12...@gmail.com

> Please note that the per-query memory context is already released by the
> time the ResourceOwner callback is launched, so this approach is not
> workable if libpq requires releasing some resource.

I see.  We need to use a context which has a longer lifetime if we want
to track malloc'ed PGresults.  I already use CacheContext for connection
pooling, so linking PGresults to their source connection would be a
solution.
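
A minimal sketch of the per-scan context variant, assuming the 9.1-era
AllocSetContextCreate() signature; festate and its field name are
illustrative only:

/* in BeginForeignScan (or the first IterateForeignScan) */
festate->scan_cxt = AllocSetContextCreate(MessageContext,
                                          "pgsql_fdw per-scan data",
                                          ALLOCSET_DEFAULT_MINSIZE,
                                          ALLOCSET_DEFAULT_INITSIZE,
                                          ALLOCSET_DEFAULT_MAXSIZE);

/* ... keep fetched rows and the tuplestore in festate->scan_cxt ... */

/* in EndForeignScan, one delete releases everything at once */
MemoryContextDelete(festate->scan_cxt);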

>>> [Design comment]
>>> When should "BEGIN" be issued on the remote side?
>>> The connect_pg_server() is the only chance to issue a "BEGIN" command
>>> on the remote side, when the connection is opened. However, the
>>> connection shall be kept open unless an error is raised. Thus, the
>>> remote side will continue to work within a single transaction block,
>>> even if the local side iterates a pair of "BEGIN" and "COMMIT".
>>> I'd like to suggest closing the transaction block at the end of either
>>> the scan, the transaction, or the sub-transaction.
>>
>> Indeed, remote transactions should be terminated at some point.
>> Terminating at the end of a scan seems troublesome because a connection
>> might be shared by multiple scans in a query.  I'd prefer aborting the
>> remote transaction at the end of the local query.  Please see
>> abort_remote_tx in the attached patch.
>>
> It seems to me that abort_remote_tx in ReleaseConnection() is reasonable.
> However, isn't an ABORT needed in GetConnection() the first time?

Hm, forcing the overhead of aborting a transaction onto all local queries
is unreasonable.  A redundant BEGIN doesn't cause an error but just
generates a WARNING, so I'll remove the abort_remote_tx call preceding
begin_remote_tx.
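
For what it's worth, a sketch of the remaining begin_remote_tx -- the
function name comes from this thread, and the body is just ordinary
libpq usage:

static void
begin_remote_tx(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "BEGIN");

    /*
     * A redundant BEGIN only raises a WARNING on the remote side, so
     * only genuine failures are reported as errors here.
     */
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        char       *msg = pstrdup(PQerrorMessage(conn));

        PQclear(res);
        ereport(ERROR,
                (errmsg("could not begin remote transaction: %s", msg)));
    }
    PQclear(res);
}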

>>> [Comment to improve]
>>> It seems to me the design of