[HACKERS] Single pass vacuum - take 2

2011-08-22 Thread Pavan Deolasee
Hi All,

Here is a revised patch based on our earlier discussion. I implemented
Robert's idea of tracking the vacuum generation number in the line
pointer itself. For LP_DEAD line pointers, the lp_off/lp_len is unused
(and always set to 0 for heap tuples). We use those 30 bits to store
the generation number of the vacuum which would have potentially
removed the corresponding index pointers, if the vacuum finished
successfully. The pg_class information is used to know the status of
the vacuum, whether it failed or succeeded. 30-bit numbers are large
enough that we can ignore any wrap-around related issues. With this
change, we don't need any additional header or special space in the
page, which was one of the main objections to the previous version.

Other than this major change, I have added code commentary at relevant
places and also fixed the item.h comments to reflect the change. I
think the patch is ready for a serious review now.
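The bit-packing described above can be sketched in standalone C. This is an illustration only, under assumed helper names (ItemIdSetDeadVacGen/ItemIdGetDeadVacGen are not quoted from the patch): the two 15-bit fields of a dead line pointer together hold one 30-bit vacuum generation number.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of PostgreSQL's 32-bit line pointer layout (standalone copy). */
typedef struct
{
	unsigned	lp_off:15,		/* unused for LP_DEAD; high half of vacgen */
				lp_flags:2,		/* LP_DEAD state */
				lp_len:15;		/* unused for LP_DEAD; low half of vacgen */
} ItemIdData;

#define VACGEN_MASK 0x3FFFFFFF	/* 30 bits */

/* Hypothetical helper: stamp a 30-bit generation into a dead line pointer. */
static void
ItemIdSetDeadVacGen(ItemIdData *id, uint32_t vacgen)
{
	vacgen &= VACGEN_MASK;
	id->lp_off = vacgen >> 15;		/* top 15 bits */
	id->lp_len = vacgen & 0x7FFF;	/* bottom 15 bits */
}

/* Hypothetical helper: read the stamped generation back. */
static uint32_t
ItemIdGetDeadVacGen(const ItemIdData *id)
{
	return ((uint32_t) id->lp_off << 15) | id->lp_len;
}
```

A round-trip through the two helpers preserves any value below 2^30, which is why wrap-around can be ignored in practice.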

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 20bca0d..6213631 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -155,6 +155,7 @@ heap_page_items(PG_FUNCTION_ARGS)
 		 * many other ways, but at least we won't crash.
 		 */
 		if (ItemIdHasStorage(id) &&
+			!ItemIdIsDead(id) &&
 			lp_len >= sizeof(HeapTupleHeader) &&
 			lp_offset == MAXALIGN(lp_offset) &&
 			lp_offset + lp_len <= raw_page_size)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 06db65d..cf65c05 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3984,7 +3984,8 @@ log_heap_clean(Relation reln, Buffer buffer,
 			   OffsetNumber *redirected, int nredirected,
 			   OffsetNumber *nowdead, int ndead,
 			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
+			   TransactionId latestRemovedXid,
+			   uint32 vacgen)
 {
 	xl_heap_clean xlrec;
 	uint8		info;
@@ -3999,6 +4000,7 @@ log_heap_clean(Relation reln, Buffer buffer,
 	xlrec.latestRemovedXid = latestRemovedXid;
 	xlrec.nredirected = nredirected;
 	xlrec.ndead = ndead;
+	xlrec.vacgen = vacgen;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = SizeOfHeapClean;
@@ -4300,6 +4302,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	int			ndead;
 	int			nunused;
 	Size		freespace;
+	uint32		vacgen;
 
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
@@ -4332,6 +4335,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 
 	nredirected = xlrec->nredirected;
 	ndead = xlrec->ndead;
+	vacgen = xlrec->vacgen;
 	end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
 	redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean);
 	nowdead = redirected + (nredirected * 2);
@@ -4343,7 +4347,8 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	heap_page_prune_execute(buffer,
 			redirected, nredirected,
 			nowdead, ndead,
-			nowunused, nunused);
+			nowunused, nunused,
+			vacgen);
 
 	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 0cfa866..00ac676 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -31,9 +31,12 @@ typedef struct
 	TransactionId new_prune_xid;	/* new prune hint value for page */
 	TransactionId latestRemovedXid;		/* latest xid to be removed by this
 		 * prune */
+	int			already_dead;		/* number of already dead line pointers */
+
 	int			nredirected;	/* numbers of entries in arrays below */
 	int			ndead;
 	int			nunused;
+
 	/* arrays that accumulate indexes of items to be changed */
 	OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
 	OffsetNumber nowdead[MaxHeapTuplesPerPage];
@@ -125,8 +128,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
 			TransactionId ignore = InvalidTransactionId;		/* return value not
  * needed */
 
-			/* OK to prune */
-			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore);
+			/* OK to prune - pass invalid vacuum generation number */
+			(void) heap_page_prune(relation, buffer, OldestXmin, true, &ignore, 0);
 		}
 
 		/* And release buffer lock */
@@ -153,13 +156,15 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin)
  */
 int
 heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
-bool report_stats, TransactionId *latestRemovedXid)
+bool report_stats, TransactionId *latestRemovedXid,
+uint32 current_vacgen)
 {
 	int			ndeleted = 0;
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber offnum,
 maxoff;
 	PruneState	prstate;
+	uint32		last_finished_vacgen = RelationGetLastVacGen(relation);
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -175,6 +180,7 @@ 

[HACKERS] tab stop in README

2011-08-22 Thread YAMAMOTO Takashi
hi,

i know that postgresql uses ts=4 for C source code.
but how about documentation?

src/backend/access/transam/README seems to have both of
ts=4 and ts=8 mixed.

YAMAMOTO Takashi

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-08-22 Thread daveg
On Fri, Aug 12, 2011 at 04:19:37PM -0700, daveg wrote:
 
 This seems to be bug month for my client. Now there are seeing periods
 where all new connections fail immediately with the error:
 
FATAL:  lock AccessShareLock on object 0/1260/0 is already held 
 
 This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
 been up for months. It started happening sporadicly a few days ago. It will
 do this for a period of several minutes to an hour and then go back to
 normal for hours or days.
 
 One complete failing session out of several hundred around that time:
 -
 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection 
 received: host=op05.xxx port=34067
 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: 
 user=apps database=c77
 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on 
 object 0/1260/0 is already held
 --

This is to add additional information to the original report:

For a while this was happening on many different databases in one postgresql
8.4.7 instance on a single large host ('U2' 512GB 64cpu) running RH 5.
That has been quiet for several days and the newest batches of errors have
happened on only on a single database 'c23', in a postgresql 9.0.4 instance
on a smaller host ('A', 64GB 8cpu) running SuSE 10.2.

No memory errors or other misbehaviour have been seen on either of these
hosts in recent months.

The original error was:

  lock AccessShareLock on object 0/1260/0 is already held

which is for pg_database. The recent errors are:

  lock AccessShareLock on object 16403/2615/0 is already held

which is for pg_namespace in database c23.

All of the original and most of the recent batches of errors were immediately
after connecting to a database and being authorized, that is, before any
statements were attempted. However, some of the most recent are on the first
query statement. That is after logging in and doing things like set
transaction ...  the first select would hit this error.

It seems to come in clusters, sometimes, which suggests something shared
by multiple processes. For example, here are the times for the errors
on c23 in the afternoon of August 20:

20 07:14:12.722

20 16:05:07.798
20 16:05:07.808

20 16:05:10.519

20 16:07:07.726
20 16:07:08.722
20 16:07:09.734
20 16:07:10.656

20 16:07:25.436

20 16:22:23.983
20 16:22:24.014
20 16:22:24.335
20 16:22:24.409
20 16:22:24.477
20 16:22:24.499
20 16:22:24.516

20 16:30:58.210

20 16:31:15.261
20 16:31:15.296
20 16:31:15.324
20 16:31:15.348

20 18:06:16.515

20 18:06:49.198
20 18:06:49.204

20 18:06:51.444

20 21:03:05.940

So far I've got:

  - affects system tables
  - happens very soon after process startup
  - in 8.4.7 and 9.0.4
  - not likely to be hardware or OS related
  - happens in clusters for periods of a few seconds to many minutes

I'll work on printing the LOCK and LOCALLOCK when it happens, but it's
hard to get downtime to pick up new builds. Any other ideas on getting to
the bottom of this?

Thanks

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question: CREATE EXTENSION and create schema permission?

2011-08-22 Thread Kohei KaiGai
The attached patch adds a permission check for the scenario that I
explained below.

Unlike CreateSchemaCommand(), we don't have check_is_member_of_role() here
because the extowner is obviously the same as the current user in this code path.

I hope this patch will also be back-ported to the v9.1 tree, not only to v9.2
development.

Thanks,

2011/8/21 Dimitri Fontaine dimi...@2ndquadrant.fr:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 The current implementation set the current user as owner of the new schema.
 The default permission check of schema allows owner to create several kinds
 of underlying objects.

 In the result, we may consider a scenario that a user without permissions to
 create new objects possibly get a schema created by CREATE EXTENSION
 that allows him to create new objects (such as table, function, ...).

 I don't think it is a desirable behavior. :-(

 Agreed,
 --
 Dimitri Fontaine
 http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

-- 
KaiGai Kohei kai...@kaigai.gr.jp


pgsql-create-extension-permission-checks.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-22 Thread Alexander Korotkov
On Wed, Aug 17, 2011 at 11:11 AM, Alexander Korotkov
aekorot...@gmail.comwrote:

 On Tue, Aug 16, 2011 at 11:15 PM, Heikki Linnakangas 
 heikki.linnakan...@enterprisedb.com wrote:

 On 16.08.2011 22:10, Heikki Linnakangas wrote:

 Here's a version of the patch with a bunch of minor changes:


 And here it really is, this time with an attachment...

 Thanks a lot. I'm going to start rerunning the tests now.


The first batch of test results will be available soon (running the tests and
processing the results takes some time). Meanwhile, here is a patch with a few
small bugfixes.

--
With best regards,
Alexander Korotkov.


gist_fast_build-0.14.2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] UUID datatype GiST index support

2011-08-22 Thread Misa Simic
Hi,

Hopefully someone can help me and point me in the right direction :)

I have been looking for a GiST support extension for the UUID datatype... since I
could not find it... I wanted to write it myself.

I need it more for an EXCLUSION constraint than for using a GiST index just on
a UUID column...

i.e:

CREATE TABLE test_exclude
(
id serial NOT NULL,
guid uuid NOT NULL,
valid_period period NOT NULL,
CONSTRAINT test_excludepk PRIMARY KEY (id),
EXCLUDE USING gist (guid WITH =, valid_period WITH &&) -- for the same guid,
period must not overlap...
)

I have taken a look at the btree_gist contrib source code... there are GiST support
functions for many datatypes, so I wanted to take the same pattern and
follow it...

however, the problem happened in the first line of code :) (though I am coming
from a totally different world - .Net)

pattern is:

typedef struct
{
  ADTdataType lower;
  ADTdataType upper;
} datatypeKEY;

i.e. for Date:

typedef struct
{
  DateADT lower;
  DateADT upper;
} dateKEY;


So I guessed for uuid it would be:

typedef struct
{
  pg_uuid_t lower;
  pg_uuid_t upper;
} uuidKEY;

because uuid.h in pg says:
 *In C, we use the name pg_uuid_t,
 *  to avoid conflicts with any uuid_t type that might be defined by the
system headers...

and there is:

/* opaque struct; defined in uuid.c */
typedef struct pg_uuid_t pg_uuid_t;


But the compiler shows an error: field 'lower' (and 'upper') has incomplete
type

I succeeded in avoiding the error by adding:

struct pg_uuid_t
{
  unsigned char data[UUID_LEN];
};

but then getting errors in compare functions:

i.e.

static int
m4_uuidkey_cmp(const void *a, const void *b)
{
uuidKEY    *ia = (uuidKEY *) (((Usrt *) a)->t);
uuidKEY    *ib = (uuidKEY *) (((Usrt *) b)->t);
int res;

res = DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper),
UUIDPGetDatum(ia->upper)));
 if (res == 0)
return DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper),
UUIDPGetDatum(ib->upper)));

return res;
}


Getting error: aggregate value used where an integer was expected!

It would be a lot appreciated if anyone could help me and suggest the best
way to make Gist support for UUID datatype...

Many thanks,

Misa
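Since UUID ordering is plain byte order over the 16 raw bytes, the key comparison can also be written without DirectFunctionCall2 at all. A minimal standalone sketch, with hypothetical names standing in for the opaque pg_uuid_t (this is an illustration, not the btree_gist API):

```c
#include <assert.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical 16-byte UUID struct, since pg_uuid_t is opaque in uuid.h. */
typedef struct
{
	unsigned char data[UUID_LEN];
} my_uuid_t;

/* GiST range key: smallest and largest UUID in the subtree. */
typedef struct
{
	my_uuid_t	lower;
	my_uuid_t	upper;
} uuidKEY;

/* Compare keys byte-wise: lower bound first, upper bound as tie-breaker. */
static int
my_uuidkey_cmp(const uuidKEY *ia, const uuidKEY *ib)
{
	int			res = memcmp(ia->lower.data, ib->lower.data, UUID_LEN);

	if (res == 0)
		res = memcmp(ia->upper.data, ib->upper.data, UUID_LEN);
	return res;
}
```

This sidesteps both the incomplete-type problem and the Datum conversion issue, at the cost of not reusing the server's uuid_cmp.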


Re: [HACKERS] UUID datatype GiST index support

2011-08-22 Thread Alexander Korotkov
Hi!

On Mon, Aug 22, 2011 at 2:54 PM, Misa Simic misa.si...@gmail.com wrote:

 static int
 m4_uuidkey_cmp(const void *a, const void *b)
 {
 uuidKEY    *ia = (uuidKEY *) (((Usrt *) a)->t);
  uuidKEY    *ib = (uuidKEY *) (((Usrt *) b)->t);
 int res;

 res = DatumGetInt32(DirectFunctionCall2(uuid_cmp, UUIDPGetDatum(ia->upper),
 UUIDPGetDatum(ia->upper)));
  if (res == 0)
 return DatumGetInt32(DirectFunctionCall2(uuid_cmp,
 UUIDPGetDatum(ia->upper), UUIDPGetDatum(ib->upper)));

 return res;
 }


 Getting error: aggregate value used where an integer was expected!

Seems that you need the address-of operator before ia->upper and ia->lower
(likely one of the operands should be ia->lower). UUIDPGetDatum expects a pointer
as an argument, i.e. UUIDPGetDatum(&ia->upper).


 It would be a lot appreciated if anyone could help me and suggest the best
 way to make Gist support for UUID datatype...

I think you're on the right way. btree_gist is an extension which provides
GiST indexing of scalar datatypes, and UUID is one of them. So, the module you
are writing should be quite similar.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] UUID datatype GiST index support

2011-08-22 Thread Misa Simic
Thanks Alexander,

'Scalar datatype' - given me a hint...

Looking further into the btree_gist source, for the inet datatype, which seems a bit
more complex than uuid... a (char, char, char[16]) structure for inet, compared
to just char[16] for uuid.

The GiST pattern works with the double datatype... and there is a method
convert_network_to_scalar(Datum, Oid), which converts an inet value to a
scalar value... and then all index compare functions are based on the
doubles, which leads me to the conclusion (maybe wrongly) that if I can
convert a UUID value to a double it would make the job a lot easier... and
pretty straightforward...

Any suggestion, how to convert UUID (char[16]) to scalar?

looking into convert_network_to_scalar, what confuses me is that even though
there is char[16] for an ip address... the code is:

if (ip_family(ip) == PGSQL_AF_INET)
	len = 4;
else
	len = 5;

res = ip_family(ip);
for (i = 0; i < len; i++)
{
	res *= 256;
	res += ip_addr(ip)[i];
}
return res;



it takes just the first 4 or 5 bytes from the IP address even though there are 16
(declared as char[16])

Many thanks,

Misa


2011/8/22 Alexander Korotkov aekorot...@gmail.com

 Hi!

 On Mon, Aug 22, 2011 at 2:54 PM, Misa Simic misa.si...@gmail.com wrote:

 static int
 m4_uuidkey_cmp(const void *a, const void *b)
 {
 uuidKEY    *ia = (uuidKEY *) (((Usrt *) a)->t);
  uuidKEY    *ib = (uuidKEY *) (((Usrt *) b)->t);
 int res;

 res = DatumGetInt32(DirectFunctionCall2(uuid_cmp,
 UUIDPGetDatum(ia->upper), UUIDPGetDatum(ia->upper)));
  if (res == 0)
 return DatumGetInt32(DirectFunctionCall2(uuid_cmp,
 UUIDPGetDatum(ia->upper), UUIDPGetDatum(ib->upper)));

 return res;
 }


 Getting error: aggregate value used where an integer was expected!

 Seems that you need the address-of operator before ia->upper and ia->lower
 (likely one of the operands should be ia->lower). UUIDPGetDatum expects a pointer
 as an argument, i.e. UUIDPGetDatum(&ia->upper).


 It would be a lot appreciated if anyone could help me and suggest the best
 way to make Gist support for UUID datatype...

 I think you're on the right way. btree_gist is an extension which provides
 GiST indexing of scalar datatype. UUID is one of them. So, the module you
 are writing should be quite similar.

 --
 With best regards,
 Alexander Korotkov.


Re: [HACKERS] Rethinking sinval callback hook API

2011-08-22 Thread Robert Haas
On Sun, Aug 21, 2011 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Aug 19, 2011 at 2:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 1. Robert was speculating the other day about wanting to be able to
 snoop the inval traffic.  Right now, callbacks can only snoop a fairly
 small subset of it.

 Is that true?  It appears to me that the events that aren't exposed at
 all are smgr and relmap invalidations, which don't seem terribly
 important, and presumably not a majority of the traffic.

 Well, "important" is in the eye of the beholder here --- if you did need
 to see one of those, you're flat outta luck.  It's also the case that
 you can only snoop one catcache per registration, so there aren't enough
 slots available in the fixed-size list to watch all the catcache
 traffic.

Yeah, I'm not opposed to making this more generic; regardless of
whether we have an immediate use case for it, it seems like a pretty
good idea.  I was just surprised that you described the available
portion as a small subset.

The one-catcache-per-registration limitation is an interesting point.
I doubt that we want to move the "is this the relevant catcache?" test
inside all the callbacks, but we might want to have a special value
that means "call me back when there's a change that affects ANY
catcache"... or even more generally "call me back when there's a
change that affects ANY system catalog, regardless of whether there is
an associated catcache or not".  sepgsql, for example, really wants to
be able to get a callback when pg_seclabel or pg_shseclabel is
updated, precisely because it wants to maintain its own
special-purpose cache on a catalog on which we DON'T want to add
a catcache.

 Exposing SharedInvalidationMessage could be going too far in the other
 direction, but I'm thinking we really ought to do something.

I think the best option might be to expose it as an opaque struct.  In
other words, the header file available to other backends would have
something like:

struct SharedInvalidationMessage;
typedef struct SharedInvalidationMessage SharedInvalidationMessage;

typedef enum
{
SIM_CATCACHE,
SIM_CATALOG,
SIM_RELCACHE,
SIM_SMGR,
SIM_RELMAP
} SIMType;

SIMType SIMGetType(SharedInvalidationMessage *);
Oid SIMGetDatabase(SharedInvalidationMessage *);
BackendId SIMGetBackendId(SharedInvalidationMessage *);
/* etc. */

That allows us to do things like change the number of bits we use to
store the backend ID (e.g. from the current 24 to 32 or 16) without
needing to change the callers.   In fact, you could probably even add
whole new message types and most callers wouldn't need to care, since
the typical caller is going to do something like ... if
(SIMGetType(msg) != SIM_SOMETHING) return right off the bat.
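The opaque-struct pattern can be sketched in a few lines. The internal layout below is invented purely for illustration (the whole point of the proposal is that real callers never see it; in the real header the struct would be an incomplete type and the accessors would live in one translation unit):

```c
#include <assert.h>

/* Message types, as proposed in the mail. */
typedef enum
{
	SIM_CATCACHE,
	SIM_CATALOG,
	SIM_RELCACHE,
	SIM_SMGR,
	SIM_RELMAP
} SIMType;

/*
 * Hidden layout: callers go only through the accessors, so the field
 * widths (e.g. the 24-bit backend ID) can change without breaking them.
 */
typedef struct SharedInvalidationMessage
{
	unsigned	type:8;
	unsigned	backend:24;		/* could grow to 32 bits transparently */
	unsigned int database;
} SharedInvalidationMessage;

static SIMType
SIMGetType(const SharedInvalidationMessage *msg)
{
	return (SIMType) msg->type;
}

static unsigned int
SIMGetBackendId(const SharedInvalidationMessage *msg)
{
	return msg->backend;
}

static unsigned int
SIMGetDatabase(const SharedInvalidationMessage *msg)
{
	return msg->database;
}
```

A caller filtering for one message type then reads naturally: if (SIMGetType(msg) != SIM_RELCACHE) return; — with no dependency on the struct's bit layout.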

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PushActiveSnapshot(GetTransactionSnapshot())

2011-08-22 Thread Alvaro Herrera
Excerpts from Simon Riggs's message of dom ago 21 16:23:39 -0300 2011:
 In common cases of snapshot use we run GetSnapshotData() into a
 statically allocated snapshot, then immediately copy the static struct
 into a dynamically allocated copy.
 
 The static allocation was designed to remove the overhead of dynamic
 allocation, but then we do it anyway.
 
 The snapmgr code does this explicitly, but the reason isn't
 documented, it just says we must do this.

IIRC the active snapshot is scribbled onto by some operations, which is
why the copy is mandatory.  Maybe there's some way to optimize things so
that the copy is done only when necessary.  IIRC the copying of the
ActiveSnapshot was only introduced because some subtle bugs were
detected in the code without copy.  When I introduced the mandatory
copy, I don't remember thinking about the statically allocated struct.

The fact that PushActiveSnapshot and GetTransactionSnapshot are in two
completely separate modules complicates optimization.  Note I'm not
saying it's impossible -- I just didn't look into it.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.1rc1: TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

2011-08-22 Thread Erik Rijkers
I am developing against 9.1beta3 and encountered this problem. I repeated it 
against rc1 (or
actually, REL9_1_STABLE taken from git today).

The below SQL causes:

TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

and the session crashes.


select *
from (
  select
   col1
,  col2
,  reviewed
  from
(values
( , 'aweyu', 3)
  , ( , 'bwetu', 2)
  , ( , 'vwxyz', 1)
 )
as r(col1,col2,col3)
  left join (
select 1 as reviewed, c2 from ( values ('abcde') ) as t1(c2)
  union all select 0 as reviewed, c2 from ( values ('vwxyz') ) as t2(c2)
  )
 as up_all on up_all.c2 = r.col2
  order by col3
)
  as f
;


To simplify I used 'values' for the actual tables that I am using, but the 
above still causes the
FailedAssertion.

Removing the outer select avoids the error, and
removing the order by avoids the error.


Erik Rijkers







-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single pass vacuum - take 2

2011-08-22 Thread Jim Nasby
On Aug 22, 2011, at 1:22 AM, Pavan Deolasee wrote:
 Hi All,
 
 Here is a revised patch based on our earlier discussion. I implemented
 Robert's idea of tracking the vacuum generation number in the line
 pointer itself. For LP_DEAD line pointers, the lp_off/lp_len is unused
 (and always set to 0 for heap tuples). We use those 30 bits to store
 the generation number of the vacuum which would have potentially
 removed the corresponding index pointers, if the vacuum finished
 successfully. The pg_class information is used to know the status of
 the vacuum, whether it failed or succeeded. 30-bit numbers are large
 enough that we can ignore any wrap-around related issues. With this

+* Note: We don't worry about the wrap-around issues here since it would
+* take a 1 Billion vacuums on the same relation for the vacuum 
generation
+* to wrap-around. That would take ages to happen and even if it 
happens,
+* the chances that we might have dead-vacuumed line pointers still
+* stamped with the old (failed) vacuum are infinitely small since some
+* other vacuum cycle would have taken care of them.

It would be good if some comment explained how we're safe in the case of an 
aborted vacuum. I'm guessing that when vacuum finds any line pointers that 
don't match the last successful vacuum exactly it will go and re-examine them 
from scratch?

I'm thinking that there should be a single comment somewhere that explains 
exactly how the 2-pass algorithm works. The comment in vacuum_log_cleanup_info 
seems to have the most info, but there's a few pieces still missing.

Also, found a typo:

+ * pass anyways). But this gives us two lareg benefits:

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] cheaper snapshots redux

2011-08-22 Thread Robert Haas
On Wed, Jul 27, 2011 at 10:51 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 20, 2010 at 10:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder whether we could do something involving WAL properties --- the
 current tuple visibility logic was designed before WAL existed, so it's
 not exploiting that resource at all.  I'm imagining that the kernel of a
 snapshot is just a WAL position, ie the end of WAL as of the time you
 take the snapshot (easy to get in O(1) time).  Visibility tests then
 reduce to "did this transaction commit with a WAL record located before
 the specified position?".  You'd need some index datastructure that made
 it reasonably cheap to find out the commit locations of recently
 committed transactions, where "recent" means back to recentGlobalXmin.
 That seems possibly do-able, though I don't have a concrete design in
 mind.

 [discussion of why I don't think an LSN will work]

 But having said that an LSN can't work, I don't see why we can't just
 use a 64-bit counter.  In fact, the predicate locking code already
 does something much like this, using an SLRU, for serializable
 transactions only.  In more detail, what I'm imagining is an array
 with 4 billion entries, one per XID, probably broken up into files of
 say 16MB each with 2 million entries per file.  Each entry is a 64-bit
 value.  It is 0 if the XID has not yet started, is still running, or
 has aborted.  Otherwise, it is the commit sequence number of the
 transaction.
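The scheme quoted above (which the rest of this message goes on to abandon) amounts to a simple commit-sequence-number lookup. A toy sketch with a drastically shrunk array (the mail proposes 4 billion entries split across SLRU-like files):

```c
#include <assert.h>
#include <stdint.h>

/*
 * One 64-bit slot per XID. 0 means "not yet started, still running, or
 * aborted"; otherwise the slot holds the commit sequence number (CSN).
 * A snapshot is just the CSN current at the time it was taken.
 */
#define NUM_XIDS 1024			/* toy size for illustration */

static uint64_t commit_csn[NUM_XIDS];	/* zero-initialized */

static void
record_commit(uint32_t xid, uint64_t csn)
{
	commit_csn[xid] = csn;
}

/* An XID is visible iff it committed at or before the snapshot's CSN. */
static int
xid_visible_in_snapshot(uint32_t xid, uint64_t snapshot_csn)
{
	uint64_t	csn = commit_csn[xid];

	return csn != 0 && csn <= snapshot_csn;
}
```

The memory problem the message describes follows directly from this layout: a snapshot taken 10 million XIDs after a still-open transaction must be able to probe any slot in that 80MB span.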

I've been giving this quite a bit more thought, and have decided to
abandon the scheme described above, at least for now.  It has the
advantage of avoiding virtually all locking, but it's extremely
inefficient in its use of memory in the presence of long-running
transactions.  For example, if there's an open transaction that's been
sitting around for 10 million transactions or so and has an XID
assigned, any new snapshot is going to need to probe into the big
array for any XID in that range.  At 8 bytes per entry, that means
we're randomly accessing ~80MB of memory-mapped data.  That
seems problematic both in terms of blowing out the cache and (on small
machines) possibly even blowing out RAM.  Nor is that the worst case
scenario: a transaction could sit open for 100 million transactions.

Heikki has made the suggestion a few times (and a few other people
have since made somewhat similar suggestions in different words) of
keeping an up-to-date snapshot in shared memory such that transactions
that need a snapshot can simply copy it.  I've since noted that in Hot
Standby mode, that's more or less what the KnownAssignedXids stuff
already does.  I objected that, first, the overhead of updating the
snapshot for every commit would be too great, and second, it didn't
seem to do a whole lot to reduce the size of the critical section, and
therefore probably wouldn't improve performance that much.  But I'm
coming around to the view that these might be solvable problems rather
than reasons to give up on the idea altogether.

With respect to the first problem, what I'm imagining is that we not
do a complete rewrite of the snapshot in shared memory on every
commit.  Instead, when a transaction ends, we'll decide whether to (a)
write a new snapshot or (b) just record the XIDs that ended.  If we do
(b), then any backend that wants a snapshot will need to copy from
shared memory both the most recently written snapshot and the XIDs
that have subsequently ended.  From there, it can figure out which
XIDs are still running.  Of course, if the list of recently-ended XIDs
gets too long, then taking a snapshot will start to get expensive, so
we'll need to periodically do (a) instead.  There are other ways that
this could be done as well; for example, the KnownAssignedXids stuff
just flags XIDs that should be ignored and then periodically compacts
away the ignored entries.

I think the real trick is figuring out a design that can improve
concurrency.  If you keep a snapshot in shared memory and periodically
overwrite it in place, I don't think you're going to gain much.
Everyone who wants a snapshot still needs a share-lock and everyone
who wants to commit still needs an exclusive-lock, and while you might
be able to make the critical section a bit shorter, I think it's still
going to be hard to make big gains that way.  What I'm thinking about
instead is using a ring buffer with three pointers: a start pointer, a
stop pointer, and a write pointer.  When a transaction ends, we
advance the write pointer, write the XIDs or a whole new snapshot into
the buffer, and then advance the stop pointer.  If we wrote a whole
new snapshot, we advance the start pointer to the beginning of the
data we just wrote.
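A toy, single-threaded sketch of the three-pointer ring buffer (names and layout are invented for illustration; the real design would need atomic operations and memory barriers, which are omitted here):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64

/*
 * Writers advance 'write' before storing and 'stop' after, so readers
 * only ever see fully written data between 'start' and 'stop'. Readers
 * copy that region and then re-check 'write' to detect being lapped.
 */
typedef struct
{
	uint32_t	items[RING_SIZE];
	uint64_t	start;			/* oldest entry a snapshot needs */
	uint64_t	stop;			/* end of validly published data */
	uint64_t	write;			/* how far writers have claimed */
} SnapshotRing;

static void
ring_append(SnapshotRing *r, uint32_t xid)
{
	r->write++;									/* claim the slot */
	r->items[(r->write - 1) % RING_SIZE] = xid;	/* store the data */
	r->stop = r->write;							/* publish it */
}

/*
 * Copy [start, stop) into out; return 0 (retry) if a writer may have
 * overwritten part of the region while we were reading it.
 */
static int
ring_read(const SnapshotRing *r, uint32_t *out, int *n)
{
	uint64_t	start = r->start;
	uint64_t	stop = r->stop;
	uint64_t	i;

	for (i = start; i < stop; i++)
		out[i - start] = r->items[i % RING_SIZE];
	if (r->write - start > RING_SIZE)	/* lapped: data may be stale */
		return 0;
	*n = (int) (stop - start);
	return 1;
}
```

The key property is that readers take no lock at all: they pay with an occasional retry instead, which is exactly the concurrency trade the message is after.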

Someone who wants to take a snapshot must read the data between the
start and stop pointers, and must then check that the write pointer
hasn't advanced so far in the meantime that the data they read might
have been overwritten before they finished reading it.  Obviously,

Re: [HACKERS] How to define global variable in postgresql

2011-08-22 Thread Jim Nasby
On Aug 19, 2011, at 4:15 PM, Valentine Gogichashvili wrote:
  Hello. How can we define a global variable in postgresql?
 
 you can also use global structure in plpython for example:
 
 http://www.postgresql.org/docs/9.0/static/plpython-sharing.html

Same thing with plperl.

BTW, if you want something that's global but also transactional then you'll 
want to use a temp table.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Displaying accumulated autovacuum cost

2011-08-22 Thread Jim Nasby
On Aug 18, 2011, at 10:41 AM, Greg Smith wrote:
 that was all they got.  I'm going to add directly computing the write MB/s 
 figure from the dirty data written too, since that ends up being the thing 
 that I keep deriving by hand anyway.

I know folks have talked about progress, but I haven't seen anything 
specific... could you add info about what table/index vacuum is working on, and 
how far along it is? I realize that's not very close to an actual % completion, 
but it's far better than what we have right now.

FWIW, the number I end up caring about isn't so much write traffic as read. 
Thanks to a good amount of battery-backed write cache (and possibly some iSCSI 
misconfiguration), our writes are generally much cheaper than our reads.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.1rc1: TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

2011-08-22 Thread Robert Haas
On Mon, Aug 22, 2011 at 4:16 PM, Erik Rijkers e...@xs4all.nl wrote:
 I am developing against 9.1beta3 and encountered this problem. I repeated it 
 against rc1 (or
 actually, REL9_1_STABLE taken from git today).

 The below SQL causes:

 TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

 and the session crashes.

 select *
 from (
  select
       col1
    ,  col2
    ,  reviewed
  from
    (values
        ( , 'aweyu', 3)
      , ( , 'bwetu', 2)
      , ( , 'vwxyz', 1)
     )
        as r(col1,col2,col3)
  left join (
                select 1 as reviewed, c2 from ( values ('abcde') ) as t1(c2)
      union all select 0 as reviewed, c2 from ( values ('vwxyz') ) as t2(c2)
  )
     as up_all on up_all.c2 = r.col2
  order by col3
 )
  as f
 ;


 To simplify I used 'values' for the actual tables that I am using, but the 
 above still causes the
 FailedAssertion.

 Removing the outer select avoids the error and
 removing order by avoids the error

I can reproduce this on master but not on 9.0.  I suspect the problem
was introduced by this commit:

commit 0f61d4dd1b4f95832dcd81c9688dac56fd6b5687
Author: Tom Lane t...@sss.pgh.pa.us
Date:   Fri Nov 19 17:31:50 2010 -0500

Improve relation width estimation for subqueries.

...but I haven't confirmed that yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] cheaper snapshots redux

2011-08-22 Thread Jim Nasby
On Aug 22, 2011, at 4:25 PM, Robert Haas wrote:
 What I'm thinking about
 instead is using a ring buffer with three pointers: a start pointer, a
 stop pointer, and a write pointer.  When a transaction ends, we
 advance the write pointer, write the XIDs or a whole new snapshot into
 the buffer, and then advance the stop pointer.  If we wrote a whole
 new snapshot, we advance the start pointer to the beginning of the
 data we just wrote.
 
 Someone who wants to take a snapshot must read the data between the
 start and stop pointers, and must then check that the write pointer
 hasn't advanced so far in the meantime that the data they read might
 have been overwritten before they finished reading it.  Obviously,
 that's a little risky, since we'll have to do the whole thing over if
 a wraparound occurs, but if the ring buffer is large enough it
 shouldn't happen very often.  And a typical snapshot is pretty small
 unless massive numbers of subxids are in use, so it seems like it
 might not be too bad.  Of course, it's pretty hard to know for sure
 without coding it up and testing it.

Something that would be really nice to fix is our reliance on a fixed size of 
shared memory, and I'm wondering if this could be an opportunity to start in a 
new direction. My thought is that we could maintain two distinct shared memory 
snapshots and alternate between them. That would allow us to actually resize 
them as needed. We would still need something like what you suggest to allow 
for adding to the list without locking, but with this scheme we wouldn't need 
to worry about extra locking when taking a snapshot since we'd be doing that in 
a new segment that no one is using yet.

The downside is such a scheme does add non-trivial complexity on top of what 
you proposed. I suspect it would be much better if we had a separate mechanism 
for dealing with shared memory requirements (shalloc?). But if it's just not 
practical to make a generic shared memory manager it would be good to start 
thinking about ways we can work around fixed shared memory size issues.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] cheaper snapshots redux

2011-08-22 Thread Robert Haas
On Mon, Aug 22, 2011 at 6:45 PM, Jim Nasby j...@nasby.net wrote:
 Something that would be really nice to fix is our reliance on a fixed size of 
 shared memory, and I'm wondering if this could be an opportunity to start in 
 a new direction. My thought is that we could maintain two distinct shared 
 memory snapshots and alternate between them. That would allow us to actually 
 resize them as needed. We would still need something like what you suggest to 
 allow for adding to the list without locking, but with this scheme we 
 wouldn't need to worry about extra locking when taking a snapshot since we'd 
 be doing that in a new segment that no one is using yet.

 The downside is such a scheme does add non-trivial complexity on top of what 
 you proposed. I suspect it would be much better if we had a separate 
 mechanism for dealing with shared memory requirements (shalloc?). But if it's 
 just not practical to make a generic shared memory manager it would be good 
 to start thinking about ways we can work around fixed shared memory size 
 issues.

Well, the system I'm proposing is actually BETTER than having two
distinct shared memory snapshots.  For example, right now we cache up
to 64 subxids per backend.  I'm imagining that going away and using
that memory for the ring buffer.  Out of the box, that would imply a
ring buffer of 64 * 103 = 6592 slots.  If the average snapshot lists
100 XIDs, you could rewrite the snapshot dozens of times before
the buffer wraps around, which is obviously a lot more than two.  Even
if subtransactions are being heavily used and each snapshot lists 1000
XIDs, you still have enough space to rewrite the snapshot several
times over before wraparound occurs.  Of course, at some point the
snapshot gets too big and you have to switch to retaining only the
toplevel XIDs, which is more or less the equivalent of what happens
under the current implementation when any single transaction's subxid
cache overflows.
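
A toy sketch of the three-pointer scheme described above (an illustration only, not code from any patch; it assumes a single writer, which in PostgreSQL would be enforced by a lock on the commit path):

```python
class SnapshotRingBuffer:
    """Ring buffer with start/stop/write pointers: the committer advances
    the write pointer, copies the new snapshot in, then advances the stop
    pointer; readers copy [start, stop) and redo the whole read if the
    writer may have lapped them in the meantime."""

    def __init__(self, nslots=6592):        # 64 * 103 slots, as above
        self.buf = [None] * nslots
        self.nslots = nslots
        self.start = 0    # beginning of the current complete snapshot
        self.stop = 0     # end of valid data (monotonically increasing)
        self.write = 0    # high-water mark of in-progress writes

    def publish_snapshot(self, xids):
        # Single writer assumed; pointers only ever move forward.
        self.write += len(xids)
        base = self.stop
        for i, xid in enumerate(xids):
            self.buf[(base + i) % self.nslots] = xid
        self.stop = self.write
        self.start = base            # new snapshot replaces the old one

    def read_snapshot(self):
        while True:
            start, stop = self.start, self.stop
            data = [self.buf[i % self.nslots] for i in range(start, stop)]
            # If the writer wrapped past our start position while we were
            # copying, the data may be torn -- do the whole thing over.
            if self.write - start <= self.nslots and self.start == start:
                return data
```

With a 6592-slot buffer and 100-XID snapshots, a reader would have to fall roughly 65 snapshot rewrites behind before a retry is forced, which matches the "shouldn't happen very often" expectation above.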

With respect to a general-purpose shared memory allocator, I think
that there are cases where that would be useful to have, but I don't
think there are as many of them as many people seem to think.  I
wouldn't choose to implement this using a general-purpose allocator
even if we had it, both because it's undesirable to allow this or any
subsystem to consume an arbitrary amount of memory (nor can it fail...
especially in the abort path) and because a ring buffer is almost
certainly faster than a general-purpose allocator.  We have enough
trouble with palloc overhead already.  That having been said, I do
think there are cases where it would be nice to have... and it
wouldn't surprise me if I end up working on something along those
lines in the next year or so.  It turns out that memory management is
a major issue in lock-free programming; you can't assume that it's
safe to recycle an object once the last pointer to it has been removed
from shared memory - because someone may have fetched the pointer just
before you removed it and still be using it to examine the object.  An
allocator with some built-in capabilities for handling such problems
seems like it might be very useful.
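
The reclamation problem described here is what hazard pointers address; a toy single-process illustration of the idea (purely an assumption-laden sketch, not anything proposed in this thread):

```python
import threading

class HazardPointerDomain:
    """Toy hazard-pointer scheme: a reader publishes the object it is about
    to examine; a writer may only recycle an object once no reader has it
    published.  Real implementations avoid this lock entirely."""

    def __init__(self):
        self._lock = threading.Lock()   # protects the retire list only
        self._hazards = {}              # thread id -> object being read
        self._retired = []

    def protect(self, obj):
        # Reader announces: "I may still be dereferencing this."
        self._hazards[threading.get_ident()] = obj
        return obj

    def release(self):
        self._hazards.pop(threading.get_ident(), None)

    def retire(self, obj):
        # Writer unlinks obj from shared memory, then asks which retired
        # objects are now safe to recycle (no reader still holds them).
        with self._lock:
            self._retired.append(obj)
            hazardous = {id(o) for o in self._hazards.values()}
            freed = [o for o in self._retired if id(o) not in hazardous]
            self._retired = [o for o in self._retired if id(o) in hazardous]
            return freed    # objects now safe to reuse
```

The point of the sketch is only the protocol: unlinking a pointer from shared memory is not the same as being able to reuse its storage.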

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SSL-mode error reporting in libpq

2011-08-22 Thread Daniel Farina
On Sun, Jul 24, 2011 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 We could perhaps go a bit further and make pqsecure_write responsible
 for the error message in non-SSL mode too, but it looks to me like
 pqSendSome has to have a switch on the errno anyway to decide whether to
 keep trying or not, so moving that responsibility would just lead to
 duplicative coding.

 Any objections?

No objection.

Some of our users have been confused by this message.  At first I
thought it was an SSL renegotiation problem, but now I realize that
the message is a lie, so anything can be the problem.  A better
message may well decrease support burden for us, and also help us find
problems...

-- 
fdr



Re: [HACKERS] 9.1rc1: TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

2011-08-22 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Aug 22, 2011 at 4:16 PM, Erik Rijkers e...@xs4all.nl wrote:
 The below SQL causes:
 TRAP: FailedAssertion(!(item_width > 0), File: costsize.c, Line: 3274)

 I can reproduce this on master but not on 9.0.  I suspect the problem
 was introduced by this commit:
 commit 0f61d4dd1b4f95832dcd81c9688dac56fd6b5687

Well, that's a pretty safe bet considering the Assert is in code that
didn't exist before that commit ;-).  The question is why the per-column
width estimate hasn't been set for that column.  Looking at the
coredump, I see that the width *did* get set for the subquery's other
output column, which makes it even weirder.  Too tired to dig further
right now though.

regards, tom lane



Re: [HACKERS] Displaying accumulated autovacuum cost

2011-08-22 Thread Greg Smith

On 08/22/2011 05:54 PM, Jim Nasby wrote:
 I know folks have talked about progress, but I haven't seen anything 
 specific... could you add info about what table/index vacuum is working 
 on, and how far along it is? I realize that's not very close to an 
 actual % completion, but it's far better than what we have right now.

Due to complaints about the mechanism the first version used to inform 
the user of the progress, I'm yanking that from the next patch 
altogether.  The goal for now is to get a good report into the logs, and 
then maybe that gets extended later with a progress report.  (All of the 
proposed alternate mechanisms are way more complicated than anything I 
have time to do right now)

 FWIW, the number I end up caring about isn't so much write traffic as 
 read. Thanks to a good amount of battery-backed write cache (and 
 possibly some iSCSI misconfiguration), our writes are generally much 
 cheaper than our reads.

VACUUM can't really know its true read rate from what's inside the 
database.  I can add a summary of the accumulated read amounts into the 
logs, in more useful figures than what is provided so far, which is 
better than nothing.  But those will be kind of deceptive, which is one 
reason I wasn't so focused on them yet.  If the relation is largely in 
the OS cache, but not the PostgreSQL one, the summary can show a read 
rate even when that isn't actually doing any reads at all.  That was 
exactly the case in the sample data I posted.  VACUUM thought it was 
reading anywhere from 2.5 to 6MB/s.  But at the OS level, it was 
actually reading zero bytes, since the whole thing was in cache already.


What you actually want is a count of the accumulated read counters at 
the OS level.  I've recently figured out how to track those, too, but 
that code is something that lives outside the database.  If this is 
something useful to you, I think you're about to sign up to be my next 
beta tester for that program.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

