Re: [HACKERS] could not open relation 1663/16384/16584: No such file or directory in a specific combination of transactions with temp tables

2008-02-29 Thread Heikki Linnakangas

John Smith wrote:

Architecture: Intel Core 2 Duo
OS: linux-2.6.20-gentoo-r8
Filesystem: ext3
Postgres v8.2.3 compiled with gcc 4.1.1-r3
RAM - 2GB
Shared buffers - 24MB
[All other Postgres configuration parameters are default values]

Problem description:
COPY into a temp table fails with a specific combination of
create/insert on temp tables and prepare/commit in subsequent
transactions. The "could not open relation" error occurs reliably.

Steps to reproduce:

Existing schema (scripts to create and populate these tables are
uploaded to http://upload2.net/page/download/rfsLfnuVlYjEcCJ/basetables.tgz.html
):


I can't get that link to work. Can you please email me the files 
offlist? Or upload somewhere else if they're too big for email.



Observations:
1. The size of the data seems to matter. If the amount of data being
inserted is dropped to just one or two records per table, the error
doesn't happen.
2. The order of columns for the select into temp2 matters. Changing
the order can cause the error to go away.
3. If the prepare/commit is replaced with a plain commit, the error goes away.
4. Removing temp3 or temp4 from the transaction causes one run of
the above statements to succeed, but if the sequence is issued in the
same PSQL session, the second one will fail.
5. Given the current dataset, the error always occurs on line 926 of
the COPY (even if the values at line 926 are changed).
6. The tablespace/database/oid in the error message always corresponds to
that of temp2 on my system.


I think I see what's happening here. We have restricted two-phase commit 
so that you're not supposed to be able to PREPARE TRANSACTION if the 
transaction has touched any temporary tables. That's because the second 
phase of the commit can be performed from another backend, and one backend 
can't mess with another backend's temporary tables.


However in this case, where you CREATE and DROP the temporary table in 
the same transaction, we don't detect that, and let the PREPARE 
TRANSACTION finish. The detection relies on the lock manager, but 
we're not holding any locks on the dropped relation.
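
To make that concrete, a minimal sketch of the kind of sequence being 
described might look like this (illustrative only: the table name, file 
path and transaction id are invented, not the reporter's actual test case, 
and it assumes max_prepared_transactions is set high enough to allow 
prepared transactions):

    BEGIN;
    CREATE TEMP TABLE temp2 (id int, val text);
    COPY temp2 FROM '/tmp/temp2.dat';  -- bulk-load the temp table
    DROP TABLE temp2;                  -- CREATE+DROP in the same transaction
    PREPARE TRANSACTION 'tx1';         -- slips past the temp-table check
    COMMIT PREPARED 'tx1';

    -- A later COPY into another temp table in the same session can then fail
    -- with "could not open relation ...", because stale temp buffers still
    -- point at the dropped relation's file.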


I think we could in fact allow CREATE+DROP in the same transaction, and 
remove the table immediately at PREPARE TRANSACTION, but what happens 
right now is that we store the relfilenode of the temp table in the 
two-phase state file in pg_twophase, for deletion at COMMIT/ROLLBACK 
PREPARED. But we don't store the fact that it's a temporary table, and 
therefore we try to unlink it like a normal table, and fail to purge the 
temp buffers of that table, which causes problems later.


Attached is a simple patch to fix that by disallowing 
CREATE+DROP+PREPARE TRANSACTION more reliably. It'd still be nice to 
debug the full test case of yours to verify that that's what's 
happening, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/twophase.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/twophase.c,v
retrieving revision 1.39
diff -c -r1.39 twophase.c
*** src/backend/access/transam/twophase.c	1 Jan 2008 19:45:48 -	1.39
--- src/backend/access/transam/twophase.c	29 Feb 2008 09:58:19 -
***
*** 793,798 
--- 793,799 
  	TransactionId *children;
  	RelFileNode *commitrels;
  	RelFileNode *abortrels;
+ 	bool haveTempCommit, haveTempAbort;
  
  	/* Initialize linked list */
  	records.head = palloc0(sizeof(XLogRecData));
***
*** 815,824 
  	hdr.prepared_at = gxact->prepared_at;
  	hdr.owner = gxact->owner;
  	hdr.nsubxacts = xactGetCommittedChildren(&children);
! 	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL);
! 	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL);
  	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
  
  	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
  
  	/* Add the additional info about subxacts and deletable files */
--- 816,830 
  	hdr.prepared_at = gxact->prepared_at;
  	hdr.owner = gxact->owner;
  	hdr.nsubxacts = xactGetCommittedChildren(&children);
! 	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL, &haveTempCommit);
! 	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL, &haveTempAbort);
  	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
  
+ 	if (haveTempCommit || haveTempAbort)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ 
  	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
  
  	/* Add the additional info about subxacts and deletable files */
Index: src/backend/access/transam/xact.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.257
diff -c -r1.257 xact.c
*** src/backend/access/transam/xact.c	15 Jan 2008 18:56:59 -	1.257
--- src/backend/access/transam/xact.c	29 Feb 2008 

[HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Pavan Deolasee
I remember Heikki mentioned improving redo recovery in one of the
emails in the past, so I know people are already thinking about this.
I have some ideas and just wanted to get comments here.

ISTM that it's important to keep the redo recovery time as small as possible
in order to reduce the downtime in case of unplanned maintenance.
One way to do this is to take checkpoints very aggressively to keep the
amount of redo work small. But the current checkpoint logic writes all
the dirty buffers to disk and hence generates lots of IO. That limits our
ability to take very frequent checkpoints.

The current redo recovery is a single-threaded, synchronous process.
The XLOG is read sequentially, and each log record is examined and replayed
if required. This requires reading disk blocks into the shared buffers and
applying changes to the buffers. The reading happens synchronously, and
that can make the redo process very slow.

What I am thinking is that if we can read ahead these blocks into the shared
buffers and then apply the redo changes to them, it can potentially improve
things a lot. If there are multiple read requests, the kernel (or controller?)
can probably schedule the reads more efficiently. One way to do this is to
read ahead the XLOG and make asynchronous read requests for these blocks.
But I am not sure if we support asynchronous reads yet. Another (and maybe
easier) way is to fork another process which can just read ahead the XLOG and
get the blocks into memory while the other process does the normal redo
recovery. One obvious downside of reading ahead would be that we may need to
jump backward and forward in the XLOG file, which is otherwise read
sequentially. But that can be handled by using XLOG buffers for redo.

Btw, isn't our redo recovery completely physical in nature? I mean, can we
replay redo logs related to a block independently of other blocks? The reason
I am asking is that if that's the case, ISTM we can introduce parallelism in
recovery by splitting and reordering the xlog records and then running
multiple processes to do the redo recovery.



Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com



Re: [HACKERS] CREATE TABLE, load and freezing

2008-02-29 Thread Simon Riggs
On Thu, 2008-02-28 at 14:38 +0530, Pavan Deolasee wrote:

 I had this idea sometime back. Not sure if this has been discussed before

Check the archives for my post to hackers in Jan 2007 and subsequent
discussion. It's possible, just a little fiddly.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk




Re: [HACKERS] could not open relation 1663/16384/16584: No such file or directory in a specific combination of transactions with temp tables

2008-02-29 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
Attached is a simple patch to fix that by disallowing 
CREATE+DROP+PREPARE TRANSACTION more reliably.


That patch was missing changes to header files. New patch attached.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/twophase.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/twophase.c,v
retrieving revision 1.39
diff -c -r1.39 twophase.c
*** src/backend/access/transam/twophase.c	1 Jan 2008 19:45:48 -	1.39
--- src/backend/access/transam/twophase.c	29 Feb 2008 13:05:24 -
***
*** 793,798 
--- 793,799 
  	TransactionId *children;
  	RelFileNode *commitrels;
  	RelFileNode *abortrels;
+ 	bool haveTempCommit, haveTempAbort;
  
  	/* Initialize linked list */
  	records.head = palloc0(sizeof(XLogRecData));
***
*** 815,824 
  	hdr.prepared_at = gxact->prepared_at;
  	hdr.owner = gxact->owner;
  	hdr.nsubxacts = xactGetCommittedChildren(&children);
! 	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL);
! 	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL);
  	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
  
  	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
  
  	/* Add the additional info about subxacts and deletable files */
--- 816,830 
  	hdr.prepared_at = gxact->prepared_at;
  	hdr.owner = gxact->owner;
  	hdr.nsubxacts = xactGetCommittedChildren(&children);
! 	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL, &haveTempCommit);
! 	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL, &haveTempAbort);
  	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
  
+ 	if (haveTempCommit || haveTempAbort)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("cannot PREPARE a transaction that has operated on temporary tables")));
+ 
  	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
  
  	/* Add the additional info about subxacts and deletable files */
Index: src/backend/access/transam/xact.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.257
diff -c -r1.257 xact.c
*** src/backend/access/transam/xact.c	15 Jan 2008 18:56:59 -	1.257
--- src/backend/access/transam/xact.c	29 Feb 2008 13:05:24 -
***
*** 802,808 
  	TransactionId *children;
  
  	/* Get data needed for commit record */
! 	nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp);
  	nchildren = xactGetCommittedChildren(&children);
  
  	/*
--- 802,808 
  	TransactionId *children;
  
  	/* Get data needed for commit record */
! 	nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp, NULL);
  	nchildren = xactGetCommittedChildren(&children);
  
  	/*
***
*** 1174,1180 
  			 xid);
  
  	/* Fetch the data we need for the abort record */
! 	nrels = smgrGetPendingDeletes(false, &rels, NULL);
  	nchildren = xactGetCommittedChildren(&children);
  
  	/* XXX do we really need a critical section here? */
--- 1174,1180 
  			 xid);
  
  	/* Fetch the data we need for the abort record */
! 	nrels = smgrGetPendingDeletes(false, &rels, NULL, NULL);
  	nchildren = xactGetCommittedChildren(&children);
  
  	/* XXX do we really need a critical section here? */
Index: src/backend/storage/smgr/smgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/smgr/smgr.c,v
retrieving revision 1.109
diff -c -r1.109 smgr.c
*** src/backend/storage/smgr/smgr.c	1 Jan 2008 19:45:52 -	1.109
--- src/backend/storage/smgr/smgr.c	29 Feb 2008 13:05:24 -
***
*** 678,689 
   *
   * If haveNonTemp isn't NULL, the bool it points to gets set to true if
   * there is any non-temp table pending to be deleted; false if not.
   *
   * Note that the list does not include anything scheduled for termination
   * by upper-level transactions.
   */
  int
! smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp)
  {
  	int			nestLevel = GetCurrentTransactionNestLevel();
  	int			nrels;
--- 678,692 
   *
   * If haveNonTemp isn't NULL, the bool it points to gets set to true if
   * there is any non-temp table pending to be deleted; false if not.
+  * haveTemp is similar, but gets set if there are any temp table deletions
+  * pending.
   *
   * Note that the list does not include anything scheduled for termination
   * by upper-level transactions.
   */
  int
! smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, 
! 	  bool *haveNonTemp, bool *haveTemp)
  {
  	int			nestLevel = GetCurrentTransactionNestLevel();
  	int			nrels;
***
*** 693,698 
--- 696,703 
  	nrels = 0;
  	if (haveNonTemp)
  		*haveNonTemp = false;
+ 	if (haveTemp)
+ 		*haveTemp = false;
  	for (pending = pendingDeletes; pending != NULL; pending = 

Re: [HACKERS] could not open relation 1663/16384/16584: No such file or directory in a specific combination of transactions with temp tables

2008-02-29 Thread Gurjeet Singh
Plausible theory, and nice explanation

Try the following link (I had to wait for 50 sec for the link to appear, but
I guess the trade-off of getting knowledge in return is worth it :) )

http://www5.upload2.net/download/77fa86e16a02e52fd5439c76e148d231/47c7fdce/rfsLfnuVlYjEcCJ/basetables.tgz

Not sending the attachment in this mail; that may cause the mail not to be
accepted by the list. I will try to send the attachment in the next mail, to
retain it in the mailing list for historical purposes.

Thanks and best regards,
-- 
[EMAIL PROTECTED]
[EMAIL PROTECTED] gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB  http://www.enterprisedb.com

17° 29' 34.37N,  78° 30' 59.76E - Hyderabad
18° 32' 57.25N,  73° 56' 25.42E - Pune *
37° 47' 19.72N, 122° 24' 1.69 W - San Francisco

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

On Fri, Feb 29, 2008 at 3:32 PM, Heikki Linnakangas [EMAIL PROTECTED]
wrote:

 John Smith wrote:
  Architecture: Intel Core 2 Duo
  OS: linux-2.6.20-gentoo-r8
  Filesystem: ext3
  Postgres v8.2.3 compiled with gcc 4.1.1-r3
  RAM - 2GB
  Shared buffers - 24MB
  [All other Postgres configuration parameters are default values]
 
  Problem description:
  COPY into a temp table fails with a specific combination of
  create/insert on temp tables and prepare/commit in subsequent
  transactions. The "could not open relation" error occurs reliably.
 
  Steps to reproduce:
 
  Existing schema (scripts to create and populate these tables are
  uploaded to
 http://upload2.net/page/download/rfsLfnuVlYjEcCJ/basetables.tgz.html
  ):

 I can't get that link to work. Can you please email me the files
 offlist? Or upload somewhere else if they're too big for email.

  Observations:
  1. The size of the data seems to matter. If the amount of data being
  inserted is dropped to just one or two records per table, the error
  doesn't happen.
  2. The order of columns for the select into temp2 matters. Changing
  the order can cause the error to go away.
  3. If the prepare/commit is replaced with a commit; the error goes
 away.
  4. Removing temp3 or temp4 from the transaction causes one run of
  the above statements to succeed, but if the sequence is issued in the
  same PSQL session, the second one will fail.
  5. Given the current dataset, the error always occurs on line 926 of
  the COPY (even if the values at line 926 are changed).
  6. The tablespace/database/oid in the error message always corresponds to
  that of temp2 on my system.

 I think I see what's happening here. We have restricted two-phase commit
 so that you're not supposed to be able to PREPARE TRANSACTION if the
 transaction has touched any temporary tables. That's because the 2nd
 phase commit can be performed from another backend, and another backend
 can't mess with another backend's temporary tables.

 However in this case, where you CREATE and DROP the temporary table in
 the same transaction, we don't detect that, and let the PREPARE
 TRANSACTION to finish. The detection relies on the lock manager, but
 we're not holding any locks on the dropped relation.

 I think we could in fact allow CREATE+DROP in the same transaction, and
 remove the table immediately at PREPARE TRANSACTION, but what happens
 right now is that we store the relfilenode of the temp table to the
 two-phase state file in pg_twophase, for deletion at COMMIT/ROLLBACK
 PREPARED. But we don't store the fact that it's a temporary table, and
 therefore we try to unlink it like a normal table, and fail to purge the
 temp buffers of that table which causes problems later.

 Attached is a simple patch to fix that by disallowing
 CREATE+DROP+PREPARE TRANSACTION more reliably. It'd still be nice to
 debug the full test case of yours to verify that that's what's
 happening, though.

 --
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com






[HACKERS] Why we panic in pglz_decompress

2008-02-29 Thread Zdenek Kotala
I'm now looking into the toast code and found the following code in
pglz_decompress:


if (destsize != source->rawsize)
    elog(destsize > source->rawsize ? FATAL : ERROR,
         "compressed data is corrupt");


I'm surprised that we panic there. In my opinion it is not a good idea to
crash the server when we know how much memory we really have for dest
and can check the range. Another silly thing is that the message
"compressed data is corrupt" does not contain any information about the
file, relation, etc.


My idea is to improve this piece of code and move the error logging to the
callers (heap_tuple_untoast_attr() and heap_tuple_untoast_attr_slice()),
where we have a little more detail (especially for external storage).


Any comments?

thanks Zdenek



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Florian Weimer
* Pavan Deolasee:

 The current redo-recovery is a single threaded, synchronous process.
 The XLOG is read sequentially, each log record is examined and
 replayed if required. This requires reading disk blocks in the
 shared buffers and applying changes to the buffer. The reading
 happens synchronously and that would usually make the redo process
 very slow.

Are you sure that it's actually slow for that reason?  Sequential I/O
on the log is typically quite fast, and if the pages dirtied since the
last checkpoint fit into the cache (shared buffers or OS cache), even
that part of recovery does not result in lots of random I/O (with 8.3
and full page writes active; this is a relatively recent change).

In the end, I wouldn't be surprised if for most loads, cache warming
effects dominated recovery times, at least when the machine is not
starved on RAM.

-- 
Florian Weimer[EMAIL PROTECTED]
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Tom Lane
Florian G. Pflug [EMAIL PROTECTED] writes:
 I know that Simon has some ideas about parallel restore, though I don't
 know how he wants to solve the dependency issues involved. Perhaps by
 not parallelizing within one table or index...

I think we should be *extremely* cautious about introducing any sort of
parallelism or other hard-to-test behavior into xlog recovery.  Bugs
in that area will by definition bite people at the worst possible time.
And we already know that we don't have very good testing ability for
xlog recovery, because some pretty nasty bugs have gone undetected
for long periods.

regards, tom lane



Re: [HACKERS] Why we panic in pglz_decompress

2008-02-29 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Zdenek Kotala wrote:
 I'm now looking into the toast code and found the following code in
 pglz_decompress:

 if (destsize != source->rawsize)
     elog(destsize > source->rawsize ? FATAL : ERROR,
          "compressed data is corrupt");


 I'm surprised that we panic there.

 Agreed, FATAL is too strong.

Did either of you read the comment just before this code?  The reason
it's panicking is that it's possibly already tromped on some critical
data structure inside the backend.

 My idea is to improve this piece of code and move error logging to  
 callers (heap_tuple_untoast_attr() and heap_tuple_untoast_attr_slice())  
 where we have a little bit more details (especially for external 
 storage).

 Why move it?  Just adding errcontext in the callers should be enough.

AFAIR this error has never once been reported from the field, so I don't
see the point of investing a lot of effort in it.

regards, tom lane



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

What I am thinking is if we can read ahead these blocks in the shared
buffers and then apply redo changes to them, it can potentially
improve things a lot. If there are multiple read requests, the kernel
(or controller?) can probably schedule the reads more efficiently.

The same holds true for index scans, though. Maybe we can find a
solution that benefits both cases - something along the lines of a
bgreader process.

Btw, isn't our redo recovery completely physical in nature? I mean,
can we replay redo logs related to a block independently of other
blocks? The reason I am asking is that if that's the case, ISTM we
can introduce parallelism in recovery by splitting and reordering the
xlog records and then running multiple processes to do the redo
recovery.


I'd say it's physical on the tuple level (we just log the new tuple on an
update, not how to calculate it from the old one), but logical on the
page level (we log the fact that a tuple was inserted on a page, but
e.g. the physical location of the tuple on the page can come out
differently upon replay). It's even more logical for indices, because
we log page splits as multiple wal records, letting the recovery process
deal with synthesizing upper-level updates should we crash in the middle
of a page split. Additionally, we log full-page images as a safeguard
against torn page writes. Those would need to be considered as a kind of
reorder barrier in any parallel restore scenario, I guess.

I know that Simon has some ideas about parallel restore, though I don't
know how he wants to solve the dependency issues involved. Perhaps by
not parallelizing within one table or index...





Re: [HACKERS] Why we panic in pglz_decompress

2008-02-29 Thread Zdenek Kotala

Alvaro Herrera wrote:

Zdenek Kotala wrote:
I'm now looking into the toast code and found the following code in
pglz_decompress:


if (destsize != source->rawsize)
    elog(destsize > source->rawsize ? FATAL : ERROR,
         "compressed data is corrupt");


I'm surprised that we panic there.


Agreed, FATAL is too strong.

My idea is to improve this piece of code and move error logging to  
callers (heap_tuple_untoast_attr() and heap_tuple_untoast_attr_slice())  
where we have a little bit more details (especially for external 
storage).


Why move it?  Just adding errcontext in the callers should be enough.


Good idea.

thanks Zdenek



Re: [HACKERS] Why we panic in pglz_decompress

2008-02-29 Thread Alvaro Herrera
Zdenek Kotala wrote:
 I'm now looking into the toast code and found the following code in
 pglz_decompress:

 if (destsize != source->rawsize)
     elog(destsize > source->rawsize ? FATAL : ERROR,
          "compressed data is corrupt");


 I'm surprised that we panic there.

Agreed, FATAL is too strong.

 My idea is to improve this piece of code and move error logging to  
 callers (heap_tuple_untoast_attr() and heap_tuple_untoast_attr_slice())  
 where we have a little bit more details (especially for external 
 storage).

Why move it?  Just adding errcontext in the callers should be enough.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [BUGS] bug or not bug, xmlvalidate(xml, text) can read and show one line from file

2008-02-29 Thread Tom Lane
Sergey Burladyan [EMAIL PROTECTED] writes:
 [ xmlvalidate is a security hole ]

Given that this function is not documented nor tested in the regression
tests, I propose diking it out entirely.

regards, tom lane



Re: [HACKERS] Why we panic in pglz_decompress

2008-02-29 Thread Zdenek Kotala

Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes:

Zdenek Kotala wrote:
I'm now looking into the toast code and found the following code in
pglz_decompress:


if (destsize != source->rawsize)
    elog(destsize > source->rawsize ? FATAL : ERROR,
         "compressed data is corrupt");


I'm surprised that we panic there.



Agreed, FATAL is too strong.


Did either of you read the comment just before this code?  The reason
it's panicking is that it's possibly already tromped on some critical
data structure inside the backend.


Yes I did, but if you know how much memory you have for the uncompressed data,
you can check the boundaries. That is better than overwriting data in
memory. Yes, it slows the routine down a little, but you will still be able
to work with the table.


My idea is to improve this piece of code and move error logging to  
callers (heap_tuple_untoast_attr() and heap_tuple_untoast_attr_slice())  
where we have a little bit more details (especially for external 
storage).



Why move it?  Just adding errcontext in the callers should be enough.


AFAIR this error has never once been reported from the field, so I don't
see the point of investing a lot of effort in it.


Please increment that counter :-). I'm now analyzing a core file, and it
finally fails in the elog function (called from pglz_decompress), because
memory was overwritten - no error message in the log file. :(


Zdenek



[HACKERS] request feature: alter view add column (or something like)

2008-02-29 Thread Olivier Thauvin
Sometimes I have to add a column to a table, and then add the column to some views.

Currently it is impossible to alter a view to add a new column, which is really
annoying. The only solution is to drop the view, and all dependent views,
then recreate everything.
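
For illustration, a rough sketch of the limitation (hypothetical objects; not
a proposal):

    CREATE TABLE t (a int);
    CREATE VIEW v AS SELECT a FROM t;
    ALTER TABLE t ADD COLUMN b int;
    -- There is no ALTER VIEW ... ADD COLUMN, and (as of 8.3) CREATE OR
    -- REPLACE VIEW refuses to change the view's column list, so one must:
    DROP VIEW v;                          -- plus every view built on top of v
    CREATE VIEW v AS SELECT a, b FROM t;  -- and then recreate them all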

Someone on #postgresqlfr told me there has been discussion about that.

I think this would be a big improvement for pg 8.4.

My 2 cents.

Regards.




[HACKERS] creating new aggregate function

2008-02-29 Thread Justin

Need help and direction  creating new aggregate functions.

We need to add more average functions for both scientific and financial
purposes:


RMS for electrical measurement purposes
Mode for both electrical and financial purposes
Weighted average for financial purposes
Generalized mean for electrical measurement purposes
Geometric mean for electrical measurement purposes
Harmonic mean for electrical measurement purposes

What would be the best way to create these new functions?





Re: [HACKERS] creating new aggregate function

2008-02-29 Thread Sam Mason
On Fri, Feb 29, 2008 at 12:11:59PM -0500, Justin wrote:
 Need help and direction  creating new aggregate functions.
 
 We need to add more average functions for both scientific and financial 
 purposes

[ ... ]

 what would be the best way to create these new functions??

I'd be tempted to look at pl/r [1]; R [2] is a language for statistical
analysis, and pl/r integrates it into Postgres.


  Sam

 [1] http://joeconway.com/plr/
 [2] http://www.r-project.org/



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Simon Riggs
On Fri, 2008-02-29 at 15:49 +0100, Florian G. Pflug wrote:

 I know that Simon has some ideas about parallel restore, though I don't
 know how he wants to solve the dependency issues involved. Perhaps by
 not parallelizing within one table or index...

Well, I think that problem is secondary to making progress with your
work on hot standby. I don't want to tune the existing setup and then
make it harder to introduce new features.

I'm aiming to review your patches in this commit fest, with a view to
getting the work fully committed by 4-6 months from now, assuming you're
happy to make any changes we identify. That still leaves us time to tune
things before the next release.

The hope is to increase the level of functionality here. We may not be
able to move forward in just one more stride. Warm Standby has taken the
last 4 releases to mature to where we are now, and the work ahead is at
least as difficult as what has gone before.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk




Re: [HACKERS] Buildfarm member gypsy_moth seems not to like alignment patch

2008-02-29 Thread Tom Lane
Jorgen Austvik - Sun Norway [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 This is unfortunate and surprising, since that patch was intended to
 prevent compilers from making unsafe alignment assumptions, but it sure
 looks like this compiler has instead added a new one.  Could you poke
 into it --- at least get a stack trace from the core dump?

 Running initdb with debug:

Hah, I guess the problem is that the compiler decided it was okay to put
the local variable chunk_data at an unaligned address.  Patched, but
we'll have to see whether any other places have similar issues.

Thanks for doing the gdb-work.

regards, tom lane



[HACKERS] Re: [BUGS] bug or not bug, xmlvalidate(xml, text) can read and show one line from file

2008-02-29 Thread Alvaro Herrera
Peter Eisentraut wrote:
 Am Freitag, 29. Februar 2008 schrieb Tom Lane:
  Sergey Burladyan [EMAIL PROTECTED] writes:
   [ xmlvalidate is a security hole ]
 
  Given that this function is not documented nor tested in the regression
  tests, I propose diking it out entirely.
 
 Yes, it was accidentally left over from previous work.  We should have 
 removed 
 it before the release, but that would have required an initdb.

So let's replace it with a function that elog(ERROR)s on entry.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Decibel!

On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:

In the end, I wouldn't be surprised if for most loads, cache warming
effects dominated recovery times, at least when the machine is not
starved on RAM.



Uh... that's exactly what all the synchronous reads are doing...  
warming the cache. And synchronous reads are only fast if the system  
understands what's going on and reads a good chunk of data in at  
once. I don't know that that happens.


Perhaps a good short-term measure would be to have recovery allocate  
a 16M buffer and read in entire xlog files at once.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [HACKERS] [BUGS] bug or not bug, xmlvalidate(xml, text) can read and show one line from file

2008-02-29 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Peter Eisentraut escribió:
 Yes, it was accidentally left over from previous work.  We should have 
 removed 
 it before the release, but that would have required an initdb.

 So let's change it for a function that elog(ERROR)s on entry.

Yeah, I was just going to make it throw a "not implemented" error.

regards, tom lane



Re: [HACKERS] creating new aggregate function

2008-02-29 Thread Webb Sprague
This probably belongs on General, but...

On Fri, Feb 29, 2008 at 9:11 AM, Justin [EMAIL PROTECTED] wrote:
 Need help and direction  creating new aggregate functions.

  We need to add more average functions for both scientific and finical
  purposes

  RMS for electrical measurement purposes
  Mode for both electrical and finical
  Weighted Average  finical  purposes
  Generalized mean for electrical measurement purposes
  Geometric mean for electrical measurement purposes
  Harmonic mean for electrical measurement purposes

  what would be the best way to create these new functions??

Have you already read the documentation  on creating aggregates?  Have
you tried something and it didn't work, or are you interested in
design ideas?  I would just knock together something in plpgsql.  If
you have trouble, send the specific questions to the list.

"Best"?  C is fastest, etc. -- what kind of tradeoffs do you need to satisfy?

One thing worth thinking about is using arrays to carry state from one
function call to the next in an aggregate; this is how the function
used in the average aggregate keeps track of both the running total
and the number of rows.  The stddev aggregate uses a three-item array similarly.
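
As a rough sketch of that array-state approach, one of the requested
aggregates (a geometric mean) could be put together roughly like this; the
function names and the numeric handling are invented for illustration:

    CREATE FUNCTION geomean_accum(state float8[], val float8)
    RETURNS float8[] AS $$
        SELECT ARRAY[state[1] + ln(val), state[2] + 1];
    $$ LANGUAGE sql IMMUTABLE STRICT;

    CREATE FUNCTION geomean_final(state float8[])
    RETURNS float8 AS $$
        SELECT CASE WHEN state[2] = 0 THEN NULL
                    ELSE exp(state[1] / state[2]) END;
    $$ LANGUAGE sql IMMUTABLE;

    CREATE AGGREGATE geomean(float8) (
        SFUNC = geomean_accum,
        STYPE = float8[],
        FINALFUNC = geomean_final,
        INITCOND = '{0,0}'
    );

    -- usage: SELECT geomean(reading) FROM measurements;
    -- (positive inputs only, since the state accumulates ln(val))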







Re: [HACKERS] [BUGS] bug or not bug, xmlvalidate(xml, text) can read and show one line from file

2008-02-29 Thread Peter Eisentraut
Am Freitag, 29. Februar 2008 schrieb Tom Lane:
 Sergey Burladyan [EMAIL PROTECTED] writes:
  [ xmlvalidate is a security hole ]

 Given that this function is not documented nor tested in the regression
 tests, I propose diking it out entirely.

Yes, it was accidentally left over from previous work.  We should have removed 
it before the release, but that would have required an initdb.



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Heikki Linnakangas

Decibel! wrote:

On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:

In the end, I wouldn't be surprised if for most loads, cache warming
effects dominated recovery times, at least when the machine is not
starved on RAM.



Uh... that's exactly what all the synchronous reads are doing... warming 
the cache. And synchronous reads are only fast if the system understands 
what's going on and reads a good chunk of data in at once. I don't know 
that that happens.


Perhaps a good short-term measure would be to have recovery allocate a 
16M buffer and read in entire xlog files at once.


The problem isn't reading the WAL. The OS prefetches that just fine.

The problem is the random reads, when we read in the blocks mentioned in 
the WAL records, to replay the changes to them. The OS has no way of 
guessing and prefetching those blocks, and we read them synchronously, 
one block at a time, no matter how big your RAID array is.


I used to think it's a big problem, but I believe the full-page-write 
optimization in 8.3 made it much less so. Especially with the smoothed 
checkpoints: as checkpoints have less impact on response times, you can 
shorten the checkpoint interval, which helps to keep the recovery time 
reasonable.


It'd still be nice to do the prefetching; I'm sure there are still 
workloads where it would be a big benefit. But as Tom pointed out, we 
shouldn't invent something new just for recovery. I think we should look 
at doing prefetching for index accesses etc. first, and once we have the 
infrastructure in place and tested, we can consider using it for recovery 
as well.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Tom Lane
Decibel! [EMAIL PROTECTED] writes:
 Perhaps a good short-term measure would be to have recovery allocate  
 a 16M buffer and read in entire xlog files at once.

If that isn't entirely useless, you need a better kernel.  The system
should *certainly* be bright enough to do read-ahead for our reads of
the source xlog file.  The fetches that are likely to be problematic are
the ones for pages in the data area, which will be a lot less regular
for typical workloads.

regards, tom lane



Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Aidan Van Dyk
* Tom Lane [EMAIL PROTECTED] [080229 15:49]:
 
 If that isn't entirely useless, you need a better kernel.  The system
 should *certainly* be bright enough to do read-ahead for our reads of
 the source xlog file.  The fetches that are likely to be problematic are
 the ones for pages in the data area, which will be a lot less regular
 for typical workloads.

How difficult is it to parse the WAL logs with enough knowledge to know
what heap page (file/offset) a wal record contains (I haven't looked
into any wal code)?

There are compression/decompression archive_command/restore_command
programs with rudimentary knowledge of the WAL record formats.  Would a
restore_command be able to parse the wal records as it copies them
over, noting which file pages need to be read, and then just before it
exits, fork() and read each page in order?

This child doesn't need to do anything with the blocks it reads - it
just needs to read them to pre-warm the kernel buffer cache...  If the
restoration is doing any writing, this dumb reader would hopefully be
able to keep a block ahead...  And since it's separated enough from the
backend, any experiments in async_io/fadvise could easily be done.

-- 
Aidan Van Dyk Create like a god,
[EMAIL PROTECTED]   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [HACKERS] could not open relation 1663/16384/16584: No such file or directory in a specific combination of transactions with temp tables

2008-02-29 Thread Heikki Linnakangas

Gurjeet Singh wrote:

Plausible theory, and nice explanation

Try the following link (I had to wait for 50 sec for the link to appear, but
I guess the trade-off of getting knowledge in return is worth it :) )

http://www5.upload2.net/download/77fa86e16a02e52fd5439c76e148d231/47c7fdce/rfsLfnuVlYjEcCJ/basetables.tgz

Not sending the attachment in this mail; that may cause the mail not to be
accepted by the list. I will try to send the attachment in the next mail, to
retain it in the mailing list for historical purposes.


Thanks!

As I suspected, what's happening is that buffers for the dropped temp 
table are not cleaned up from the temporary buffer cache as they should 
be. The next time the buffers are needed, on the next COPY, we try to 
write out the buffers to make room for new pages, but that fails because 
the file the buffers belong to doesn't exist anymore.


The patch I sent earlier fixes that, by tightening the check so that you 
can't operate on temporary tables in 2PC transactions.


If you had a real-world use case for that, sorry :-(. Perhaps we could 
enhance that for 8.4 if there's demand, so that you could CREATE+DROP or 
use ON COMMIT TRUNCATE temp tables in a transaction, though I haven't 
personally planned to work on it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pg_dump additional options for performance

2008-02-29 Thread Decibel!

On Feb 26, 2008, at 4:36 PM, Tom Lane wrote:

I think a sane way to think about what Simon would like to accomplish
is not turn psql into a parallel job scheduler



My $0.02: I often find myself wishing I could perform parallel  
operations in psql. There was a proposal for that that came up during  
8.3 development; whatever happened to it?

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Gregory Stark

Florian G. Pflug wrote:
 The same holds true for index scans, though. Maybe we can find a
 solution that benefits both cases - something along the line of a
 bgreader process

I posted a patch to do readahead for bitmap index scans using posix_fadvise.
Experiments showed it works great on RAID arrays on Linux. Solaris will need
to use libaio, though, which I haven't tried yet.

Incidentally, I'm off on a ski vacation this week so I won't be around much on
email for the first half of the commit-fest. If anyone's putting together a
list of patches queued up for review I would like this patch considered. 

The main feedback I'm blocking on is whether others are interested in
restructuring the buffer manager to allow buffers to be allocated and pinned
with only the posix_fadvise i/o initiated. That would avoid the redundant trip
into the buffer manager for the usual case at the expense of a few buffers
being taken out of the cache. The patch I posted is the minimally invasive
approach of not altering the buffer management at all and just passing through
a readahead request.

I looked at doing it for normal index scans and couldn't think of any
convenient way. I can see getting a single buffer of read-ahead from the index
block's next pointer but that's about it. Luckily it seems to me that bitmap
index scans are much more likely to be chosen in the cases where there's a big
gain anyways.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Florian G. Pflug

Greg Stark wrote:

Florian G. Pflug wrote:
The same holds true for index scans, though. Maybe we can find a 
solution that benefits both cases - something along the line of a 
bgreader process
I posted a patch to do readahead for bitmap index scans using 
posix_fadvise. Experiments showed it works great on raid arrays on 
Linux. Solaris will need to use libaio though which I haven't tried 
yet.

Cool! I'd like to try it out - is that patch available in the pg-patches
archives?

Doing it for normal index scans is much much harder. You can 
readahead a single page by using the next pointer if it looks like 
you'll need it. But I don't see a convenient way to get more than 
that.

I was thinking that after reading a page from the index, the backend
could post a list of heap pages referenced from that index page to the
shmem. A background process would repeatedly scan that list, and load
those pages into the buffer cache.

regards, Florian Pflug


