[HACKERS] HOT WIP Patch - version 4.1

2007-03-08 Thread Pavan Deolasee


Please see the HOT WIP patch, version 4.1, posted on -patches.
There are no significant changes since the version 4.0 patch that
I posted a week back.

This patch includes some optimizations for efficiently looking
up LP_DELETEd tuples. I have used the recent changes made by
Tom/Heikki which give us a few bits in the page header. I use one
bit to track whether there are any LP_DELETEd tuples in the page.
The changes to this bit are not WAL-logged and hence the
information might not be accurate. But we should be ok with
that.

Another non-trivial change is the addition of logic to clean
up row level fragmentation. I have discussed this earlier on
the list, but nevertheless would summarize it again here.

When we reuse LP_DELETEd tuples for UPDATE, we might waste a
few bytes when the original size of the reused tuple is
larger than that of the new tuple. The information about the original
length is lost. When we run out of LP_DELETEd tuples of a
size equal to or greater than the requested size in the UPDATE path,
we correct the row level fragmentation, if any. Please note,
we don't move tuples around and hence don't need the
VACUUM-strength lock on the page.

We use another bit in the page header to track if there are
any fragmented LP_DELETEd tuples in the page. We also need
one bit in the tuple header to track that the particular
tuple was fragmented while being reused. This information
is then used when the tuple is again released and marked
LP_DELETE, to update the page-level hint bit.
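
As a rough illustration of how the two hint bits are meant to be used (a
sketch only, with made-up flag and struct names rather than the patch's or
PostgreSQL's actual page layout):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's page-level hint bits. */
#define PD_HAS_LP_DELETED  0x0001   /* page may hold LP_DELETEd tuples */
#define PD_HAS_FRAGMENTED  0x0002   /* some reused slots were truncated */

typedef struct PageHint
{
    uint16_t    flags;              /* not WAL-logged, so only a hint */
} PageHint;

/*
 * Is it worth scanning this page's line pointers for a reusable
 * LP_DELETEd slot?  Because the bit is not WAL-logged it can be stale:
 * a false positive costs a wasted scan, a false negative delays reuse.
 */
static bool
page_may_have_reusable_slot(const PageHint *page)
{
    return (page->flags & PD_HAS_LP_DELETED) != 0;
}

/*
 * When a tuple is released and marked LP_DELETE again, push what we
 * know about it back up into the page-level hints.
 */
static void
mark_released_tuple(PageHint *page, bool tuple_was_fragmented)
{
    page->flags |= PD_HAS_LP_DELETED;
    if (tuple_was_fragmented)
        page->flags |= PD_HAS_FRAGMENTED;
}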


Comments/suggestions ?

Thanks,
Pavan

--


EnterpriseDB    http://www.enterprisedb.com


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread NikhilS

Hi,

Maybe the dummy entry could be extended to contain the bounds (max/min) for
each of the other involved partitions, and they could be updated each time a
DML happens across the partitions. That way, an update to a particular
partition needs to lock out the others, examine the dummy entries in its own
index, and follow it up with dummy entry updates into the other partitions if
need be.

Of course, as you have mentioned, all of this needs to be done after a
careful think about the locking/deadlocking etc. issues.

Regards,
Nikhils

On 3/7/07, Alvaro Herrera [EMAIL PROTECTED] wrote:


I am wondering if we can implement unique indexes across several tables
(inheritance hierarchy) not by using a single, big index covering all
the tables, but rather by inserting a dummy entry into each partition's
unique index.  This dummy entry would have an expanded CTID which would
include the tableoid, so it's possible to check it (albeit there is a
problem in that we may require the opening of another heap to do the
actual checking).  These dummy entries could be removed by bulkcleanup
as soon as the inserting transaction is no longer running, to avoid
bloating the index too much.  All said dummy index entries would be
located at either the rightmost or the leftmost leaf, or close to it, so
another idea is to have future inserters reuse the entry for a different
key.

The obvious problem with this is, naturally, the excess I/O that extra
index traversing causes.  The not so obvious ones are locking,
deadlocking and the opening of other heaps and indexes while you do the
insertion, which may be too expensive.  On the other hand, maybe this
idea is easier to implement than full-fledged cross-table indexes, so we
could have richer partitioning earlier than when somebody finally bites
the bullet and implements cross-table indexes.

Or maybe this is just a dumb idea, but I had to let it out anyway :-)

--
Alvaro Herrera
http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.





--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread NikhilS

Hi,



There are 2 other reasons to favor triggers though:

1) People (Josh Drake comes to mind) have found that if you get over
a tiny number of partitions, the performance of rules is abysmal.

2) I believe it should be possible to construct an update trigger
that allows you to perform updates that will place the row in
question into a new partition. While I can see cases for simply
disallowing updates to the partitioning key, I think there are also
times when being able to do that would be very useful.



The consensus seems to be veering towards triggers.



I think it'd be great to make adding and removing partitions as
simple as ALTER TABLE. I don't think that DELETE should be the
mechanism to drop a partition, though. Again, DML statements
shouldn't be performing DDL.



Since partitioning is inheritance-based, a simple DROP or NO INHERIT will do
the job to deal with the partition. Do we want to reinvent additional syntax
when these are around and are documented?

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Proposed ProcessUtility() API additions

2007-03-08 Thread Heikki Linnakangas

Tom Lane wrote:

The point of adding is_top_level is to provide a simpler, more reliable
means for PreventTransactionChain and related functions to detect
whether a function is trying to invoke a non-transaction-block-safe
command.  Currently we rely on an ugly test involving seeing if the
statement node is in the QueryContext, but that's always been a kluge,
and I'm not sure that it works 100% even today.  I'd like to get rid
of the QueryContext global altogether.


Great! I once played with the idea of using a simpler MemoryContext 
implementation in the parser to save some CPU cycles, and 
PreventTransactionChain stopped working. I don't know if what I was 
doing was a good idea, but PreventTransactionChain is definitely a hack.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Simon Riggs
On Wed, 2007-03-07 at 22:32 -0500, Luke Lonergan wrote:
 Andreas,
 
 On 3/7/07 11:45 AM, Zeugswetter Andreas ADI SD [EMAIL PROTECTED]
 wrote:
 
  Whoa, do you have anything to back that up ?
 
 Sure - when we start to consider designs that implement advanced data
 management features, we run into problems with the architecture of
 tables-tables-tables  Here are some examples:
 1 - people think of partitions as a logical building block for tables, they
 would like to move partitions around underneath a table without the table
 definition being involved.  In the current implementation, there are
 explicit linkages between the table definition and the child tables -
 imagine an ALTER TABLE foo_parent ADD COLUMN and how it would need to
 cascade to 1,000 child tables and you get the beginning of it - this
 connection should not exist.

The inheritance can work at multiple levels, so its up to you how you
manage things. If you want to add a column to only the newest tables,
you just add a new mid-level table, add the new column only to that and
then make all new partitions inherit from that table rather than the
main table. So I don't see the objection here, I see a benefit.

 2 - INSERT/UPDATE/DELETE processing through the SQL rewrite layer (rules) is
 terribly slow and gets slower as you add more partitions.  If done closer to
 the storage layer, this can be done in ways that use access methods shared
 with other storage entities, e.g. Indices, and the code path would flow more
 naturally.

That bit is just syntactic sugar and unnecessary, for most applications,
in my experience - especially because it doesn't work with COPY.

People do seem to want it, so as I said upthread, we need a way of
speeding up the selection of the appropriate partition, so we can get
this to work for individual INSERTs.

This needs substantial improvement, but I don't see this challenging the
partition == table assumption.

 3 - Parallel query can be accomplished more easily by separating scans
 across relations split among tablespaces.  This is more natural than trying
 to parallelize APPEND nodes within existing plans

Tables -> Partitions -> Tablespaces

Changing Partitions from Tables to SomethingElse won't increase the
current capability to define the disk layout for concurrent I/O.

Parallel Query needs to be planner-aware, so if we change partitions
from being tables, then we'd need to reintroduce them to the planner.
APPEND is clearly not the only thing that's needed for parallel query.

Note to Nikhil: Make sure the new syntax doesn't prevent partitions from
being placed upon multiple tablespaces in some manner, at CREATE TABLE
time.

  You would need to elaborate what you actually mean, but I think it is
  moot.
  Sure, the constraint technique can be further extended (e.g. during
  runtime), but imho the approach is very good.
 
 Well, it's being used and that's good, but it needs to be better IMO and I
 think that before we go too far down the current path we should consider the
 alternatives more carefully.

I'm happy to reconsider things, but we need to do that with some clear
analysis of what doesn't work yet and how best to implement that.

None of what's been mentioned requires us to reconsider the 
Partition == Table assumption. There were other ways considered, but
they didn't win out in the analysis, for documented reasons.

If there are benefits to having partitions act like tables, then maybe
we can make them behave differently in just those circumstances. Pack
animals behave differently in a group, so why not tables, if need be?
But when is that exactly?

There is a ton of work to make partitioning the elegant beast we'd like
it to be, but that seems like extension only, not rip and replace.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Simon Riggs
On Wed, 2007-03-07 at 21:27 -0700, Jim Nasby wrote:
 On Mar 7, 2007, at 3:26 AM, Simon Riggs wrote:
  If you know that the constraints on each of the tables is distinct,  
  then
  building a UNIQUE index on each of the partitions is sufficient to  
  prove
  that all rows in the combined partitioned table are distinct also.
 
  The hard part there is checking that the partition constraints are
  distinct. If the partition constraints are added one at a time, you  
  can
  use the predicate testing logic to compare the to-be-added partition's
  constraint against each of the already added constraints. That becomes
  an O(N) problem.
 
  What is really needed is a data structure that allows range partitions
  to be accessed more efficiently. This could make adding partitions and
  deciding in which partition a specific value goes an O(logN)  
  operation.
 
 Directing data to child tables with triggers pretty much necessitates  
 having some way to codify what partition a particular row belongs in.  
 IE: for partitioning by month, you'll see things like naming the  
 partition tables parent_table_name_$YEAR_$MONTH, so the  
 'partitioning function' takes a date or timestamp and then returns  
 what partition it belongs to. Perhaps there is some way to use that  
 mapping to drive the selection of what partitions could contain a  
 given value?
 
 One possibility would be to require 3 functions for a partitioned  
 table: one accepts the partitioning key and tells you what partition  
 it's in, one that tells you what the minimum partitioning key for a  
 partition would be, and one that tells you what the maximum would be.  
 If the user supplied those 3 functions, I think it would be possible
 to automatically generate code for the triggers and check  
 constraints. The min/max partition key functions might allow you to  
 more efficiently do partition elimination, too.

ISTM this is a good idea.

SQLServer uses partitioning functions and I like that approach. It makes
it much easier to do partition-wise joins between tables that share
partitioning functions.
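
To make the partitioning-function idea concrete, here is a minimal C sketch
of the three functions for month-based range partitioning (hypothetical
names and representation; the real thing would presumably be SQL-callable
functions):

#include <stdio.h>
#include <time.h>

/* Partitions are numbered year*12 + month of the key's calendar month. */

/* Which partition does this key (a Unix timestamp) belong to? */
static int
partition_for(time_t key)
{
    struct tm tm;

    localtime_r(&key, &tm);
    return (tm.tm_year + 1900) * 12 + tm.tm_mon;
}

/* Lowest key that can appear in the given partition. */
static time_t
partition_min(int part)
{
    struct tm tm = {0};

    tm.tm_year = part / 12 - 1900;
    tm.tm_mon = part % 12;
    tm.tm_mday = 1;
    tm.tm_isdst = -1;
    return mktime(&tm);
}

/* Exclusive upper bound for the given partition. */
static time_t
partition_max(int part)
{
    return partition_min(part + 1);
}

int
main(void)
{
    time_t  now = time(NULL);
    int     p = partition_for(now);

    printf("partition %d covers [%ld, %ld)\n",
           p, (long) partition_min(p), (long) partition_max(p));
    return 0;
}

Given partition_min() and partition_max(), generating the CHECK constraint
and the INSERT-routing trigger for each child table becomes mechanical,
which is the point Jim was making.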

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] WSAStartup() in libpq

2007-03-08 Thread Magnus Hagander
It has been brought to my attention by Tokuharu Yuzawa that our calling
of WSAStartup() in DllMain() (libpqdll.c) is incorrect. Basically we're
calling WSAStartup() so that the client application does not have to.
However, due to the fact that WSAStartup() can itself load libraries,
there is a risk of deadlocking here. See for example:

MSDN docs
http://msdn2.microsoft.com/en-us/library/aa910684.aspx

MIT Kerberos have had the same problem:
http://mailman.mit.edu/pipermail/krbdev/2005-March/003244.html

And even MS had the same bug in their own native database library:
http://support.microsoft.com/kb/818539

There's also a note about socket issues in DLLs on:
http://support.microsoft.com/kb/q237572/


The easy fix for this is to remove the calls. Which obviously will break
some client apps. A fairly easy fix for the WSAStartup() call is to have
a check in the connection functions against a global variable that will
then make sure to call WSAStartup() the first time it's called.

That would leave us leaking the WSAStartup() call, but only one per
application. This is not perfect, but I'm thinking we can maybe live
with that. 
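
A minimal sketch of that guard (made-up function name, not the actual libpq
code, and as written it is not protected against two threads racing on the
first call):

#ifdef WIN32
#include <winsock2.h>

static int wsa_initialized = 0;

/* Call at the top of every connection routine; starts Winsock only once. */
static int
ensure_winsock(void)
{
    WSADATA     wsaData;

    if (wsa_initialized)
        return 0;
    if (WSAStartup(MAKEWORD(1, 1), &wsaData) != 0)
        return -1;              /* caller reports the failure */
    wsa_initialized = 1;        /* intentionally never WSACleanup()'d */
    return 0;
}
#endif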

If not, perhaps we can have it call WSAStartup() everytime we connect to
a server, and then WSACleanup() when we shut down that connection with
PQfinish(). We're still going to leak if the user forgets to run
PQfinish(), but we're leaking other resources as well in that case. That
will break if any network related functions are called when there is no
connection open, but I doubt that's possible?

Of course, if we had a libpq_init() and a libpq_shutdown() function
things would be very easy, but we don't. And adding it just for this
purpose seems like trying too hard.


Yet another option would be to require that the client app deal with the
startup/shutdown code itself, but that will seriously break backwards
compatibility, so I don't think that's a good idea at all.


Other ideas?

//Magnus

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Gregory Stark
Jim Nasby [EMAIL PROTECTED] writes:

 One possibility would be to require 3 functions for a partitioned table: one
 accepts the partitioning key and tells you what partition  it's in, one that
 tells you what the minimum partitioning key for a  partition would be, and one
 that tells you what the maximum would be.  If the user supplied those 3
 functions, I think it would be possible to automatically generate code for
 the triggers and check constraints. The min/max partition key functions might
 allow you to more efficiently do partition elimination, too.

But then it would be harder to tell whether a clause implied a given
partition. That is, if you have a partition constraint of col OP const then
we can test whether a query clause of col OP2 const2 implies that constraint
when planning (or actually whether it implies it's false to exclude the
partition). If you have a constraint like P1(const) it'll be pretty hard to
do much with that.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Zeugswetter Andreas ADI SD
  The hard part there is checking that the partition constraints are 
  distinct. If the partition constraints are added one at a time, you 
  can use the predicate testing logic to compare the to-be-added 
  partition's constraint against each of the already added
constraints. 
  That becomes an O(N) problem.

Yes, we could pre-evaluate that check (per index) in the DDL phase and
keep the info in a flag.
Another interesting piece of information is whether there is an order the
partitions can be read in to satisfy a particular ORDER BY.

  What is really needed is a data structure that allows range
partitions 
  to be accessed more efficiently. This could make adding partitions
and 
  deciding in which partition a specific value goes an O(logN) 
  operation.

I do not really see a problem with O(N), since typical N currently ranges
from 10 to 200.
N = 1000 is already good for a >= 10 TB table.
If a 10 GB partition were too large, we should imho invest more in the
advanced indexing methods that are currently being developed.
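
For reference, the O(logN) lookup asked for above is essentially a sorted
array of range lower bounds plus a binary search; a minimal sketch
(hypothetical representation, not tied to any catalog format):

/*
 * bounds[] holds the inclusive lower bound of each range partition in
 * ascending order; partition i covers [bounds[i], bounds[i+1]).
 * Returns the partition index, or -1 if the key is below all ranges.
 */
static int
range_partition_for(const long *bounds, int nparts, long key)
{
    int     lo = 0, hi = nparts - 1, result = -1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (bounds[mid] <= key)
        {
            result = mid;       /* candidate; look for a tighter one */
            lo = mid + 1;
        }
        else
            hi = mid - 1;
    }
    return result;
}

Adding a partition is likewise just a binary search for the insertion point.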

 Directing data to child tables with triggers pretty much 
 necessitates having some way to codify what partition a 
 particular row belongs in.  
 IE: for partitioning by month, you'll see things like naming 
 the partition tables parent_table_name_$YEAR_$MONTH, so the 
 'partitioning function' takes a date or timestamp and then 
 returns what partition it belongs to. Perhaps there is some 
 way to use that mapping to drive the selection of what 
 partitions could contain a given value?

You put it in the first partition that has matching constraints.

 One possibility would be to require 3 functions for a partitioned
 table: one accepts the partitioning key and tells you what 
 partition it's in, one that tells you what the minimum 
 partitioning key for a partition would be, and one that tells 
 you what the maximum would be.  
 If the user supplied those 3 functions, I think it would be 
 possible to automatically generate code for the triggers and 
 check constraints. The min/max partition key functions might 
 allow you to more efficiently do partition elimination, too.

I can see this as a good optional addition, but it can only be optional,
else it would pretty much limit the methods that can be used for
partitioning; e.g. hash and modulo do not have a min/max per partition.

Andreas

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Andreas Pflug
Magnus Hagander wrote:

 The easy fix for this is to remove the calls. Which obviously will break
 some client apps. A fairly easy fix for the WSAStartup() call is to have
 a check in the connection functions against a global variable that will
 then make sure to call WSAStartup() the first time it's called.

 That would leave us leaking the WSAStartup() call, but only one per
 application. This is not perfect, but I'm thinking we can maybe live
 with that. 

 If not, perhaps we can have it call WSAStartup() everytime we connect to
 a server, and then WSACleanup() when we shut down that connection with
 PQfinish(). 

Taken from MSDN docs, this seems the recommended solution. After the
first WSAStartup call subsequent calls are cheap because they only
increment a counter.

Regards,
Andreas


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Magnus Hagander
On Thu, Mar 08, 2007 at 12:47:42PM +0100, Andreas Pflug wrote:
 Magnus Hagander wrote:
 
  The easy fix for this is to remove the calls. Which obviously will break
  some client apps. A fairly easy fix for the WSAStartup() call is to have
  a check in the connection functions against a global variable that will
  then make sure to call WSAStartup() the first time it's called.
 
  That would leave us leaking the WSAStartup() call, but only one per
  application. This is not perfect, but I'm thinking we can maybe live
  with that. 
 
  If not, perhaps we can have it call WSAStartup() everytime we connect to
  a server, and then WSACleanup() when we shut down that connection with
  PQfinish(). 
 
 Taken from MSDN docs, this seems the recommended solution. After the
 first WSAStartup call subsequent calls are cheap because they only
 increment a counter.

Now that I look closer at it, we *already* do WSAStartup() in
makeEmptyPGconn... And free it in  freePGconn().

So I suggest the following simple patch.. Any objections?

//Magnus
Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.344
diff -c -r1.344 fe-connect.c
*** fe-connect.c20 Feb 2007 15:20:51 -  1.344
--- fe-connect.c8 Mar 2007 12:16:42 -
***
*** 1840,1848 
  #ifdef WIN32

/*
!* Make sure socket support is up and running. Even though this is done 
in
!* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for
!* static builds at all, so we have to do it in the main code too.
 */
WSADATA wsaData;

--- 1840,1846 
  #ifdef WIN32

/*
!* Make sure socket support is up and running.
 */
WSADATA wsaData;

Index: libpqdll.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/libpqdll.c,v
retrieving revision 1.10
diff -c -r1.10 libpqdll.c
*** libpqdll.c  11 Mar 2006 04:38:40 -  1.10
--- libpqdll.c  8 Mar 2007 12:05:59 -
***
*** 15,31 
switch (fdwReason)
{
case DLL_PROCESS_ATTACH:
!   if (WSAStartup(MAKEWORD(1, 1), wsaData))
!   {
!   /*
!* No really good way to do error handling 
here, since we
!* don't know how we were loaded
!*/
!   return FALSE;
!   }
break;
case DLL_PROCESS_DETACH:
!   WSACleanup();
break;
}

--- 15,24 
switch (fdwReason)
{
case DLL_PROCESS_ATTACH:
!   /* We used to call WSAStartup() here, but this may 
cause deadlocks */
break;
case DLL_PROCESS_DETACH:
!   /* We used to call WSACleanup() here, but this may 
cause deadlocks */
break;
}


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-08 Thread Pavan Deolasee

Pavan Deolasee wrote:

 Thanks a lot, Tom. It seems to work fine for me. I will do some
 more tests and report if I see any issue.


The problem mentioned before is hard to reproduce with the
suggested change, but it's not completely gone. I have
seen it again on CVS HEAD with the patch applied.

I am facing another issue with VACUUM FULL. This
problem gets reproduced with HOT very easily; it takes a
few attempts to reproduce with CVS HEAD, but it
certainly exists.

This time I am using pgbench. All tables but history are
created with fillfactor=50

Now, I start running pgbench with a scale factor of 10 and 40
clients and 1 txns/client. After a few minutes, I start
running VACUUM FULL on tellers and branches every 10 seconds.
After a while, all pgbench clients fail with the following
errors:


Client 1 aborted in state 11: ERROR:  duplicate key violates unique constraint "branches_pkey"
Client 30 aborted in state 11: ERROR:  duplicate key violates unique constraint "branches_pkey"
Client 39 aborted in state 11: ERROR:  duplicate key violates unique constraint "branches_pkey"
Client 7 aborted in state 11: ERROR:  duplicate key violates unique constraint "branches_pkey"


Next run of VACUUM FULL gives the following error:

WARNING:  index "branches_pkey" contains 15 row versions, but table contains 12 row versions

HINT:  Rebuild the index with REINDEX.


Has this been reported earlier? IIRC Tom mentioned in one of the
emails that Merlin had reported some problem related
to duplicate key violation. Tom, could this be related?


Thanks,
Pavan

--


EnterpriseDB    http://www.enterprisedb.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Tom Lane
Magnus Hagander [EMAIL PROTECTED] writes:
 So I suggest the following simple patch.. Any objections?

One wonders if we need DllMain() at all any more.  We certainly don't
need that switch statement ...

Also, isn't the WSACleanup() in freePGconn in the wrong place?  Seems
like it shouldn't be done until after we've closed the socket.  I'd
be inclined to put it at the bottom of the routine.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Magnus Hagander
On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote:
 Magnus Hagander [EMAIL PROTECTED] writes:
  So I suggest the following simple patch.. Any objections?
 
 One wonders if we need DllMain() at all any more.  We certainly don't
 need that switch statement ...

Indeed. Looking even more into it (sheesh, I really didn't do my
homework here), libpqdll.c isn't even *compiled* on mingw. Or on the new
MSVC build. It's only compiled on the old msvc build. Given that, we can
probably just delete the file.


 Also, isn't the WSACleanup() in freePGconn in the wrong place?  Seems
 like it shouldn't be done until after we've closed the socket.  I'd
 be inclined to put it at the bottom of the routine.

Certainly looks wrong. It's interesting how this could have worked
*before*. That's a clear indication that it really doesn't appear to
matter much what we do here :S

The patch would then look something like this, and a remove of
libpqdll.c.

//Magnus

Index: bcc32.mak
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v
retrieving revision 1.26
diff -c -r1.26 bcc32.mak
*** bcc32.mak   11 Jan 2007 02:42:31 -  1.26
--- bcc32.mak   8 Mar 2007 15:23:17 -
***
*** 93,99 
[EMAIL PROTECTED] $(INTDIR)\fe-secure.obj
[EMAIL PROTECTED] $(INTDIR)\pqexpbuffer.obj
[EMAIL PROTECTED] $(INTDIR)\pqsignal.obj
-   [EMAIL PROTECTED] $(OUTDIR)\libpqdll.obj
[EMAIL PROTECTED] $(OUTDIR)\win32.obj
[EMAIL PROTECTED] $(INTDIR)\wchar.obj
[EMAIL PROTECTED] $(INTDIR)\encnames.obj
--- 93,98 
***
*** 155,168 

  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
- LINK32_OBJS= $(INTDIR)\libpqdll.obj

  # @ is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm

! $(OUTDIR)\blibpq.dll: $(OUTDIR)\blibpq.lib $(LINK32_OBJS) 
$(INTDIR)\libpq.res blibpqdll.def
$(LINK32) @
$(LINK32_FLAGS) +
!   c0d32.obj $(LINK32_OBJS), +
$@,, +
$(OUTDIR)\blibpq.lib import32.lib cw32mt.lib, +
blibpqdll.def,$(INTDIR)\libpq.res
--- 154,166 

  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v

  # @ is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm

! $(OUTDIR)\blibpq.dll: $(OUTDIR)\blibpq.lib $(INTDIR)\libpq.res 
blibpqdll.def
$(LINK32) @
$(LINK32_FLAGS) +
!   c0d32.obj , +
$@,, +
$(OUTDIR)\blibpq.lib import32.lib cw32mt.lib, +
blibpqdll.def,$(INTDIR)\libpq.res
Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.344
diff -c -r1.344 fe-connect.c
*** fe-connect.c20 Feb 2007 15:20:51 -  1.344
--- fe-connect.c8 Mar 2007 15:16:58 -
***
*** 1840,1848 
  #ifdef WIN32

/*
!* Make sure socket support is up and running. Even though this is done 
in
!* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for
!* static builds at all, so we have to do it in the main code too.
 */
WSADATA wsaData;

--- 1840,1846 
  #ifdef WIN32

/*
!* Make sure socket support is up and running.
 */
WSADATA wsaData;

***
*** 1918,1927 
PGnotify   *notify;
pgParameterStatus *pstatus;

- #ifdef WIN32
-   WSACleanup();
- #endif
-
if (!conn)
return;

--- 1916,1921 
***
*** 1986,1991 
--- 1980,1989 
termPQExpBuffer(conn-errorMessage);
termPQExpBuffer(conn-workBuffer);
free(conn);
+
+ #ifdef WIN32
+   WSACleanup();
+ #endif
  }

  /*
Index: win32.mak
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/win32.mak,v
retrieving revision 1.43
diff -c -r1.43 win32.mak
*** win32.mak   11 Jan 2007 02:42:31 -  1.43
--- win32.mak   8 Mar 2007 15:22:29 -
***
*** 63,69 
[EMAIL PROTECTED] $(INTDIR)\fe-secure.obj
[EMAIL PROTECTED] $(INTDIR)\pqexpbuffer.obj
[EMAIL PROTECTED] $(INTDIR)\pqsignal.obj
-   [EMAIL PROTECTED] $(OUTDIR)\libpqdll.obj
[EMAIL PROTECTED] $(OUTDIR)\win32.obj
[EMAIL PROTECTED] $(INTDIR)\wchar.obj
[EMAIL PROTECTED] $(INTDIR)\encnames.obj
--- 63,68 
***
*** 143,149 
   /pdb:$(OUTDIR)\libpqdll.pdb /machine:I386 
/out:$(OUTDIR)\$(OUTFILENAME).dll\
   /implib:$(OUTDIR)\$(OUTFILENAME)dll.lib  /def:$(OUTFILENAME)dll.def
  LINK32_OBJS= \
-   $(INTDIR)\libpqdll.obj \
$(OUTDIR)\$(OUTFILENAME).lib \
$(OUTDIR)\libpq.res

--- 142,147 
***
*** 159,165 
$(RSC) $(RSC_PROJ) libpq.rc


! $(OUTDIR)\$(OUTFILENAME).dll : $(OUTDIR) $(OUTDIR)\libpqdll.obj 

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-08 Thread Sherry Moore
Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM its still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the
  user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
  to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.
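
For readers unfamiliar with the NTA point above: non-temporal access means,
roughly, doing the copy with streaming stores that are hinted to bypass the
caches, so a large copy does not evict the working set. A userland
illustration using SSE2 intrinsics, offered only as an analogy and not as
Solaris's actual kernel code:

#include <emmintrin.h>          /* SSE2 intrinsics */
#include <stddef.h>

/*
 * Copy len bytes with non-temporal stores so the destination does not
 * displace useful data from the L2 cache.  Assumes dst is 16-byte
 * aligned and len is a multiple of 16; a real routine would handle the
 * unaligned head and tail with ordinary stores.
 */
static void
copy_nontemporal(void *dst, const void *src, size_t len)
{
    __m128i        *d = (__m128i *) dst;
    const __m128i  *s = (const __m128i *) src;

    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    _mm_sfence();               /* make the streaming stores visible */
}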

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [JDBC] [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-08 Thread Тимчишин Виталий


I am from pgsql-jdbc, so I may not have the whole thread; please ignore
any places where my misunderstanding shows.

The two main questions, IMHO, are:
1) What is the key to the plan cache? The current option is some statement
key (id). Another option would be the statement text (you still need to
store it if you want to replan at some point). In this case you can use the
same plan for multiple statements coming from different sessions. That's the
point Simon was talking about. This should significantly reduce planning,
especially with multiple similar clients. Now, as I understand it, every
connection prepares the same statements and plans them independently. Such a
change would make application servers prepare new connections much faster
(given that they prepare a number of identical statements for each
connection, which is the case for my engine). This should work for both
named and unnamed statements. Note that adding unnamed statements to the
cache (and not removing them on statement disposal) may require a much
larger cache.

BTW: This is used by IBM DB2 UDB.
2) Specific plans when parameters are known. This is the point about
using partial indexes (and sometimes even using a full index, i.e.
specifying a frequent value of some index, or one of two tables in a join).
I'd say the best approach would be to have a generic plan and try to replan,
starting from the generic plan's results (discarding any possibility that
gives values worse than the generic plan). Such a replan should be much
faster than the original planning because you have a rather high starting
point. Another option is to catch the possibilities at original planning
time and select the correct plan when the parameters are known - you check
all possible cases: this will be a frequent value, this will match this
partial index, ... The question is the number of such plans. But since all
of them must be better than the generic one (and it is possible to build a
tree, i.e. A and B are not frequent -> A is frequent -> A is frequent and B
meets the partial index, where children must be better than their parent),
I'd say there won't be many (and you can always limit their number and keep
only the best if it goes over the limit, or even collect usage statistics
and keep the plans that are used).
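
A toy illustration of point 1, keying the cache on the statement text itself
(made-up structures; the backend's real plan cache is different, and a shared
cache would of course need locking and eviction):

#include <stdlib.h>
#include <string.h>

typedef struct CachedPlan
{
    char               *sql;    /* normalized statement text, the key */
    void               *plan;   /* opaque generic plan */
    struct CachedPlan  *next;
} CachedPlan;

static CachedPlan *plan_cache = NULL;

/*
 * Return the shared plan for a statement, planning it only if no session
 * has prepared the same text before.  (Error handling omitted.)
 */
static void *
get_cached_plan(const char *sql, void *(*planner)(const char *))
{
    CachedPlan *p;

    for (p = plan_cache; p != NULL; p = p->next)
        if (strcmp(p->sql, sql) == 0)
            return p->plan;     /* some session already planned it */

    p = malloc(sizeof(CachedPlan));
    p->sql = strdup(sql);
    p->plan = planner(sql);     /* plan once, reuse from every session */
    p->next = plan_cache;
    plan_cache = p;
    return p->plan;
}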




---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[HACKERS] No buffer space available

2007-03-08 Thread Nik
I have an SQL file with a set of about 3000 insert statements. This
file is executed twice from a remote client machine running Windows
XP. The first time it is executed against a Windows 2003 Server
running PostgreSQL 8.0. The second time it is executed against a
Windows 2003 Server running PostgreSQL 8.2. This setup has been in
place for about a year now and never had any issues.

However, I noticed today that all the file executions fail with
[nativecode=could not send data to server: No buffer space available
(0x2747/10055)] error.

I attempted to run the same file using pgAdmin III version 1.6.2, but
every time the file is executed the only return in the messages pane is
the colon sign ':'. No data is inserted.

I attempted to run the same file using psql.exe and this worked
without any problems.

I also noticed that if I use fewer inserts, for example only a few
hundred, the file executes fine.

What could be the issue?

Thank you.

Nik


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Tom Lane
Magnus Hagander [EMAIL PROTECTED] writes:
 On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote:
 Also, isn't the WSACleanup() in freePGconn in the wrong place?  Seems
 like it shouldn't be done until after we've closed the socket.  I'd
 be inclined to put it at the bottom of the routine.

 Certainly looks wrong. It's interesting how this could have worked
 *before*.

Because the calls in DllMain covered us (ie, the WSA usage count never
got to be less than one).  If we remove them, we'd better get this pair
right.

One thing that bothers me a bit is that if we just move the call to the
bottom, then freePGconn won't do it at all if passed a NULL pointer.
Now (assuming a non-broken app) the only way that can happen is if
makeEmptyPGconn runs out of memory.  If the client gets back a null
pointer from a connection attempt, it's probably a 50-50 proposition
whether it will think it ought to do PQfinish with it.  So it'd be good
if we could keep the usage count straight either way.  I propose the
invariant "a WSA usage count is associated with a non-null PGconn
structure".  That would mean that doing WSACleanup only at the bottom
of freePGconn is correct, but also that makeEmptyPGconn needs to do
WSACleanup in its (two) failure paths.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Umar Farooq Minhas
Hi,

How can we accurately estimate the seq_page_fetch and random_page_fetch 
costs from outside Postgres using, for example, a C routine? Using a very 
simple program in C, I have two sets of files. I want to access one set 
randomly i.e. pulling data from random locations within the files. The second 
set of files is accessed sequentially. The goal here is to approximate the disk 
I/O cost for a random page fetch and a sequential page fetch respectively. 
I am using low-level(unbuffered) C routines i.e. read/write and lseek ( for 
positioning file pointer), the read/write buffer size is 8k (to match the size 
of postgres page), and Linux is the host OS. We all know that linux is a 
heavily cached OS; for that very reason I am using sets of files instead of a 
single file, in the hope that whenever a new file from a set of files is accessed 
for the first time, it will NOT be in the OS cache, thus giving accurate 
results of actually fetching the file pages from the physical disk. And also 
the host is restarted before running the experiment so as to force a cold-cache 
start.
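
One way to structure such a measurement (a rough, uncalibrated sketch; as the
replies point out, the data set must be much larger than RAM and the numbers
averaged over many runs):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PAGE 8192

/* Time 'count' single-page reads from fd, sequential or random. */
static double
time_reads(int fd, long npages, long count, int sequential)
{
    char            buf[PAGE];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < count; i++)
    {
        long    page = sequential ? i % npages : random() % npages;

        if (pread(fd, buf, PAGE, (off_t) page * PAGE) != PAGE)
        {
            perror("pread");
            exit(1);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <count>\n", argv[0]);
        return 1;
    }

    int     fd = open(argv[1], O_RDONLY);
    long    count = atol(argv[2]);
    long    npages = lseek(fd, 0, SEEK_END) / PAGE;
    double  seq = time_reads(fd, npages, count, 1);
    double  rnd = time_reads(fd, npages, count, 0);

    printf("seq %g s/page, random %g s/page, ratio %g\n",
           seq / count, rnd / count, rnd / seq);
    return 0;
}

The ratio of the two per-page times is roughly what random_page_cost
expresses relative to seq_page_cost.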

I am hoping somebody could point me in the right direction.

Thanks

-Umar

[HACKERS]

2007-03-08 Thread Abrie
I need to find out if anybody can give me some advice on incremental backups
and restores 

Reason: backup size 

 

abrie



Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Magnus Hagander
On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote:
 Magnus Hagander [EMAIL PROTECTED] writes:
  On Thu, Mar 08, 2007 at 10:10:28AM -0500, Tom Lane wrote:
  Also, isn't the WSACleanup() in freePGconn in the wrong place?  Seems
  like it shouldn't be done until after we've closed the socket.  I'd
  be inclined to put it at the bottom of the routine.
 
  Certainly looks wrong. It's interesting how this could have worked
  *before*.
 
 Because the calls in DllMain covered us (ie, the WSA usage count never
 got to be less than one).  If we remove them, we'd better get this pair
 right.

But those calls weren't even compiled in when building using mingw,
which is what the majority of our users have been using lately, I think.
(Since that's what we ship in the binary package)


 One thing that bothers me a bit is that if we just move the call to the
 bottom, then freePGconn won't do it at all if passed a NULL pointer.
 Now (assuming a non-broken app) the only way that can happen is if
 makeEmptyPGconn runs out of memory.  If the client gets back a null
 pointer from a connection attempt, it's probably a 50-50 proposition
 whether it will think it ought to do PQfinish with it.  So it'd be good
 if we could keep the usage count straight either way.  I propose the
 invariant "a WSA usage count is associated with a non-null PGconn
 structure".  That would mean that doing WSACleanup only at the bottom
 of freePGconn is correct, but also that makeEmptyPGconn needs to do
 WSACleanup in its (two) failure paths.

I'm honestly unsure whether we need to bother with it, but yeah, that
will likely be more correct. 
(Except one of the error paths in makeEmptyPGconn is already covered,
since it calls freePGconn, which does the WSACleanup)

//Magnus

Index: bcc32.mak
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/bcc32.mak,v
retrieving revision 1.26
diff -c -r1.26 bcc32.mak
*** bcc32.mak   11 Jan 2007 02:42:31 -  1.26
--- bcc32.mak   8 Mar 2007 15:23:17 -
***
*** 93,99 
[EMAIL PROTECTED] $(INTDIR)\fe-secure.obj
[EMAIL PROTECTED] $(INTDIR)\pqexpbuffer.obj
[EMAIL PROTECTED] $(INTDIR)\pqsignal.obj
-   [EMAIL PROTECTED] $(OUTDIR)\libpqdll.obj
[EMAIL PROTECTED] $(OUTDIR)\win32.obj
[EMAIL PROTECTED] $(INTDIR)\wchar.obj
[EMAIL PROTECTED] $(INTDIR)\encnames.obj
--- 93,98 
***
*** 155,168 
  
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
- LINK32_OBJS= $(INTDIR)\libpqdll.obj
  
  # @ is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
  
! $(OUTDIR)\blibpq.dll: $(OUTDIR)\blibpq.lib $(LINK32_OBJS) 
$(INTDIR)\libpq.res blibpqdll.def 
$(LINK32) @
$(LINK32_FLAGS) +
!   c0d32.obj $(LINK32_OBJS), +
$@,, +
$(OUTDIR)\blibpq.lib import32.lib cw32mt.lib, +
blibpqdll.def,$(INTDIR)\libpq.res
--- 154,166 
  
  LINK32=ilink32.exe
  LINK32_FLAGS = -Gn -L$(BCB)\lib;$(INTDIR); -x -Tpd -v
  
  # @ is a Response file, http://www.opussoftware.com/tutorial/TutMakefile.htm
  
! $(OUTDIR)\blibpq.dll: $(OUTDIR)\blibpq.lib $(INTDIR)\libpq.res 
blibpqdll.def 
$(LINK32) @
$(LINK32_FLAGS) +
!   c0d32.obj , +
$@,, +
$(OUTDIR)\blibpq.lib import32.lib cw32mt.lib, +
blibpqdll.def,$(INTDIR)\libpq.res
Index: fe-connect.c
===
RCS file: /projects/cvsroot/pgsql/src/interfaces/libpq/fe-connect.c,v
retrieving revision 1.344
diff -c -r1.344 fe-connect.c
*** fe-connect.c20 Feb 2007 15:20:51 -  1.344
--- fe-connect.c8 Mar 2007 15:42:59 -
***
*** 1840,1848 
  #ifdef WIN32
  
/*
!* Make sure socket support is up and running. Even though this is done 
in
!* libpqdll.c, that is only for MSVC and BCC builds and doesn't work for
!* static builds at all, so we have to do it in the main code too.
 */
WSADATA wsaData;
  
--- 1840,1846 
  #ifdef WIN32
  
/*
!* Make sure socket support is up and running.
 */
WSADATA wsaData;
  
***
*** 1853,1859 
--- 1851,1862 
  
conn = (PGconn *) malloc(sizeof(PGconn));
if (conn == NULL)
+   {
+ #ifdef WIN32
+   WSACleanup();
+ #endif
return conn;
+   }
  
/* Zero all pointers and booleans */
MemSet(conn, 0, sizeof(PGconn));
***
*** 1918,1927 
PGnotify   *notify;
pgParameterStatus *pstatus;
  
- #ifdef WIN32
-   WSACleanup();
- #endif
- 
if (!conn)
return;
  
--- 1921,1926 
***
*** 1986,1991 
--- 1985,1994 
termPQExpBuffer(conn-errorMessage);
termPQExpBuffer(conn-workBuffer);
free(conn);
+
+ #ifdef WIN32
+   WSACleanup();
+ #endif
  }


Re: [HACKERS] WSAStartup() in libpq

2007-03-08 Thread Tom Lane
Magnus Hagander [EMAIL PROTECTED] writes:
 On Thu, Mar 08, 2007 at 10:37:11AM -0500, Tom Lane wrote:
 Because the calls in DllMain covered us (ie, the WSA usage count never
 got to be less than one).  If we remove them, we'd better get this pair
 right.

 But those calls weren't even compiled in when building using mingw,

Hmm ... does make you wonder, doesn't it?  But anyway, if we're
bothering to call the functions at all, we should try to meet the
defined protocol.  So I like this latest version of the patch.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] No buffer space available

2007-03-08 Thread Andrew Dunstan

Nik wrote:

I have an SQL file with a set of about 3000 insert statements. This
file is executed twice from a remote client machine running Windows
XP. The first time it is executed against a Windows 2003 Server
running PostgreSQL 8.0. The second time it is executed against a
Windows 2003 Server running PostgreSQL 8.2. This setup has been in
place for about a year now and never had any issues.

However, I noticed today that all the file executions fail with
[nativecode=could not send data to server: No buffer space available
(0x2747/10055)] error.

I attempted to run the same file using pgAdmin III version 1.6.2, but
every time the file is executed the only return in the messages pane is
the colon sign ':'. No data is inserted.

I attempted to run the same file using psql.exe and this worked
without any problems.

I also noticed that if I use fewer inserts, for example only a few
hundred, the file executes fine.

What could be the issue?

  



First, this is the wrong place to ask this question. pgsql-hackers is 
about development of postgres itself, not about usage questions. Perhaps 
you should ask on pgsql-general.


Second, if you can run the file using psql, then it does not appear to 
be a problem with postgres at all, but rather with the client you are 
using (you don't say what that client is). Perhaps your client is trying 
a naive method of batching inserts and thus running out of buffer space 
- it's hard to tell since you haven't given much information.


In the case of pgAdminIII, you might need to ask on its lists.

cheers

andrew



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS]

2007-03-08 Thread Richard Huxton

Abrie wrote:

I need to find out if anybody can give me some advice on incremental backups
and restores 

Reason: backup size 


Abrie - this is the pgsql-hackers mailing list. It's for questions about 
the development of the PostgreSQL database system. You'll want the 
pgsql-general or pgsql-admin lists.


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Tom Lane
Umar Farooq Minhas [EMAIL PROTECTED] writes:
 How can we accurately estimate the seq_page_fetch and
 random_page_fetch costs from outside Postgres using, for example, a
 C routine?

Use a test case larger than memory.  Repeat many times to average out
noise.  IIRC, when I did the experiments that led to the current
random_page_cost of 4.0, it took about a week before I had numbers I
trusted.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 Umar Farooq Minhas [EMAIL PROTECTED] writes:
 How can we accurately estimate the seq_page_fetch and
 random_page_fetch costs from outside Postgres using, for example, a
 C routine?

 Use a test case larger than memory.  Repeat many times to average out
 noise.  IIRC, when I did the experiments that led to the current
 random_page_cost of 4.0, it took about a week before I had numbers I
 trusted.

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said, it might be useful as an additional
tool though.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-08 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:

  Is everybody OK with not putting a per-tablespace worker limit?
  Is everybody OK with putting per-database worker limits on a pg_database
  column?
 
 I don't think we need a new pg_database column.  If it's a GUC you can
 do ALTER DATABASE SET, no?  Or was that what you meant?

No, that doesn't work unless we save the datconfig column to the
pg_database flatfile, because it's the launcher (which is not connected) 
who needs to read it.  Same thing with a hypothetical per-database
naptime.  The launcher would also need to parse it, which is not ideal
(though not a dealbreaker either).

-- 
Alvaro Herrera      http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [PATCHES] pg_standby

2007-03-08 Thread Doug Knight
Hi Simon,
I would preserve the existing trigger function as 'little t' (-t), and
maybe implement a catch-up trigger function as 'big t' (-T)? Set it up so
that if the first attempt to find the WAL file postgres is currently
requesting succeeds, skip over the trigger check. If the first attempt
fails, then do your trigger check. That way, in the OCF script, the
postmaster can be started, the trigger file set, and connection to the
database looped on until it succeeds as an indication for when the
database is up and available. I think that's cleaner than comparing a
filename from a 'ps' command. Once I've completed the OCF script and
done some testing, I'll forward it to you for you to review and see if
you want to include it.
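
In rough C, the '-T' behaviour described above would boil down to something
like this (a made-up helper, not pg_standby's actual code):

#include <stdbool.h>
#include <unistd.h>

/*
 * One restore_command cycle under the proposed catch-up trigger: a WAL
 * file that is already available is always served, and only a missing
 * file lets the trigger end recovery.
 */
static bool
keep_restoring(const char *next_wal_path, const char *trigger_path)
{
    if (access(next_wal_path, R_OK) == 0)
        return true;            /* file is there: restore it, skip trigger */

    if (access(trigger_path, F_OK) == 0)
        return false;           /* nothing left and trigger set: stop */

    sleep(5);                   /* wait for the next file to arrive */
    return true;                /* caller retries */
}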

Thanks,
Doug

On Thu, 2007-03-08 at 15:37 +, Simon Riggs wrote:
 On Thu, 2007-03-08 at 10:33 -0500, Doug Knight wrote:
  Thanks, Simon. I kind of figured that's how pg_standby would work,
  since its invoked by postgres once per WAL file. What I was thinking I
  might do in the OCF script is to grab the pg_standby process line from
  a ps, pull out the current WAL file path and filename, then do an
  existence check for the file. If the file exists, then
  pg_standby/postgres is probably processing it. If not, then we're
  probably waiting on it, implying that recovery is complete. Thoughts
  on this process?
 
 I suppose I might be able to have the option to catch up before it
 stops, on the basis that if it can find the file it was looking for
 without waiting then that can override the trigger.
 
 Which way would you like it to work?
 


Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-08 Thread Gaetano Mendola
Tom Lane wrote:
 Gaetano Mendola [EMAIL PROTECTED] writes:
 [ 8.2 evaluates volatile functions in the targetlist of a view ]
 If I mark the function as STABLE or IMMUTABLE then even with version
 8.2 the function is not evaluated. Is this the intended behavior?
 
 Yes; people complained that we needed to be more careful about the
 number of times volatile functions get evaluated.

I suspect that the functions are evaluated also for records discarded due to
joins. Is that the case?

Like:

SELECT * FROM ta JOIN tb USING (id) where tb.foo = 4;

If ta is a view with some calculated fields, are the functions on ta
evaluated only for records matching the filters, or in some cases
(like a full scan on ta) also for the records discarded due to the join?


Regards
Gaetano Mendola

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [PATCHES] pg_standby

2007-03-08 Thread Simon Riggs
On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote:

 I would preserve the existing trigger function as little t -t, and
 maybe implement a catchup trigger function as big t -T? Set it up so
 that if the first attempt to find the WAL file postgres is currently
 requesting succeeds, skip over the trigger check. If the first attempt
 fails, then do your trigger check. That way, in the OCF script, the
 postmaster can be started, the trigger file set, and connection to the
 database looped on until it succeeds as an indication for when the
 database is up and available. I think that's cleaner than comparing a
 filename from a 'ps' command. Once I've completed the OCF script and
 done some testing, I'll forward it to you for you to review and see if
 you want to include it.

I'm happy to do this, unless other objections.

I'll be doing another version before feature freeze.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-08 Thread Alvaro Herrera
Galy Lee wrote:

Hi,

 Alvaro Herrera wrote:
  I still haven't received the magic bullet to solve the hot table
  problem, but these at least means we continue doing *something*.
 
 Can I ask what your plan or idea for autovacuum improvement is
 for 8.3 now? And also, what is the roadmap of autovacuum improvement for 8.4?

Things I want to do for 8.3:

- Make use of the launcher/worker stuff, that is, allow multiple
  autovacuum processes in parallel.  With luck we'll find out how to
  deal with hot tables.

Things I'm not sure we'll be able to have in 8.3, in which case I'll get
to them for early 8.4:

- The maintenance window stuff, i.e., being able to throttle workers
  depending on a user-defined schedule.

8.4 material:

- per-tablespace throttling, coordinating IO from multiple workers


I don't have anything else as detailed as a plan.  If you have
suggestions, I'm all ears.

Now regarding your restartable vacuum work.  I think that stopping a
vacuum at some point and being able to restart it later is very cool and
may get you some hot chicks, but I'm not sure it's really useful.  I
think it makes more sense to do something like throttling an ongoing
vacuum to a reduced IO rate, if the maintenance window closes.  So if
you're in the middle of a heap scan and the maintenance window closes,
you immediately stop the scan and go the the index cleanup phase, *at a
reduced IO rate*.  This way, the user will be able to get the benefits
of vacuuming at some not-too-distant future, without requiring the
maintenance window to open again, but without the heavy IO impact that
was allowed during the maintenance window.
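
The reduced-rate part is essentially what the cost-based vacuum delay
settings already do: charge each page access against a budget and sleep when
the budget runs out, with the maintenance window only changing the numbers.
A sketch (made-up names, not the actual GUC machinery):

#include <unistd.h>

static int  cost_so_far = 0;

/*
 * Charge 'cost' units for a page access; once the budget is exhausted,
 * sleep for delay_ms and start a new budget.  Shrinking the limit (or
 * growing the delay) when the window closes throttles the remaining
 * work instead of abandoning it.
 */
static void
charge_io_cost(int cost, int cost_limit, int delay_ms)
{
    cost_so_far += cost;
    if (cost_so_far >= cost_limit)
    {
        usleep((useconds_t) delay_ms * 1000);
        cost_so_far = 0;
    }
}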

Does this make sense?

-- 
Alvaro Herrera      http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Jim C. Nasby
On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote:
 I think it'd be great to make adding and removing partitions as
 simple as ALTER TABLE. I don't think that DELETE should be the
 mechanism to drop a partition, though. Again, DML statements
 shouldn't be performing DDL.
 
 
 Since partition is inheritance-based, a simple DROP or  NO INHERIT will do
 the job to deal with the partition. Do we want to reinvent additional syntax
 when these are around and are documented?

Well, if the syntax for adding a new partition eventually ends up as
ALTER TABLE ADD PARTITION, then it would make more sense that you remove
a partition via ALTER TABLE DROP PARTITION.
-- 
Jim Nasby       [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Jim C. Nasby
On Thu, Mar 08, 2007 at 10:12:30AM +, Gregory Stark wrote:
 Jim Nasby [EMAIL PROTECTED] writes:
 
  One possibility would be to require 3 functions for a partitioned table: one
  accepts the partitioning key and tells you what partition  it's in, one that
  tells you what the minimum partitioning key for a partition would be, and
  one that tells you what the maximum would be.  If the user supplied those 3
  functions, I think it would be possible to automatically generate code for
  the triggers and check constraints. The min/max partition key functions might
  allow you to more efficiently do partition elimination, too.
 
 But then it would be harder to tell whether a clause implied a given
 partition. That is, if you have a partition constraint of col OP const then
 we can test whether a query clause of col OP2 const2 implies that constraint
 when planning (or actually whether it implies it's false to exclude the
 partition). If you have a constraint like P1(const) it'll be pretty hard to
 do much with that.

Well, you could tell what partition 'const' was in; I would think that
plus knowledge about OP2 would allow you to decide what partitions you
need to look at.

There's also nothing to prevent us from also adding the constraints and
using constraint exclusion as well. In fact, I think we'd want to have
the constraints just so we know that a given partition only contains the
data we want it to.
-- 
Jim Nasby       [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Jim C. Nasby
One other thought... a lot of this discussion seems to parallel the
debate from a while ago about whether SERIAL should act like a macro
(ie: it just sets everything up and users are free to monkey under the
hood afterwards), or whether it should be it's own 'closed-box'
construct.

Currently, we seem to be leaning towards partition management being a
'macro', with child tables very exposed, etc. I don't know if that's
good or bad, but it's probably worth some thought.
-- 
Jim Nasby       [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-08 Thread Nikita

Few things from our side:

1. 'Skyline Of' is a new operator proposed in ICDE 2003, one of the topmost
conferences in data engineering. The skyline operation is a hot area of research
in query processing. Many people in the database community know about
this operator, and it is fast catching attention.

2. The skyline operation is very useful in data analysis. Suppose we
have a cricket database, and we want to find the bowlers who have taken
the maximum wickets in the minimum overs; we can issue an easy-to-write query
using the 'Skyline Of' syntax as follows:

Select * from Player_Match Skyline Of overs_bowled min, wickets_taken max;

This query gives 25 interesting tuples (result set) out of 24750 tuples in
0.0509 seconds. The same result is obtained in 0.8228 seconds if the
following equivalent nested-query is issued:

select * from Player_Match p1 where not exists ( select * from Player_Match
p2 where p2.overs_bowled <= p1.overs_bowled and p2.wickets_taken >=
p1.wickets_taken and (p2.overs_bowled < p1.overs_bowled or p2.wickets_taken >
p1.wickets_taken))

Note that the above time is the time elapsed between issuing a query and
obtaining the result set.
As can be seen, the above query is pretty cumbersome to write and is
inefficient too. So, which query will the user prefer? As the number of
dimensions increases, writing such a nested query becomes a tedious task.
Btw, how can such a query be written using aggregate function syntax?

3. As far as optimizing the Skyline is concerned, it is still a research
problem since it requires estimating the cardinality of the skyline result
set.

4. Until and unless this operator is implemented in a popular database
system, how can a user ever get to know about it and hence appreciate its
usefulness?

Btw, it was our B.Tech final year project, and not a term project :-)

Regards.

On 3/8/07, Tom Lane [EMAIL PROTECTED] wrote:


Shane Ambler [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Well, whether it's horrible or not is in the eye of the beholder, but
 this is certainly a non-standard syntax extension.

 Being non-standard should not be the only reason to reject a worthwhile
 feature.

No, but being non-standard is certainly an indicator that the feature
may not be of widespread interest --- if it were, the SQL committee
would've gotten around to including it; seems they've managed to include
everything but the kitchen sink already.  Add to that the complete lack
of any previous demand for the feature, and you have to wonder where the
market is.

 The fact that several
 different groups have been mentioned to be working on this feature would
 indicate that it is worth considering.

It looks to me more like someone published a paper that caught the
attention of a few profs looking for term projects for their students.

Now maybe it really is the best idea since sliced bread and will be seen
in the next SQL spec edition, but color me skeptical.  It seems to me
to be a very narrow-usage extension, as opposed to (eg) multi-input
aggregates or WITH/RECURSIVE, which provide general mechanisms applicable
to a multitude of problems.  Now even so it would be fine if the
implementation were similarly narrow in scope, but the published
description of the patch mentions a large chunk of additional executor
mechanisms.  If we're going to be adding as much code as that, I'd like
to see a wider scope of usage for it.

Basically, this patch isn't sounding like it has a reasonable
bang-to-the-buck ratio ...

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend





--
Pride sullies the noblest character


Re: [HACKERS] Stream bitmaps

2007-03-08 Thread Heikki Linnakangas

Hi Gavin,

Any progress?

Gavin Sherry wrote:

Heikki,

On Mon, 5 Mar 2007, Heikki Linnakangas wrote:


Hi all,

I'd like to see the indexam API changes needed by the bitmap indexam to
be committed soon. Has anyone looked at the proposed API in the latest
patch? Any thoughts?


Thanks for looking at it!


I'm quite happy with it myself, with a few reservations:

- All the getbitmap implementations except the new bitmap indexam are
just boilerplate. How about making getbitmap-function optional, and
having a generic implementation that fills in a hash bitmap using the
traditional getnext function?

- getbitmap is passed an existing bitmap as argument, and the
implementation needs to OR the existing bitmap with new tuples. How
about AND? An indexam could be smart about ANDing with an existing
bitmap, for example skipping to the first set bit in the existing bitmap
and starting the scan from there.

- I'd like to have support to return candidate matches with both
getbitmap and getnext. A simple flag per page of results would be enough
for getbitmap, I think.

- StreamBitmap and HashBitmap are separate node types, but OpStream is
not. The opaque field in the StreamBitmap struct is not really that opaque;
it needs to be a StreamNode. I drew a UML sketch of what I think the
class-hierarchy is
(http://community.enterprisedb.com/streambitmaps.png). This is
object-oriented programming, we're just implementing classes and
inheritance with structs and function pointers. The current patch mixes
different techniques, and that needs to be cleaned up.


All good ideas, I'll look at them more closely in the morning.


I'd like to see a separate patch that contains just the API changes.
Gavin, could you extract an API-only patch from the bitmap index patch?
I can work on it as well, but I don't want to step on your toes.


Okay, I can do that.

Thanks,

Gavin



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Luke Lonergan
Adding to this:

Ayush recently wrote a C program that emulates PG IO to do this analysis, and 
we came out with (predictably) a ratio of sequential/random of 20-50 (for a 
single user).  This is predictable because the random component is fixed at the 
access time of a single hard drive no matter how many disks are in an array, 
while the sequential scales nearly linearly with the number of drives in the 
array.

So, you can estimate random using 8-12ms per random access, and sequential as 
1/(number of disks X 60-130MB/s).
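
For illustration only (this is not Ayush's program), here is a minimal C sketch
that plugs numbers into the rule of thumb above. The 10 ms access time, 80 MB/s
per-drive throughput, 8 KB page size and 4-drive array are assumptions to be
replaced with real hardware figures, and measured ratios will differ from this
naive arithmetic.

#include <stdio.h>

int main(void)
{
    const double page_size     = 8192.0;   /* bytes per PostgreSQL page */
    const double random_access = 0.010;    /* 8-12 ms per random access; assume 10 ms */
    const double per_disk_bw   = 80e6;     /* 60-130 MB/s per drive; assume 80 MB/s */
    const int    ndisks        = 4;        /* drives in the array */

    /* sequential: 1 / (number of disks x per-drive bandwidth), scaled to one page */
    double seq_per_page = page_size / (ndisks * per_disk_bw);

    printf("estimated sequential fetch: %.6f s/page\n", seq_per_page);
    printf("estimated random fetch:     %.6f s/page\n", random_access);
    return 0;
}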

Ayush, can you forward your C program?

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Gregory Stark [mailto:[EMAIL PROTECTED]
Sent:   Thursday, March 08, 2007 12:37 PM Eastern Standard Time
To: Tom Lane
Cc: Umar Farooq Minhas; pgsql-hackers@postgresql.org
Subject:Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch


Tom Lane [EMAIL PROTECTED] writes:

 Umar Farooq Minhas [EMAIL PROTECTED] writes:
 How can we accurately estimate the seq_page_fetch and
 random_page_fetch costs from outside postgres using, for example, a
 C routine.

 Use a test case larger than memory.  Repeat many times to average out
 noise.  IIRC, when I did the experiments that led to the current
 random_page_cost of 4.0, it took about a week before I had numbers I
 trusted.

When I was running tests I did it on a filesystem where nothing else was
running. Between tests I unmounted and remounted it. As I understand it Linux
associates the cache with the filesystem and not the block device and discards
all pages from cache when the filesystem is unmounted.

That doesn't contradict anything Tom said, it might be useful as an additional
tool though.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq



Re: [HACKERS] [PATCHES] pg_standby

2007-03-08 Thread Doug Knight
Excellent. Once you're ready, fire it over and I'll test it on our
config.

Doug
On Thu, 2007-03-08 at 18:34 +, Simon Riggs wrote:
 On Thu, 2007-03-08 at 13:29 -0500, Doug Knight wrote:
 
  I would preserve the existing trigger function as little t -t, and
  maybe implement a catchup trigger function as big t -T? Set it up so
  that if the first attempt to find the WAL file postgres is currently
  requesting succeeds, skip over the trigger check. If the first attempt
  fails, then do your trigger check. That way, in the OCF script, the
  postmaster can be started, the trigger file set, and connection to the
  database looped on until it succeeds as an indication for when the
  database is up and available. I think that's cleaner than comparing a
  filename from a 'ps' command. Once I've completed the OCF script and
  done some testing, I'll forward it to you for you to review and see if
  you want to include it.
 
 I'm happy to do this, unless other objections.
 
 I'll be doing another version before feature freeze.
 


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Shane Ambler



Note to Nikhil: Make sure the new syntax doesn't prevent partitions from
being placed upon multiple tablespaces in some manner, at CREATE TABLE
time.


What if the syntax was something like -

CREATE TABLE tabname (
...
...
 ) PARTITION BY
 HASH(expr)
| RANGE(expr)
| LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now*/
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
 PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
 ...
];


And (if we use the ALTER TABLE to add partitions)

ALTER TABLE tabname
ADD PARTITION partition_name CHECK(...)
[USING TABLESPACE tblspcname];


Of course ALTER TABLE childtable SET TABLESPACE tblspcname; should not 
cause any probs.



--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Jim C. Nasby
On Thu, Mar 08, 2007 at 05:35:03PM +, Gregory Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  Umar Farooq Minhas [EMAIL PROTECTED] writes:
  How can we accurately estimate the seq_page_fetch and
  random_page_fetch costs from outside postgres using, for example, a
  C routine.
 
  Use a test case larger than memory.  Repeat many times to average out
  noise.  IIRC, when I did the experiments that led to the current
  random_page_cost of 4.0, it took about a week before I had numbers I
  trusted.
 
 When I was running tests I did it on a filesystem where nothing else was
 running. Between tests I unmounted and remounted it. As I understand it Linux
 associates the cache with the filesystem and not the block device and discards
 all pages from cache when the filesystem is unmounted.
 
 That doesn't contradict anything Tom said, it might be useful as an additional
 tool though.

Another trick I've used in the past is to just run the machine out of
memory, using the following:

/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for flushing
 * disk buffers. Takes the number of megabytes to allocate as its argument.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    /* request a zeroed block of argv[1] megabytes */
    if (!calloc(atoi(argv[1]), 1024*1024)) {
        printf("Error allocating memory.\n");
    }
    return 0;
}

I'll monitor top while that's running to ensure that some stuff gets
swapped out to disk. I believe this might still leave some cached data
in other areas of the kernel, but it's probably not enough to worry
about.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Acclerating INSERT/UPDATE using UPS

2007-03-08 Thread Hideyuki Kawashima

Simon,


Not checkpointing at all is not a good plan, since this will lead to an
enormous build up of WAL files and a very long recovery time if the
system does fail.


I appreciate your detailed comments.
Following your comments, I revised the program.
Sigres-0.1.3 now performs checkpoints.

In summary, the features of Sigres-0.1.3 are as follows.
0: 10% faster than conventional PostgreSQL under tmpfs.
1: Checkpoints are executed continually.
2: Sigres mode is on by default (the mode can be turned off via postgresql.conf).
3: issue_xlog_sync is called only by bgwriter (continually, via
CreateCheckPoint)
4: The entity of XLogWrite (_XLogWrite in my code) is called by both
backends and a bgwriter.
For each backend, _XLogWrite is called only via AdvanceXLInsertBuffer.
For a bgwriter, _XLogWrite is called via CreateCheckPoint.

Please try it if you have interest.
http://sourceforge.jp/projects/sigres/

Again, I really appreciate beneficial comments from this community !

Regards,

-- Hideyuki

--
Hideyuki Kawashima (Ph.D.)
University of Tsukuba
Assistant Professor

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread Shane Ambler

Jim C. Nasby wrote:

On Thu, Mar 08, 2007 at 01:52:27PM +0530, NikhilS wrote:

I think it'd be great to make adding and removing partitions as
simple as ALTER TABLE. I don't think that DELETE should be the
mechanism to drop a partition, though. Again, DML statements
shouldn't be performing DDL.


Since partitioning is inheritance-based, a simple DROP or NO INHERIT will do
the job of dealing with the partition. Do we want to reinvent additional syntax
when these are already around and documented?


Well, if the syntax for adding a new partition eventually ends up as
ALTER TABLE ADD PARTITION, then it would make more sense that you remove
a partition via ALTER TABLE DROP PARTITION.


This follows on from the suggestion I made - along the lines of the
subject, auto creation of partitions - where I suggested the syntax
of partition check(month of mydatecol) and having a new partition created
as data was entered. In that scenario, dropping a partition when it
became empty would complement the creation of a new partition as needed.


Given that there seems to be no real support for going with auto
maintenance, where new partitions are added as needed, the auto
dropping of empty partitions would also not apply.


That leaves us with only explicit add partition / drop partition commands,
and having the parent table pick up rows not matching any partition check
criteria.



--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Neil Conway
On Thu, 2007-03-08 at 17:35 +, Gregory Stark wrote:
 When I was running tests I did it on a filesystem where nothing else was
 running. Between tests I unmounted and remounted it. As I understand it Linux
 associates the cache with the filesystem and not the block device and discards
 all pages from cache when the filesystem is unmounted.

On recent Linux kernels, /proc/sys/vm/drop_caches can also be useful:

http://linux.inet.hr/proc_sys_vm_drop_caches.html

You could also use posix_fadvise() to achieve a similar effect on a per-file
basis.
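
A minimal sketch of the per-file approach, assuming a Linux-like system and
keeping error handling to a minimum:

/* Sketch: drop one file's pages from the OS page cache using posix_fadvise().
 * Assumes a Linux-like system. */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int fd;
    int rc;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s filename\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* flush any dirty pages first; DONTNEED only discards clean pages */
    fsync(fd);

    /* offset 0, length 0 means "the whole file" */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: error %d\n", rc);

    close(fd);
    return 0;
}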

-Neil



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Stream bitmaps

2007-03-08 Thread Gavin Sherry
On Thu, 8 Mar 2007, Heikki Linnakangas wrote:

 Hi Gavin,

 Any progress?


Really busy at the moment, but it's on my TODO list for today.

Thanks,

Gavin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] who gets paid for this

2007-03-08 Thread Christian Bird

Hi all,

I'm a grad student at UC Davis studying the postgres community and I
wanted to know if some on this list could help me out.  I'm studying
the factors that affect people graduating from being mailing list
participant to developers with write access to the repository.  Is it
possible to find out who is being employed to work on postgres and who
is doing it on their own time?  Some of my data points to there being
two ways that people make the jump.  More specifically, could those
who worked on apache as some aspect of their job prior to getting repo
access let me know?  Or if there are devs who know this information
about others, I'd be really appreciative to get it.  Thanks a lot.

-- Christian Bird

--
Christian Bird
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] who gets paid for this

2007-03-08 Thread Christian Bird

Hi all,

I'm a grad student at UC Davis studying the apache server community
and I wanted to know if some on this list could help me out.  I'm
studying the factors that affect people graduating from being
mailing list participant to developers with write access to the
repository.  Is it possible to find out who is being employed to work
on apache and who is doing it on their own time?  Some of my data
points to there being two ways that people make the jump.  More
specifically, could those who worked on postgres as some aspect of
their job prior to getting cvs access let me know?  Or if there are
devs who know this information about others, I'd be really
appreciative to get it.  Thanks a lot.

-- Christian Bird

--
Christian Bird
[EMAIL PROTECTED]



--
Christian Bird
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-08 Thread Umar Farooq Minhas
Thanks a lot for your replies. The suggestions have proved very useful.
Ayush, I'm curious to see your C program, thanks.

Here is a related but different issue. I started looking at the postgres 
optimizer/planner code a month back to modify it for the purposes of 
experiments that I need to conduct. The EXPLAIN command prints the total cost, 
i.e. both CPU + I/O; however, for my purposes I need these two costs to be 
separated, i.e. instead of getting one cost displayed, I want the cpu cost and 
the io cost displayed separately when I run EXPLAIN on a particular query. So 
far I haven't been able to figure out a 'clean' way of doing this. Can anyone 
tell me how much time I should expect to spend making such a change, and where 
should I start? costsize.c?
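
(For illustration, the kind of split being asked about might look like the
stand-alone sketch below; the SplitCost struct and the function are assumptions
for the example, not actual costsize.c code.)

/* Illustration only: tracking the I/O and CPU components of a
 * sequential-scan estimate separately instead of as one total. */
#include <stdio.h>

typedef double Cost;

typedef struct
{
    Cost io;    /* page-fetch component */
    Cost cpu;   /* per-tuple processing component */
} SplitCost;

static SplitCost sketch_cost_seqscan(double pages, double tuples,
                                     double seq_page_cost, double cpu_tuple_cost)
{
    SplitCost c;

    c.io  = seq_page_cost * pages;     /* disk side of the estimate */
    c.cpu = cpu_tuple_cost * tuples;   /* CPU side of the estimate */
    return c;                          /* EXPLAIN could then print both fields */
}

int main(void)
{
    SplitCost c = sketch_cost_seqscan(1000.0, 100000.0, 1.0, 0.01);

    printf("io cost = %.2f, cpu cost = %.2f, total = %.2f\n",
           c.io, c.cpu, c.io + c.cpu);
    return 0;
}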

I have another question. Looking at the optimizer code, it pretty much looks 
insensitive to the memory factor. The only parameters being utilized are 
effective_cache_size (in estimating index cost only) and work_mem (for sorts, 
aggregation, groups, and hash/merge joins). Are these the only memory 
factors that DIRECTLY affect the cost estimates of the planner/optimizer?

Again your help is appreciated.

-Umar
  - Original Message - 
  From: Luke Lonergan 
  To: Gregory Stark ; Tom Lane ; Ayush Parashar 
  Cc: Umar Farooq Minhas ; pgsql-hackers@postgresql.org 
  Sent: Thursday, March 08, 2007 2:16 PM
  Subject: Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch


  Adding to this:

  Ayush recently wrote a C program that emulates PG IO to do this analysis, and 
we came out with (predictably) a ratio of sequential/random of 20-50 (for a 
single user).  This is predictable because the random component is fixed at the 
access time of a single hard drive no matter how many disks are in an array, 
while the sequential scales nearly linearly with the number of drives in the 
array.

  So, you can estimate random using 8-12ms per random access, and sequential as 
1/(number of disks X 60-130MB/s).

  Ayush, can you forward your C program?

  - Luke

  Msg is shrt cuz m on ma treo

   -Original Message-
  From:   Gregory Stark [mailto:[EMAIL PROTECTED]
  Sent:   Thursday, March 08, 2007 12:37 PM Eastern Standard Time
  To: Tom Lane
  Cc: Umar Farooq Minhas; pgsql-hackers@postgresql.org
  Subject:Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch


  Tom Lane [EMAIL PROTECTED] writes:

   Umar Farooq Minhas [EMAIL PROTECTED] writes:
    How can we accurately estimate the seq_page_fetch and
    random_page_fetch costs from outside postgres using, for example, a
    C routine.
  
   Use a test case larger than memory.  Repeat many times to average out
   noise.  IIRC, when I did the experiments that led to the current
   random_page_cost of 4.0, it took about a week before I had numbers I
   trusted.

  When I was running tests I did it on a filesystem where nothing else was
  running. Between tests I unmounted and remounted it. As I understand it Linux
  associates the cache with the filesystem and not the block device and discards
  all pages from cache when the filesystem is unmounted.

  That doesn't contradict anything Tom said, it might be useful as an additional
  tool though.

  --
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com

  ---(end of broadcast)---
  TIP 3: Have you checked our extensive FAQ?

 http://www.postgresql.org/docs/faq




Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-08 Thread Galy Lee

Alvaro Herrera wrote:
 I don't have anything else as detailed as a plan.  If you have
 suggestions, I'm all ears.
Cool, thanks for the update. :) We also have some new ideas on the
improvement of autovacuum now. I will raise it up later.

 Now regarding your restartable vacuum work.  
 Does this make sense?
I also have reached a similar conclusion now.  Thank you.

Regards
Galy

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro
Jim Nasby [EMAIL PROTECTED] wrote:

  Also, my recommended bgwriter_lru_maxpages is average number of
  recycled buffers per cycle, that is hardly able to tune manually.
 
 What do you mean by 'number of recycled buffers per cycle'?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
| To reduce the probability that server processes will need to issue their
| own writes, the background writer tries to write buffers that are likely
| to be recycled soon.
| * bgwriter_lru_maxpages (integer)
| In each round, no more than this many buffers will be written as a
| result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of
StrategyGetBuffer() calls per round. I think that number is suitable for
bgwriter_lru_maxpages if we want the bgwriter to write almost all of the
dirty pages.

I proposed to change the semantics of bgwriter_lru_maxpages. It used to mean
the maximum number of writes per round; the new meaning is the number of
buffers reserved for recycling. Non-dirty unused buffers are counted toward it.


I'm measuring the difference in performance between manual and automatic
tuning, especially when adding some jobs before writes. I'll let you know
when I get the results.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro

Greg Smith [EMAIL PROTECTED] wrote:

  Also, my recommended bgwriter_lru_maxpages is average number of
  recycled buffers per cycle, that is hardly able to tune manually.
 
 This is completely dependent on what percentage of your buffer cache is 
 pinned.

Don't you mean usage_count? In my understanding, each backend pins two
or so buffers at once. So percentage of pinned buffers should be low.

 If your load is something like the standard pgbench, the LRU 
 writer will rarely find anything useful to write, so this entire line of 
 thinking won't work.  The proper behavior for heavily pinned data is to 
 turn off the LRU writer altogether so there's more time to run the all 
 scan.

I think you are pointing out a problem with the buffer management algorithm
itself, not only the bgwriter. Even if you turn off the LRU writer, each
backend pays the same cost to find recyclable buffers every time it
allocates a buffer.
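
To illustrate that per-allocation cost, here is a simplified, self-contained
sketch of a clock-sweep victim search; it is not the actual StrategyGetBuffer()
code, and the names and structure are deliberately reduced assumptions.

/* Simplified sketch of a clock-sweep victim search, illustrating the
 * cost each backend pays per buffer allocation. */
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    int refcount;      /* pinned if > 0 */
    int usage_count;   /* recently used if > 0 */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     next_victim = 0;

/* Sweep until an unpinned buffer with usage_count == 0 is found. */
static int clock_sweep(void)
{
    for (;;)
    {
        BufDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return (int) (buf - buffers);   /* found a victim */
            buf->usage_count--;                 /* give it one more lap */
        }
        /* pinned buffers are skipped entirely */
    }
}

int main(void)
{
    /* mark a few buffers pinned or recently used, then find a victim */
    buffers[0].refcount = 1;
    buffers[1].usage_count = 2;
    buffers[2].usage_count = 1;

    printf("victim buffer: %d\n", clock_sweep());
    return 0;
}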

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-08 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Now regarding your restartable vacuum work.  I think that stopping a
 vacuum at some point and being able to restart it later is very cool and
 may get you some hot chicks, but I'm not sure it's really useful.

Too true :-(

 I think it makes more sense to do something like throttling an ongoing
 vacuum to a reduced IO rate, if the maintenance window closes.  So if
 you're in the middle of a heap scan and the maintenance window closes,
 you immediately stop the scan and go the the index cleanup phase, *at a
 reduced IO rate*.

Er, why not just finish out the scan at the reduced I/O rate?  Any sort
of abort behavior is going to create net inefficiency, eg doing an
index scan to remove only a few tuples.  ISTM that the vacuum ought to
just continue along its existing path at a slower I/O rate.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread Greg Smith

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

In my understanding, each backend pins two or so buffers at once. So 
percentage of pinned buffers should be low.


With the pgbench workload, a substantial percentage of the buffer cache 
ends up pinned.  From staring at the buffer cache using 
contrib/pg_buffercache, I believe most of that consists of the index 
blocks for the records being updated in the accounts table.


I just posted a new version of the patch I asked for feedback on at the 
beginning of this thread, the latest one is at 
http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
I've been adjusting it to monitor the same data I think you need to refine 
your patch.  I believe the approach you're taking includes some 
assumptions that seem perfectly reasonable, but that my testing doesn't 
agree with.  There's nothing like measuring something to settle what's 
really going on, though, so that's what I've been focusing on.  I'd love 
to get some feedback on whether other people can replicate the kind of 
things I'm seeing.


The new code generates statistics about exactly what the background writer 
scan found during each round.  If there's been substantial write activity, 
it prints a log line when it recycles back to the beginning of the all 
scan, to help characterize what the buffer pool looked like during that 
scan from the perspective of the bgwriter.  Here's some sample log output 
from my underpowered laptop while running pgbench:


bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB 
(31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache 
was both pinned and dirty.  During that same time period, the LRU writer 
also wrote out a fair amount of data, operating on the 20% of the cache 
that was dirty but not pinned.  On my production server, where the 
background writer is turned way up to reduce checkpoint times, these 
numbers are even more extreme; almost everything that's dirty is also 
pinned during pgbench, and the LRU is lucky to find anything it can write 
as a result.


That patch is against the 8.2 codebase; now that I'm almost done I'm 
planning to move it to HEAD instead soon (where it will conflict 
considerably with your patch).  If you have an 8.2 configuration you can 
test with my patch applied, set log_min_messages = debug2, try it out, and 
see what you get when running pgbench for a while.  I think you'll 
discover some interesting and unexpected things.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] who gets paid for this

2007-03-08 Thread Josh Berkus
Christian,

More specifically, could those
 who worked on apache as some aspect of their job prior to getting repo
 access let me know?  Or if there are devs who know this information
 about others, I'd be really appreciative to get it. 

Hmmm.  Wrong project.  And I think you're making the (incorrect) assumption 
that granting commit rights works the same way in all projects.   It does 
not.

How about you call me and we can chat about how the PostgreSQL project 
actually works?  415-752-2500.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread NikhilS

Hi,

On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote:



 Note to Nikhil: Make sure the new syntax doesn't prevent partitions from
 being placed upon multiple tablespaces in some manner, at CREATE TABLE
 time.

What if the syntax was something like -

CREATE TABLE tabname (
 ...
 ...
  ) PARTITION BY
  HASH(expr)
| RANGE(expr)
| LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now*/
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
  PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
  ...
];


And (if we use the ALTER TABLE to add partitions)

ALTER TABLE tabname
ADD PARTITION partition_name CHECK(...)
[USING TABLESPACE tblspcname];



We could as well drop the USING part.

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-08 Thread NikhilS

Hi,



This follows on from the suggestion I made - along the lines of the
subject, auto creation of partitions - where I suggested the syntax
of partition check(month of mydatecol) and having a new partition created
as data was entered. In that scenario, dropping a partition when it
became empty would complement the creation of a new partition as needed.

Given that there seems to be no real support for going with auto
maintenance, where new partitions are added as needed, the auto
dropping of empty partitions would also not apply.

That leaves us with only explicit add partition / drop partition commands,
and having the parent table pick up rows not matching any partition check
criteria.



I was thinking along the lines of what Jim had suggested earlier regarding an
overflow partition. Instead of dumping unmatched rows into the master table,
we could put them into a default DUMP/DUMB partition.

Given that Simon wants to do away with having the master table APPENDed in
the planning phase, this would be better.

Regards,
Nikhils


--


Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz





--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] who gets paid for this

2007-03-08 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Christian,
 More specifically, could those
 who worked on apache as some aspect of their job prior to getting repo
 access let me know?  Or if there are devs who know this information
 about others, I'd be really appreciative to get it. 

 Hmmm.  Wrong project.  And I think you're making the (incorrect) assumption 
 that granting commit rights works the same way in all projects.   It does 
 not.

Even more to the point, "getting paid for" has almost nothing to do
with "has commit privileges".  At least on this project.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-08 Thread Galy Lee

Tom Lane wrote:
 Er, why not just finish out the scan at the reduced I/O rate?  Any sort

Sometimes you may need to vacuum a large table in the maintenance window and
a hot table during service time. If vacuuming the hot table does not eat too
much foreground resource, then you can vacuum the large table at a lower
I/O rate outside the maintenance window; but if vacuuming the hot table is
eating too much system resource, then the launcher had better stop the
long-running vacuum outside the maintenance window.

But I am not insisting on the stop-start feature at this moment.
Changing the cost delay dynamically sounds more reasonable. We can use
it to balance the total I/O of workers during service time or maintenance
time. It is not so difficult to achieve this by leveraging the shared
memory of autovacuum.

Best Regards
Galy Lee

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 With the pgbench workload, a substantial percentage of the buffer cache 
 ends up pinned.

[ raised eyebrow... ]  Prove that.  AFAIK it's impossible for the
pgbench queries to pin more than about three or four buffers per backend
concurrently.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro

Greg Smith [EMAIL PROTECTED] wrote:

  In my understanding, each backend pins two or so buffers at once. So 
  percentage of pinned buffers should be low.
 
 With the pgbench workload, a substantial percentage of the buffer cache 
 ends up pinned.

 http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
 bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB 
 (31.9%)
 ...
 checkpoint required (wrote checkpoint_segments)
 checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
 
 Here 69% of the buffer cache contained dirty data, and 49% of the cache 
 was both pinned and dirty.

No. Pinned means bufHdr->refcount > 0, and you don't distinguish pinned from
recently-used (bufHdr->usage_count > 0) buffers in your patch.

!   if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
    {
!       if (skip_pinned)
!       {
!           UnlockBufHdr(bufHdr);
!           return BUF_PINNED;
!       }
!       buffer_write_type = BUF_WRITTEN_AND_PINNED;


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Acclerating INSERT/UPDATE using UPS

2007-03-08 Thread August Zajonc
Joshua D. Drake wrote:
 From a deployable application perspective, this could be a big deal. We
 are already starting to see very large traction in the Win32 desktop app
 arena.
 

There seem to be a few overlapping proposals in terms of reducing
various guarantees in the name of performance.

As more and more options are added that affect integrity (fsync, full
page writes, commit nowait, sigres) it might be nice to outline and
compare the approaches, and particularly to describe clearly the failure
scenarios and how they are significantly different from one another.

One potentially needs to track an increasing number of settings that
reduce certain guarantees on data integrity, which is unpleasant.

If a setting is wrong on a performance knob, no problem, when there are
complaints things are slow you can go through and adjust them. The same
is not true of data consistency. When the complaint comes it is usually
too late to fiddle with knobs.

I'm just thinking some caution should be exercised in adding too many of
them in the first place. I happen to love COMMIT NOWAIT though; for
many, it replaces fsync=off.

- August

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org