Re: [HACKERS] Group Commit

2007-04-10 Thread Zeugswetter Andreas ADI SD

   I've been working on the patch to enhance our group commit
behavior. 
   The patch is a dirty hack at the moment, but I'm settled on the 
   algorithm I'm going to use and I know the issues involved.
  
  One question that just came to mind is whether Simon's
no-commit-wait 
  patch doesn't fundamentally alter the context of discussion for
this.
  Aside from the prospect that people won't really care about group 
  commit if they can just use the periodic-WAL-sync approach, ISTM
that 
  one way to get group commit is to just make everybody wait for the 
  dedicated WAL writer to write their commit record.

Yes good catch, I think we will want to merge the two.
But, you won't want to wait indefinitely, since imho the dedicated WAL
writer will primarily only want to write/flush full WAL pages. Maybe
flush half full WAL pages only after some longer timeout. But basically
this timeout should be longer than an individual backend is willing to
delay their commit.

   With a 
  sufficiently short delay between write/fsync attempts in the 
  background process, won't that net out at about the same place as a 
  complicated group-commit patch?

I don't think we want the delay so short, or we won't get any grouped
writes.

I think what we could do is wait up to commit_delay for the
dedicated WAL writer to do it's work. If it did'nt do it until timeout
let the backend do the flushing itself.

 I think the big question is whether commit_delay is ever going to be
generally useful.

It is designed to allow a higher transaction/second rate on a constantly
WAL bottlenecked system, so I think it still has a use case. I think you
should not compare it to no-commit-wait from the feature side (only
implementation).

Andreas

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Group Commit

2007-04-10 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I've been working on the patch to enhance our group commit behavior. The 
patch is a dirty hack at the moment, but I'm settled on the algorithm 
I'm going to use and I know the issues involved.


One question that just came to mind is whether Simon's no-commit-wait
patch doesn't fundamentally alter the context of discussion for this.
Aside from the prospect that people won't really care about group commit
if they can just use the periodic-WAL-sync approach, ISTM that one way
to get group commit is to just make everybody wait for the dedicated
WAL writer to write their commit record.  With a sufficiently short
delay between write/fsync attempts in the background process, won't
that net out at about the same place as a complicated group-commit
patch?


Possibly. To get efficient group commit there would need to be some kind 
of signaling between the WAL writer and normal backends. I think there 
is some in the patch, but I'm not sure if it gives efficient group 
commit. A constant delay will just give us something similar to 
commit_delay.


I've refrained from spending time on group commit until the 
commit-no-wait patch lands, because it's going to conflict anyway. I'm 
starting to feel we should not try to rush group commit into 8.3, unless 
it somehow falls out of the commit-no-wait patch by accident, given that 
we're past feature freeze and coming up with a proper group commit 
algorithm would need a lot of research and testing. Better do it for 8.4 
with more time, we've got enough features on plate for 8.3 anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Mark Kirkwood

Kris Kennaway wrote:

If so, then your task is the following:

Make SYSV semaphores less dumb about process wakeups.  Currently
whenever the semaphore state changes, all processes sleeping on the
semaphore are woken, even if we only have released enough resources
for one waiting process to claim.  i.e. there is a thundering herd
wakeup situation which destroys performance at high loads.  Fixing
this will involve replacing the wakeup() calls with appropriate
amounts of wakeup_one().


I'm forwarding this to the pgsql-hackers list so that folks more 
qualified than I can comment, but as I understand the way postgres 
implements locking each process has it *own* semaphore it waits on  - 
and who is waiting for what is controlled by an in (shared) memory hash 
of lock structs (access to these is controlled via platform Dependant 
spinlock code). So a given semaphore state change should only involve 
one process wakeup.


Cheers

Mark

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] \da doesn't show result type

2007-04-10 Thread Peter Eisentraut
Is there a reason that \da doesn't show the functions result type, like \df 
does?  I would find that information useful.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] select ('{}'::text[])[1] returns NULL -- is it correct?

2007-04-10 Thread Tom Lane
Nikolay Samokhvalov [EMAIL PROTECTED] writes:
 I remember several cases when people (e.g. me :-) ) were spending some
 time trying to find an error in some pl/pgsql function and the reason
 lied in incorrect work with arrays (i.e. messages like index is out
 of bounds and index cannot be negative number would help, surely).

Well, if indexes *couldn't* be negative numbers then that might be
helpful, but they can.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Idle idea for a feature

2007-04-10 Thread Tom Lane
psql's \d command tells you about outgoing foreign key constraints
(ie, ones referencing another table from this one).  It doesn't tell
you about incoming ones (ie, ones where another table references this
one).  ISTM it'd be a good idea if it did, as are there any incoming
foreign keys seems to be a question we constantly ask when solving
update-performance problems, and there isn't any easy way to check for
such.  I'm not real sure what the printout should look like, though.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Tom Lane
Mark Kirkwood [EMAIL PROTECTED] writes:
 Kris Kennaway wrote:
 If so, then your task is the following:
 
 Make SYSV semaphores less dumb about process wakeups.  Currently
 whenever the semaphore state changes, all processes sleeping on the
 semaphore are woken, even if we only have released enough resources
 for one waiting process to claim.  i.e. there is a thundering herd
 wakeup situation which destroys performance at high loads.  Fixing
 this will involve replacing the wakeup() calls with appropriate
 amounts of wakeup_one().

 I'm forwarding this to the pgsql-hackers list so that folks more 
 qualified than I can comment, but as I understand the way postgres 
 implements locking each process has it *own* semaphore it waits on  - 
 and who is waiting for what is controlled by an in (shared) memory hash 
 of lock structs (access to these is controlled via platform Dependant 
 spinlock code). So a given semaphore state change should only involve 
 one process wakeup.

Correct.  The behavior Kris describes is surely bad, but it's not
relevant to Postgres' usage of SysV semaphores.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] TOASTing smaller things

2007-04-10 Thread Luke Lonergan
Hi Bruce,

How about these:

- Allow specification of TOAST size threshold in bytes on a per column basis
- Enable storage of columns in separate TOAST tables
- Enable use of multi-row compression method(s) for TOAST tables

- Luke


On 3/26/07 5:39 PM, Bruce Momjian [EMAIL PROTECTED] wrote:

 Luke Lonergan wrote:
 I advocate the following:
 
 - Enable specification of TOAST policy on a per column basis
 
 As a first step, then:
 
 - Enable vertical partitioning of tables using per-column specification of
 storage policy.
 
 
 How are these different from ALTER TABLE SET STORAGE?  They need to be
 more specific.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] \da doesn't show result type

2007-04-10 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 Is there a reason that \da doesn't show the functions result type, like \df 
 does?  I would find that information useful.

+1, I've been annoyed by that too.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Idle idea for a feature

2007-04-10 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 one).  ISTM it'd be a good idea if it did, as are there any incoming
 foreign keys seems to be a question we constantly ask when solving
 update-performance problems, and there isn't any easy way to check for
 such.  I'm not real sure what the printout should look like, though.

Agreed.  Suggestion:

networx= \d wdm_networx.loc_base_clin
 Table wdm_networx.loc_base_clin
   Column|  Type   | Modifiers 
-+-+---
 btable_id   | integer | not null
 loc_base_id | integer | not null
 clin| integer | not null
Indexes:
loc_base_clin_pkey PRIMARY KEY, btree (btable_id, loc_base_id, clin)
btable_id_clin_unique UNIQUE, btree (btable_id, clin)
Foreign-key constraints:
loc_base_clin_btable_id_fkey FOREIGN KEY (btable_id) REFERENCES 
wdm_networx.btables_ref(btable_id)
loc_base_clin_clin_fkey FOREIGN KEY (clin) REFERENCES 
wdm_networx.clin(clin)
loc_base_clin_loc_base_id_fkey FOREIGN KEY (loc_base_id) REFERENCES 
wdm_networx.loc_base_dscr(loc_base_id)

networx= \d wdm_networx.loc_base_dscr
  Table wdm_networx.loc_base_dscr
 Column |  Type  | Modifiers 
++---
 loc_base_id| integer| not null
 description| character varying(254) | not null
 locations  | character varying(254) | 
 univ_mandatory | character varying(254) | 
 ent_mandatory  | character varying(254) | 
Indexes:
loc_base_dscr_pkey PRIMARY KEY, btree (loc_base_id)
Referenced by:
loc_base_clin_loc_base_id_fkey FOREIGN KEY (loc_base_id) BY 
wdm_networx.loc_base_clin(loc_base_id)
 /|\/|\
Referenced column(s) in *this* table |  |--- 
column(s) in referencing table

Just my 2c.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] \da doesn't show result type

2007-04-10 Thread Magnus Hagander
On Tue, Apr 10, 2007 at 10:32:38AM -0400, Tom Lane wrote:
 Peter Eisentraut [EMAIL PROTECTED] writes:
  Is there a reason that \da doesn't show the functions result type, like \df 
  does?  I would find that information useful.
 
 +1, I've been annoyed by that too.

Eh, but it does? As of
http://archives.postgresql.org/pgsql-committers/2007-03/msg00138.php.

Or am I missing what you're asking for here? :-)

//Magnus


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] \da doesn't show result type

2007-04-10 Thread Peter Eisentraut
Am Dienstag, 10. April 2007 17:12 schrieb Magnus Hagander:
   Is there a reason that \da doesn't show the functions result type, like
   \df does?  I would find that information useful.

 Eh, but it does? As of
 http://archives.postgresql.org/pgsql-committers/2007-03/msg00138.php.

Ah, OK, one feature in 8.3 that I can use. :)

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Group Commit

2007-04-10 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I've refrained from spending time on group commit until the 
 commit-no-wait patch lands, because it's going to conflict anyway. I'm 
 starting to feel we should not try to rush group commit into 8.3, unless 
 it somehow falls out of the commit-no-wait patch by accident, given that 
 we're past feature freeze and coming up with a proper group commit 
 algorithm would need a lot of research and testing. Better do it for 8.4 
 with more time, we've got enough features on plate for 8.3 anyway.

It's possible that it *would* fall out of commit-no-wait, if we are
alert to the possibility of shaking the tree in the right direction ;-)
Otherwise I agree with waiting till 8.4 to deal with it.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Peter Eisentraut
Am Dienstag, 10. April 2007 17:30 schrieb Neil Conway:
 On Tue, 2007-04-10 at 17:24 +0200, Peter Eisentraut wrote:
  The new uuid type is lacking documentation.

 We had also talked about including some UUID generation functionality in
 8.3, but it should be okay to leave that for 8.4.

The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a good random source, for instance.  I'm not 
sure how we wanted to solve that, but certainly leaving the uuid type with 
*no* method to generate one is pretty poor.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Joshua D. Drake

Peter Eisentraut wrote:

Am Dienstag, 10. April 2007 17:30 schrieb Neil Conway:

On Tue, 2007-04-10 at 17:24 +0200, Peter Eisentraut wrote:

The new uuid type is lacking documentation.

We had also talked about including some UUID generation functionality in
8.3, but it should be okay to leave that for 8.4.


The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a good random source, for instance.  I'm not 
sure how we wanted to solve that, but certainly leaving the uuid type with 
*no* method to generate one is pretty poor.


Actually, I would say that not having the ability to generate a UUID is 
just fine. Most languages, have the ability to generate them per their 
particular platforms. Let's leave it to them.


Sincerely,

Joshua D. Drake







--

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Tom Lane
Joshua D. Drake [EMAIL PROTECTED] writes:
 Peter Eisentraut wrote:
 The problem is that most of the standard methods are platform dependent, as 
 they require MAC addresses or a good random source, for instance.  I'm not
 sure how we wanted to solve that, but certainly leaving the uuid type with 
 *no* method to generate one is pretty poor.

 Actually, I would say that not having the ability to generate a UUID is 
 just fine. Most languages, have the ability to generate them per their 
 particular platforms. Let's leave it to them.

Let us *not* open that can of worms for 8.3.  This patch would not have
got in except that it didn't attempt to solve that problem, and there's
even less time available now.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-10 Thread Tom Lane
Koichi Suzuki [EMAIL PROTECTED] writes:
 My proposal is to remove unnecessary full page writes (they are needed 
 in crash recovery from inconsistent or partial writes) when we copy WAL 
 to archive log and rebuilt them as a dummy when we restore from archive 
 log.
 ...
 Benchmark: DBT-2
 Database size: 120WH (12.3GB)
 Total WAL size: 4.2GB (after 60min. run)
 Elapsed time:
cp:120.6sec
gzip:  590.0sec
pg_compresslog: 79.4sec
 Resultant archive log size:
cp: 4.2GB
gzip:   2.2GB
pg_compresslog: 0.3GB
 Resource consumption:
cp:   user:   0.5sec system: 15.8sec idle:  16.9sec I/O wait: 87.7sec
gzip: user: 286.2sec system:  8.6sec idle: 260.5sec I/O wait: 36.0sec
pg_compresslog:
  user:   7.9sec system:  5.5sec idle:  37.8sec I/O wait: 28.4sec

What checkpoint settings were used to make this comparison?  I'm
wondering whether much of the same benefit can't be bought at zero cost
by increasing the checkpoint interval, because that translates directly
to a reduction in the number of full-page images inserted into WAL.

Also, how much was the database run itself slowed down by the increased
volume of WAL (due to duplicated information)?  It seems rather
pointless to me to measure only the archiving effort without any
consideration for the impact on the database server proper.

regards, tom lane

PS: there's something fishy about the gzip numbers ... why all the idle
time?

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCHES] Fix mdsync never-ending loop problem

2007-04-10 Thread Tom Lane
I wrote:
 This patch looks fairly sane to me; I have a few small gripes about
 coding style but that can be fixed while applying.  Heikki, you were
 concerned about the cycle-ID idea; do you have any objection to this
 patch?

Actually, on second look I think the key idea here is Takahiro-san's
introduction of a cancellation flag in the hashtable entries, to
replace the cases where AbsorbFsyncRequests can try to delete entries.

What that means is mdsync() doesn't need an outer retry loop at all:
the periodic AbsorbFsyncRequests calls are not a hazard, and retry of
FileSync failures can be handled as an inner loop on the single failing
table entry.  (We can make the failure counter a local variable, too,
instead of needing space in every hashtable entry.)

And with that change, it's no longer possible for an incoming stream
of fsync requests to keep mdsync from terminating.  It might fsync
more than it really needs to, but it won't repeat itself, and it must
reach the end of the hashtable eventually.  So we don't actually need
the cycle counter at all.

It might be worth having the cycle counter anyway just to avoid doing
useless fsync work.  I'm not sure about this.  If we have a cycle
counter of say 32 bits, then it's theoretically possible for an fsync
to fail 2^32 consecutive times and then be skipped on the next try,
allowing a checkpoint to succeed that should not have.  We can fix that
with a few more lines of logic to detect a wrapped-around value, but is
it worth the trouble?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 Make SYSV semaphores less dumb about process wakeups.  Currently
 whenever the semaphore state changes, all processes sleeping on the
 semaphore are woken, even if we only have released enough resources
 for one waiting process to claim.

 Correct.  The behavior Kris describes is surely bad, but it's not
 relevant to Postgres' usage of SysV semaphores.

 Sorry, but the behaviour is real.

Oh, I'm sure the BSD kernel acts as you describe.  But Mark's point is
that Postgres never has more than one process waiting on any particular
SysV semaphore, and so the problem doesn't really affect us.

Or do you mean that the kernel wakes all processes sleeping on *any*
SysV semaphore?  That would be nasty :-(

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] [HACKERS] [Fwd: Index Advisor]

2007-04-10 Thread Gurjeet Singh

Hi Tom,

   The original patch was submitted by Kai Sattler, and we (at EDB) spent a
lot of time improving it, making it as seamless and as user-friendly as
possible. As is evident from the version number of the patch (v26), it has
gone through a lot of iterations, and was available to the community for
review and discussion (and discuss they did; they asked for a few things and
those were added/improved).

quote Bruce
I am thinking the API needs to be simpified, perhaps by removing the system
table and having the recommendations just logged to the server logs.
/quote

quote Kenneth
This means that this very useful information (in log files) would need to be
passed through an intermediary or another tool developed to allow access to
this information. I think that having this available from a table would be
very nice.
/quote

   In the initial submission, the feature was a big piece of code embedded
inside the backend. It required a system table, did not show the new plan,
actually created index physically before re planning, and could not advise
for a running application (everything had to be manually EXPLAINed).

   I read through the thread titled Index Tuning Features that first
discussed the idea of an Index adviser for PG, and this patch also meets
quite a few requirements raised there.

   Here are a few of the good things about this patch as of now:

.) Loadable plugin. Develop your own plugin to do nifty things with the plan
generated by the planner. Just as the debugger is implemented; if no
plugin... no work to do...

.) No syntax change. Run your queries as they are and get the advice in the
advise_index table (or set client_min_messages = LOG, to see the improved
plan on the screen also, if any).

.) Can recommend indexes even for the generated dynamic-queries, that are
hard to regenerate in a dry-run.

.) Can recommend indexes for SQL being executed through plpgsql (or any PL)
(again, hard to regenerate the parameterized queries by hand), and the the
advice is available in the advise_index table.

.) The adviser dumps it's advice in a table named advise_index. That can be
a user table, or a view with INSERT rule, or anything else; it should just
be an INSERTable object, accessible to the executing user (as opposed to a
system table required by the original implementation, and hence a need for
initdb).

.) No need to modify the application in any way; just set PGOPTIONS
environment variable properly before executing the appln., and run it as
usual... you have the advice generated for you.

.) No need for DBA (or the appln. writer) to feed anything to the planner in
any way; the process of recommendation is fully automated (this may change
if another plugin implimentation requires the stats in some user table).

.) Does recommend multi-column indexes. Does not make a set of each
fathomable combination of table columns to develop multi-column indexes
(hence avoiding a combinatorial explosion of time-space requirements); it
uses the columns used in the query to generate multi-column indexes.

.) The indexes are not created on disk; the index-tuple-size calculation
function does a very good job of estimating the size of the virtual index.

.) The changes to the catalog are just for the backend running under the
adviser, no one else can see those virtual indexes (as opposed to the
earlier implementation where the indexes were created on-disk, and available
to all the backends in the planning phase).

   So, with one hook (no GUC variables!), we get all these cool things. I
tried very hard to eliminate that one leftover kludge, but couldn't (we have
two options here, and they are enclosed in '#if GLOBAL_CAND_LIST ... #else'
parts of the code; left upto the committers to decide which one we need!).

   Another kludge that I had to add was the SPI_connect() and SPI_finish()
frame around the savepoint handling, since the RollbackToSavepoint in
xact.cassumes that only a PL/* module must be using the savepoint
infrastucture
(this was discussed on -hackers).

   The interface etc. may not be beautiful, but it isn't ugly either! It is
a lot better than manually creating pg_index records and inserting them into
cache; we use index_create() API to create the index (build is deferred),
and then 'rollback to savepoint' to undo those changes when the advisor is
done. index_create() causes pg_depends entries too, so a 'RB to SP' is far
much safer than going and deleting cache records manually.

   I hope you would agree that we need two passes of planner, one without
v-indexes and the other with v-indexes, for the backend to compare the
costs, and recommend indexes only if the second plan turned out to be
cheaper. If we implement the way you have suggested, then we will need one
hook at the end of get_relation_info(), one in EXPLAIN code, and yet
another, someplace after planner is finished, to do the comparison of the
two plans and recommend only those indexes that were considered to be useful
by the planner. 

Re: [HACKERS] [PATCHES] Fix mdsync never-ending loop problem

2007-04-10 Thread ITAGAKI Takahiro
(Sorry if you receive duplicate messages. I resend it since it was not
 delivered after a day.)


Here is another patch to fix never-ending loop in mdsync. I introduced
a mdsync counter (cycle id) and cancel flags to fix the problem.

The mdsync counter is incremented at the every beginning of mdsync().
Each pending entry has a field assigned from the counter when it is
newly inserted to pendingOpsTable. Only entries that have smaller counter
values than the mdsync counter are fsync-ed in mdsync().

Another change is to add a cancel flag in each pending entry. When a
relation is dropped and bgwriter receives a forget-request, the corresponding
entry is marked as dropped but we don't delete it at that time. Actual
deletion is performed in the next fsync loop. We don't have to retry after
AbsorbFsyncRequests() because entries are not removed outside of seqscan.

This patch can be applied to HEAD, 8.2 and 8.1 with a few hunks.


Tom Lane [EMAIL PROTECTED] wrote:
  In my understanding from the discussion, we'd better to take cycle ID
  approach instead of making a copy of pendingOpsTable, because duplicated
  table is hard to debug and requires us to pay attention not to leak 
  memories.
  I'll adopt the cycle ID approach and build LDC on it as a separate patch.
 
 Heikki made some reasonable arguments against the cycle-ID idea.  I'm
 not intending to insist on it ...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



fix_mdsync.patch
Description: Binary data

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] Fix mdsync never-ending loop problem

2007-04-10 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 Here is another patch to fix never-ending loop in mdsync. I introduced
 a mdsync counter (cycle id) and cancel flags to fix the problem.

 The mdsync counter is incremented at the every beginning of mdsync().
 Each pending entry has a field assigned from the counter when it is
 newly inserted to pendingOpsTable. Only entries that have smaller counter
 values than the mdsync counter are fsync-ed in mdsync().

 Another change is to add a cancel flag in each pending entry. When a
 relation is dropped and bgwriter receives a forget-request, the corresponding
 entry is marked as dropped but we don't delete it at that time. Actual
 deletion is performed in the next fsync loop. We don't have to retry after
 AbsorbFsyncRequests() because entries are not removed outside of seqscan.

This patch looks fairly sane to me; I have a few small gripes about
coding style but that can be fixed while applying.  Heikki, you were
concerned about the cycle-ID idea; do you have any objection to this
patch?

 This patch can be applied to HEAD, 8.2 and 8.1 with a few hunks.

I don't think we should back-patch something that's a performance fix
for an extreme case, especially not when it's not been through any
extensive testing yet ...

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 10:41:04PM +1200, Mark Kirkwood wrote:
 Kris Kennaway wrote:
 If so, then your task is the following:
 
 Make SYSV semaphores less dumb about process wakeups.  Currently
 whenever the semaphore state changes, all processes sleeping on the
 semaphore are woken, even if we only have released enough resources
 for one waiting process to claim.  i.e. there is a thundering herd
 wakeup situation which destroys performance at high loads.  Fixing
 this will involve replacing the wakeup() calls with appropriate
 amounts of wakeup_one().
 
 I'm forwarding this to the pgsql-hackers list so that folks more 
 qualified than I can comment, but as I understand the way postgres 
 implements locking each process has it *own* semaphore it waits on  - 
 and who is waiting for what is controlled by an in (shared) memory hash 
 of lock structs (access to these is controlled via platform Dependant 
 spinlock code). So a given semaphore state change should only involve 
 one process wakeup.

I have not studied the exact code path, but there are indeed multiple
wakeups happening from the semaphore code (as many as the number of
active postgresql processes).  It is easy to instrument
sleepq_broadcast() and log them when they happen.

Anyway mux@ fixed this some time ago, which indeed helped scaling for
traffic over a local domain socket (particularly at higher loads), but
I saw some anomalous results when using loopback TCP traffic.  I think
this is unrelated (in this situation TCP is highly contended, and it
is often the case that fixing one bottleneck can make a highly
contended situation perform worse, because you were effectively
serializing a bit before, and reducing the non-linear behaviour) but
am still investigating, so the patch has not yet been committed.

Kris


pgpDDvfRzeiGJ.pgp
Description: PGP signature


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 10:23:42AM -0400, Tom Lane wrote:
 Mark Kirkwood [EMAIL PROTECTED] writes:
  Kris Kennaway wrote:
  If so, then your task is the following:
  
  Make SYSV semaphores less dumb about process wakeups.  Currently
  whenever the semaphore state changes, all processes sleeping on the
  semaphore are woken, even if we only have released enough resources
  for one waiting process to claim.  i.e. there is a thundering herd
  wakeup situation which destroys performance at high loads.  Fixing
  this will involve replacing the wakeup() calls with appropriate
  amounts of wakeup_one().
 
  I'm forwarding this to the pgsql-hackers list so that folks more 
  qualified than I can comment, but as I understand the way postgres 
  implements locking each process has it *own* semaphore it waits on  - 
  and who is waiting for what is controlled by an in (shared) memory hash 
  of lock structs (access to these is controlled via platform Dependant 
  spinlock code). So a given semaphore state change should only involve 
  one process wakeup.
 
 Correct.  The behavior Kris describes is surely bad, but it's not
 relevant to Postgres' usage of SysV semaphores.

Sorry, but the behaviour is real.

Kris

pgphJTqz6La4j.pgp
Description: PGP signature


[HACKERS] prepared statements logging

2007-04-10 Thread marcofuics
Hi *
I am using the postgresql-8.2.3, with a jdbc-8.2-504 (the GeoNet
webServer tool...) My question is :
Is the PostGresDB server able to log the whole SELECT query?
{made by a prepared statement}
Looking at the log I can see only queries of the type:

(cut).

LOG:  execute unnamed: SELECT * FROM Metadata WHERE id=$1
DETAIL:  parameters: $1 = '28'
DEBUG:  parse unnamed: SELECT schemaId, createDate, lastChangeDate,
source, isTemplate, uuid FROM Metadata WHERE id = 28

(cut).

The strange is that my query has to search on a Test vector data and
in the whole log-file this word Test is not present!

My desire is to have a line where I can see:
SELECTbla..bla...bla   FROM   MetadataWHERE id = ...some
pattern   [maybe islike some pattern]

Thx in advance.


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] select ('{}'::text[])[1] returns NULL -- is it correct?

2007-04-10 Thread Nikolay Samokhvalov

On 4/9/07, Tom Lane [EMAIL PROTECTED] wrote:

Nikolay Samokhvalov [EMAIL PROTECTED] writes:
 As I can see here, when I ask for element that doesn't exist, the
 database returns NULL for me. Maybe it's well-known issue (and
 actually I understood this behaviour before), but strictly speaking it
 seems wrong for me: the database _knows_ that there is no element, so
 why NULL?
[...]

AFAIR it's always been like that, so changing it seems exceedingly
likely to break some peoples' applications.  It's not completely without
analogies in SQL, anyway: consider the behavior of INSERT when fewer
columns are provided than the table has.  Pretending that elements
outside the stored range of the array are null is not all that different
from silently adding nulls to a row-to-be-stored.


OK, I see.
But if I try to INSERT to column that doesn't exist in the table, I
have an error.
Why pg's arrays are designed so that postgres doesn't produce errors
for attempts to access nonexistent element of array? Why there is no
simple sanity check (SELECT (ARRAY[6,8])[-1] -- works w/o an error)? I
remember several cases when people (e.g. me :-) ) were spending some
time trying to find an error in some pl/pgsql function and the reason
lied in incorrect work with arrays (i.e. messages like index is out
of bounds and index cannot be negative number would help, surely).

--
Best regards,
Nikolay

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] select ('{}'::text[])[1] returns NULL -- is it correct?

2007-04-10 Thread Nikolay Samokhvalov

On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote:

Nikolay Samokhvalov [EMAIL PROTECTED] writes:
 I remember several cases when people (e.g. me :-) ) were spending some
 time trying to find an error in some pl/pgsql function and the reason
 lied in incorrect work with arrays (i.e. messages like index is out
 of bounds and index cannot be negative number would help, surely).

Well, if indexes *couldn't* be negative numbers then that might be
helpful, but they can.



Ooops :-) OK, my proposal is narrowing to very simple one: what about
triggering WARNINGs when user tries to access nonexistent element of
array?

--
Best regards,
Nikolay

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Neil Conway
On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
 The problem is that most of the standard methods are platform dependent, as 
 they require MAC addresses or a good random source, for instance.

http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

ISTM random() or similar sources is a sufficient PSRNG for the purposes
of UUID generation -- I can't see anything in the RFC that would
contradict that.

-Neil



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Andrew Dunstan

Neil Conway wrote:

On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
  
The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a good random source, for instance.



http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

ISTM random() or similar sources is a sufficient PSRNG for the purposes
of UUID generation -- I can't see anything in the RFC that would
contradict that.

  


How about we set up a contrib (I wish we'd fixed that) module with an 
example function or two?


cheers

andrew

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] select ('{}'::text[])[1] returns NULL -- is it correct?

2007-04-10 Thread Florian G. Pflug

Nikolay Samokhvalov wrote:

On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote:

Nikolay Samokhvalov [EMAIL PROTECTED] writes:
 I remember several cases when people (e.g. me :-) ) were spending some
 time trying to find an error in some pl/pgsql function and the reason
 lied in incorrect work with arrays (i.e. messages like index is out
 of bounds and index cannot be negative number would help, surely).

Well, if indexes *couldn't* be negative numbers then that might be
helpful, but they can.


Ooops :-) OK, my proposal is narrowing to very simple one: what about
triggering WARNINGs when user tries to access nonexistent element of
array?


Please don't ;-)
There are two sane options - return an error, or return NULL. Both are
sensible, and different programming languages make different choices.

The only reason for a WARNING would be a long-term plan to change the
existing behaviour. But this will cause lots of pain, for no real gain,
because no matter which behaviour you pick, there are always situations
where the other would be more convenient.

Just look at the mess PHP has created by altering fundamental aspects
of the language (4.4 - 5.0).

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 On Tue, Apr 10, 2007 at 02:46:56PM -0400, Tom Lane wrote:
 Oh, I'm sure the BSD kernel acts as you describe.  But Mark's point is
 that Postgres never has more than one process waiting on any particular
 SysV semaphore, and so the problem doesn't really affect us.

 To be clear, some behaviour that postgresql does with sysv semaphores
 causes wakeups of many processes at once.  i.e. if you have 20
 clients, you will get up to 20 wakeups.  I haven't studied the precise
 cause of this, but it is empirically true.  This is the scaling
 problem I described, and it's what mux's patch addresses.

[ shrug... ]  To the extent that that happens, it's Postgres' own issue,
and no amount of kernel rejiggering will change it.  But I certainly
have no objection to a patch that fixes the kernel behavior ...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Florian G. Pflug

Neil Conway wrote:

On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a good random source, for instance.


http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

ISTM random() or similar sources is a sufficient PSRNG for the purposes
of UUID generation -- I can't see anything in the RFC that would
contradict that.


Maybe a short-term solution could be a UUID-generated function that
takes some kind of seed as a parameter. People not concerned about
collisons could just pass some random value, while others could use
the mac-address of the client or something similar.

greetings, Florian Pflug


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Idle idea for a feature

2007-04-10 Thread Guillaume Smet

On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote:

ISTM it'd be a good idea if it did, as are there any incoming
foreign keys seems to be a question we constantly ask when solving
update-performance problems, and there isn't any easy way to check for
such.


Sure. We wrote a stored proc to do that and it could be really nice to
have it by defaut in the table definition.

Something like:
Incoming foreign keys:
  name_of_the_fkey (column of this table) REFERENCED BY
table(columns of the other table)
could be nice.
Not sure about the title of the section. Another problem is that this
syntax is not based on valid SQL syntax.

--
Guillaume

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
 The problem is that most of the standard methods are platform dependent, as 
 they require MAC addresses or a good random source, for instance.

 http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

 ISTM random() or similar sources is a sufficient PSRNG for the purposes
 of UUID generation -- I can't see anything in the RFC that would
 contradict that.

Doesn't seem like quite enough bits of uniqueness.

We could improve matters by incorporating the database's
pg_control.system_identifier into the UUID, substituting for the MAC
address we don't have a good way to get.  system_identifier is currently
determined by the system clock at initdb time (to gettimeofday
precision), so it would add at least some additional bits of
randomness...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 I have not studied the exact code path, but there are indeed multiple
 wakeups happening from the semaphore code (as many as the number of
 active postgresql processes).  It is easy to instrument
 sleepq_broadcast() and log them when they happen.

There are certainly cases where Postgres will wake up a number of
processes in quick succession, but that should happen from a separate
semop() kernel call, on a different semaphore, for each such process.
If there's really multiple processes being released by the same semop()
then there's a bug we need to look into (or maybe it's a kernel bug?).
Anyway I'd be interested to know what the test case is, and which PG
version you were testing.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Andrew - Supernews
On 2007-04-10, Tom Lane [EMAIL PROTECTED] wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
 I have not studied the exact code path, but there are indeed multiple
 wakeups happening from the semaphore code (as many as the number of
 active postgresql processes).  It is easy to instrument
 sleepq_broadcast() and log them when they happen.

 There are certainly cases where Postgres will wake up a number of
 processes in quick succession, but that should happen from a separate
 semop() kernel call, on a different semaphore, for each such process.
 If there's really multiple processes being released by the same semop()
 then there's a bug we need to look into (or maybe it's a kernel bug?).
 Anyway I'd be interested to know what the test case is, and which PG
 version you were testing.

This is a problem in FreeBSD, not specifically to do with postgres - the
granularity for SysV semaphore wakeups in FreeBSD-6.x and earlier is the
entire semaphore set, not just one specific semaphore within the set. I
explained that to Kris some weeks ago, and someone (mux) did a patch (to
FreeBSD, not pg) which was already mentioned in this discussion.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 On Tue, Apr 10, 2007 at 05:36:17PM -0400, Tom Lane wrote:
 Anyway I'd be interested to know what the test case is, and which PG
 version you were testing.

 I used 8.2 (and some older version when I first noticed it a year ago)
 and either sysbench or supersmack will show it - presumably anything
 that makes simultaneous queries.  Just instrument sleepq_broadcast()
 to e.g. log a KTR event when it wakes more than 1 process and you'll
 see it happening.

Sorry, I'm not much of a BSD kernel hacker ... but sleepq_broadcast
seems a rather generic name.  Is that called *only* from semop?
I'm wondering if you are seeing simultaneous wakeup from some other
cause --- sleep timeout being the obvious possibility.  We are aware
of behaviors (search the PG lists for context swap storm) where a
number of backends will all fail to get a spinlock and do short usleep
or select-timeout waits.  In this situation they'd all wake up at the
next scheduler clock tick ...

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] forwards:
 Yes but there are still a lot of wakeups to be avoided in the current
 System V semaphore code.  More specifically, not only do we wakeup all
 the processes waiting on a single semaphore everytime something changes,
 but we also wakeup all processes waiting on *any* of the semaphore in
 the semaphore *set*, whatever the reason we're sleeping.

O ... *that's* the problem.  Ugh.  Although we have a separate
semaphore for each PG backend, they're grouped into semaphore sets
(I think 16 active semaphores per set).  So a wakeup intended for one
process would uselessly send up to 15 others through the semop code.

The only thing we could do to fix that from our end would be to use
a smaller sema-set size on *BSD platforms.  Is the overhead per sema set
small enough to make this a sane thing to do?  Will we be likely to
run into system limits on the number of sets?

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Tom Lane
Maxime Henrion [EMAIL PROTECTED] writes:
 Thanks for forwarding my mail, Kris!  To Tom: if you can get my mails
 to reach pgsql-hackers@ somehow that would be just great :-).

They'll get approved eventually, just like mine to the BSD lists will
get approved eventually ;-)

 The only thing we could do to fix that from our end would be to use
 a smaller sema-set size on *BSD platforms.  Is the overhead per sema set
 small enough to make this a sane thing to do?  Will we be likely to
 run into system limits on the number of sets?

 I'm not familiar enough with the PostgreSQL code to know what impact
 such a change could have, but since the problem is clearly on our
 side here, I would advise against doing changes in PostgreSQL that
 are likely to complicate the code for little gain.  We still didn't
 even fully measure how much the useless wakups cost us since we're
 running into other contention problems with my patch that removes
 those.  And, as you point out, there are complications ensuing with
 respect to system limits (we already ask users to bump them when
 they install PostgreSQL).

OK, it was just an off-the-cuff idea.

 I think the high number of setproctitle() calls are more problematic
 to us at the moment, Kris can comment on that.

As of PG 8.2 it is possible to turn those off.  I don't think there's a
lot of enthusiasm for turning them off by default ... at least not yet.
But it might make sense to point out in the PG documentation that
update_process_title is particularly costly on platforms X, Y, and Z.
Do you know if this issue affects all the BSDen equally?

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Mark Kirkwood

Tom Lane wrote:




I think the high number of setproctitle() calls are more problematic
to us at the moment, Kris can comment on that.


As of PG 8.2 it is possible to turn those off.  I don't think there's a
lot of enthusiasm for turning them off by default ... at least not yet.
But it might make sense to point out in the PG documentation that
update_process_title is particularly costly on platforms X, Y, and Z.
Do you know if this issue affects all the BSDen equally?




Might be good to turn off by default for the 8.2+ Postgresql versions in 
the FreeBSD ports tree (looks like postgresql.conf.sample is being 
patched anyway, so pretty easy to amend).


Cheers

Mark


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-10 Thread Koichi Suzuki
Hi,

In the case below, we run DBT-2 benchmark for one hour to get the
measure.   Checkpoint occured three times (checkpoint interval was 20min).

For more information, when checkpoint interval is one hour, the amount
of the archived log size was as follows:
cp: 3.1GB
gzip:   1.5GB
pg_compresslog: 0.3GB

For both cases, database size was 12.7GB, relatively small.

As pointed out, if we don't run the checkpoint forever, the value for cp
will become close to that for pg_compresslog, but it is not practical.

The point here is, if we collect archive log with cp and the average
work load is a quarter of the full power, cp archiving will produce
about 0.8GB archive log per hour (for DBT-2 case, of course the size
depends on the nature of the transaction).   If we run the database
whole day, the amount of the archive log will be as large as database
itself.   After one week, archive log size gets seven times as large as
the database itself.   This is the point.   In production, such large
archive log will raise storage cost.   The purpose of the proposal is
not to improve the performance, but to decrease the size of archive log
to save necessary storage, preserving the same chance of recovery at the
crash recovery as full_page_writes=on.

Because of DBT-2 nature, it is not meaningful to compare the throuput
(databsae size determines the number of transactions to run).   Instead,
 I compared the throuput using pgbench.   These measures are: cp:
570tps, gzip:558tps, pg_compresslog: 574tps, negligible difference.

In terms of idle time for gzip and other command to archive WAL offline,
no difference in the environment was given other than the command to
archive.   My guess is because the user time is very large in gzip, it
has more chance for scheduler to give resource to other processes.   In
the case of cp, idle time is more than 30times longer than user time.
Pg_compresslog uses seven times longer idle time than user time.  On the
other hand, gzip uses less idle time than user time.   Considering the
total amount of user time, I think it's reasonable measure.

Again, in my proposal, it is not the issue to increase run time
performance.   Issue is to decrease the size of archive log to save the
storage.

Regards;


Tom Lane wrote:
 Koichi Suzuki [EMAIL PROTECTED] writes:
 My proposal is to remove unnecessary full page writes (they are needed 
 in crash recovery from inconsistent or partial writes) when we copy WAL 
 to archive log and rebuilt them as a dummy when we restore from archive 
 log.
 ...
 Benchmark: DBT-2
 Database size: 120WH (12.3GB)
 Total WAL size: 4.2GB (after 60min. run)
 Elapsed time:
cp:120.6sec
gzip:  590.0sec
pg_compresslog: 79.4sec
 Resultant archive log size:
cp: 4.2GB
gzip:   2.2GB
pg_compresslog: 0.3GB
 Resource consumption:
cp:   user:   0.5sec system: 15.8sec idle:  16.9sec I/O wait: 87.7sec
gzip: user: 286.2sec system:  8.6sec idle: 260.5sec I/O wait: 36.0sec
pg_compresslog:
  user:   7.9sec system:  5.5sec idle:  37.8sec I/O wait: 28.4sec
 
 What checkpoint settings were used to make this comparison?  I'm
 wondering whether much of the same benefit can't be bought at zero cost
 by increasing the checkpoint interval, because that translates directly
 to a reduction in the number of full-page images inserted into WAL.
 
 Also, how much was the database run itself slowed down by the increased
 volume of WAL (due to duplicated information)?  It seems rather
 pointless to me to measure only the archiving effort without any
 consideration for the impact on the database server proper.
 
   regards, tom lane
 
 PS: there's something fishy about the gzip numbers ... why all the idle
 time?
 
 ---(end of broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org
 


-- 
Koichi Suzuki

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-10 Thread Joshua D. Drake

 In terms of idle time for gzip and other command to archive WAL offline,
 no difference in the environment was given other than the command to
 archive.   My guess is because the user time is very large in gzip, it
 has more chance for scheduler to give resource to other processes.   In
 the case of cp, idle time is more than 30times longer than user time.
 Pg_compresslog uses seven times longer idle time than user time.  On the
 other hand, gzip uses less idle time than user time.   Considering the
 total amount of user time, I think it's reasonable measure.
 
 Again, in my proposal, it is not the issue to increase run time
 performance.   Issue is to decrease the size of archive log to save the
 storage.

Considering the relatively little amount of storage a transaction log
takes, it would seem to me that the performance angle is more appropriate.

Is it more efficient in other ways besides negligible tps? Possibly more
efficient memory usage? Better restore times for a crashed system?

Sincerely,

Joshua D. Drake

 
 Regards;
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 I think the high number of setproctitle() calls are more problematic
 to us at the moment, Kris can comment on that.

 Since we've basically had it handed to us that calling setproctitle()
 thousands of times per second is something that real applications now
 do, we're pretty much forced to work on making it cheaper.
 ...
 However this won't help all the existing systems out there (including
 other affected OSes), so it would be great if you guys could meet us
 half way and find a way to make postgresql rate-limit these calls by
 default to some suitable compromise rate, like once/second or
 whatever.

Well, the thing is, we've pretty much had it handed to us that
current-command indicators that aren't up to date are not very useful.
So rate-limited updates strike me as a useless compromise.  We have
the real solution (status advertised in PG's shared memory) already,
so the question in my mind is just how fast DBAs will wish to transition
to looking at select * from pg_stat_activity instead of looking at
ps auxww.

I don't see anything wrong at all with making update_process_title
default to off in BSD-specific packaging of Postgres.  It's a harder
sell to turn it off by default everywhere, because of all them Linux
users for whom that's just taking away a convenient status viewing
method.  I think we might get there eventually, but we need a decent
interval to wean people away from the old method.

[ Disclaimer: I work for Red Hat, so am unlikely to favor doing anything
that is a loss on Linux.  But I do use and like other platforms too;
just don't happen to have any BSD in-house currently, unless you're
willing to count Darwin as BSD. ]

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Roman Neuhauser
# [EMAIL PROTECTED] / 2007-04-10 15:49:08 -0400:
 Neil Conway wrote:
 On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
   
 The problem is that most of the standard methods are platform dependent, 
 as they require MAC addresses or a good random source, for instance.
 
 
 http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php
 
 ISTM random() or similar sources is a sufficient PSRNG for the purposes
 of UUID generation -- I can't see anything in the RFC that would
 contradict that.
 
   
 
 How about we set up a contrib (I wish we'd fixed that) module with an 
 example function or two?

Thought I'd mention Ralph Engelschall's uuid library, comes with
a PostgreSQL binding:

http://www.ossp.org/pkg/lib/uuid/
http://cvs.ossp.org/fileview?f=ossp-pkg/uuid/pgsql/uuid.txtv=1.1

I only played with it some time ago.

-- 
How many Vietnam vets does it take to screw in a light bulb?
You don't know, man.  You don't KNOW.
Cause you weren't THERE. http://bash.org/?255991

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] [HACKERS] Full page writes improvement, code update

2007-04-10 Thread Hannu Krosing
Ühel kenal päeval, T, 2007-04-10 kell 18:17, kirjutas Joshua D. Drake:
  In terms of idle time for gzip and other command to archive WAL offline,
  no difference in the environment was given other than the command to
  archive.   My guess is because the user time is very large in gzip, it
  has more chance for scheduler to give resource to other processes.   In
  the case of cp, idle time is more than 30times longer than user time.
  Pg_compresslog uses seven times longer idle time than user time.  On the
  other hand, gzip uses less idle time than user time.   Considering the
  total amount of user time, I think it's reasonable measure.
  
  Again, in my proposal, it is not the issue to increase run time
  performance.   Issue is to decrease the size of archive log to save the
  storage.
 
 Considering the relatively little amount of storage a transaction log
 takes, it would seem to me that the performance angle is more appropriate.

As I understand it it's not about transaction log but about write-ahead
log.

and the amount of data in WAL can become very important once you have to
keep standby servers in different physical locations (cities, countries
or continents) where channel throughput and cost comes into play.

With simple cp (scp/rsync) the amount of WAL data needing to be copied
is about 10x more than data collected by trigger based solutions
(Slony/pgQ). With pg_compresslog WAL-shipping seems to have roughly the
same amount and thus becomes a viable alternative again.

 Is it more efficient in other ways besides negligible tps? Possibly more
 efficient memory usage? Better restore times for a crashed system?

I think that TPS is more affected by number of writes than size of each
block written, so there is probably not that much to gain in TPS, except
perhaps from better disk cache usage. 

For me pg_compresslog seems to be a winner even if it just does not
degrade performance.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Idle idea for a feature

2007-04-10 Thread NikhilS

Hi,

On 4/11/07, Guillaume Smet [EMAIL PROTECTED] wrote:


On 4/10/07, Tom Lane [EMAIL PROTECTED] wrote:
 ISTM it'd be a good idea if it did, as are there any incoming
 foreign keys seems to be a question we constantly ask when solving
 update-performance problems, and there isn't any easy way to check for
 such.

On similar lines, maybe \d can also show the list of inheritors when

invoked on a parent.
e.g:
postgres=# \d parent
   Table public.parent
Column |  Type   | Modifiers
+-+---
a  | integer |
Indexes:
   parent_a_key UNIQUE, btree (a)
Inherited by: child

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 03:52:00PM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  On Tue, Apr 10, 2007 at 02:46:56PM -0400, Tom Lane wrote:
  Oh, I'm sure the BSD kernel acts as you describe.  But Mark's point is
  that Postgres never has more than one process waiting on any particular
  SysV semaphore, and so the problem doesn't really affect us.
 
  To be clear, some behaviour that postgresql does with sysv semaphores
  causes wakeups of many processes at once.  i.e. if you have 20
  clients, you will get up to 20 wakeups.  I haven't studied the precise
  cause of this, but it is empirically true.  This is the scaling
  problem I described, and it's what mux's patch addresses.
 
 [ shrug... ]  To the extent that that happens, it's Postgres' own issue,
 and no amount of kernel rejiggering will change it.  But I certainly
 have no objection to a patch that fixes the kernel behavior ...

As we've discussed before, by far the bigger issue with postgresql
performance on FreeBSD is the default setting of
update_process_titles=on.

Kris


pgpxNR2bN01jL.pgp
Description: PGP signature


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 08:23:36PM -0400, Tom Lane wrote:

  I think the high number of setproctitle() calls are more problematic
  to us at the moment, Kris can comment on that.
 
 As of PG 8.2 it is possible to turn those off.  I don't think there's a
 lot of enthusiasm for turning them off by default ... at least not yet.
 But it might make sense to point out in the PG documentation that
 update_process_title is particularly costly on platforms X, Y, and Z.
 Do you know if this issue affects all the BSDen equally?

It will likely affect them to some extent.  In fact the only platforms
it will not hurt on are those which have already jumped through
special hoops to make setproctitle() super-cheap.  I presume Linux is
in this category but don't know which others are, if any.

Since we've basically had it handed to us that calling setproctitle()
thousands of times per second is something that real applications now
do, we're pretty much forced to work on making it cheaper.  Hopefully
this is something that will be addressed over the next few months
(we're going to look at adding support for pages shared between libc
and kernel so this kind of thing can be done without requiring a
syscall).

However this won't help all the existing systems out there (including
other affected OSes), so it would be great if you guys could meet us
half way and find a way to make postgresql rate-limit these calls by
default to some suitable compromise rate, like once/second or
whatever.

Kris


pgpxMmw1cbEgN.pgp
Description: PGP signature


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 06:26:37PM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  On Tue, Apr 10, 2007 at 05:36:17PM -0400, Tom Lane wrote:
  Anyway I'd be interested to know what the test case is, and which PG
  version you were testing.
 
  I used 8.2 (and some older version when I first noticed it a year ago)
  and either sysbench or supersmack will show it - presumably anything
  that makes simultaneous queries.  Just instrument sleepq_broadcast()
  to e.g. log a KTR event when it wakes more than 1 process and you'll
  see it happening.
 
 Sorry, I'm not much of a BSD kernel hacker ... but sleepq_broadcast
 seems a rather generic name.  Is that called *only* from semop?

It's part of how wakeup() is implemented.

 I'm wondering if you are seeing simultaneous wakeup from some other
 cause --- sleep timeout being the obvious possibility.  We are aware
 of behaviors (search the PG lists for context swap storm) where a
 number of backends will all fail to get a spinlock and do short usleep
 or select-timeout waits.  In this situation they'd all wake up at the
 next scheduler clock tick ...

Nope, it's not this.

Kris


pgpa4cQe39p9O.pgp
Description: PGP signature


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Kris Kennaway
On Wed, Apr 11, 2007 at 12:50:06PM +1200, Mark Kirkwood wrote:
 Tom Lane wrote:
 
 
 I think the high number of setproctitle() calls are more problematic
 to us at the moment, Kris can comment on that.
 
 As of PG 8.2 it is possible to turn those off.  I don't think there's a
 lot of enthusiasm for turning them off by default ... at least not yet.
 But it might make sense to point out in the PG documentation that
 update_process_title is particularly costly on platforms X, Y, and Z.
 Do you know if this issue affects all the BSDen equally?
 
 
 
 Might be good to turn off by default for the 8.2+ Postgresql versions in 
 the FreeBSD ports tree (looks like postgresql.conf.sample is being 
 patched anyway, so pretty easy to amend).

Yeah, we might end up doing this, but I consider it a workaround.

Kris


pgpBiA1MKrVDJ.pgp
Description: PGP signature


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Kris Kennaway
On Wed, Apr 11, 2007 at 01:03:50AM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  I think the high number of setproctitle() calls are more problematic
  to us at the moment, Kris can comment on that.
 
  Since we've basically had it handed to us that calling setproctitle()
  thousands of times per second is something that real applications now
  do, we're pretty much forced to work on making it cheaper.
  ...
  However this won't help all the existing systems out there (including
  other affected OSes), so it would be great if you guys could meet us
  half way and find a way to make postgresql rate-limit these calls by
  default to some suitable compromise rate, like once/second or
  whatever.
 
 Well, the thing is, we've pretty much had it handed to us that
 current-command indicators that aren't up to date are not very useful.
 So rate-limited updates strike me as a useless compromise.  We have
 the real solution (status advertised in PG's shared memory) already,
 so the question in my mind is just how fast DBAs will wish to transition
 to looking at select * from pg_stat_activity instead of looking at
 ps auxww.

I don't get your argument - ps auxww is never going to be 100%
up-to-date because during the time the command is running the status
may change.  So we already know that stats being a fraction of a
second out of date are acceptable to users, because that's what may
happen when you run ps in the present model.  So you can use this to
get away with limiting updates to e.g. 10/second and in practise no
users will notice the difference.

Updating thousands of times a second just on the off chance that an
admin may one day run ps is completely inefficient (and has a huge
overhead on non-Linux systems, so it's demonstrably not a sensible way
to do things), and to the extent that there is a problem to be solved
it isn't even really solving it anyway.

If there really are users who find 10 proctitle updates/second an
unacceptably low update rate, then tune for the default case and
provide an option to allow them to override the rate limit to whatever
update rate they find appropriate.

Kris


pgpReJUB51aNd.pgp
Description: PGP signature


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 02:46:56PM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  Make SYSV semaphores less dumb about process wakeups.  Currently
  whenever the semaphore state changes, all processes sleeping on the
  semaphore are woken, even if we only have released enough resources
  for one waiting process to claim.
 
  Correct.  The behavior Kris describes is surely bad, but it's not
  relevant to Postgres' usage of SysV semaphores.
 
  Sorry, but the behaviour is real.
 
 Oh, I'm sure the BSD kernel acts as you describe.  But Mark's point is
 that Postgres never has more than one process waiting on any particular
 SysV semaphore, and so the problem doesn't really affect us.
 
 Or do you mean that the kernel wakes all processes sleeping on *any*
 SysV semaphore?  That would be nasty :-(

To be clear, some behaviour that postgresql does with sysv semaphores
causes wakeups of many processes at once.  i.e. if you have 20
clients, you will get up to 20 wakeups.  I haven't studied the precise
cause of this, but it is empirically true.  This is the scaling
problem I described, and it's what mux's patch addresses.

Kris


pgp00SdLk8acL.pgp
Description: PGP signature


Re: [HACKERS] Anyone interested in improving postgresql scaling?

2007-04-10 Thread Kris Kennaway
On Tue, Apr 10, 2007 at 05:36:17PM -0400, Tom Lane wrote:
 Kris Kennaway [EMAIL PROTECTED] writes:
  I have not studied the exact code path, but there are indeed multiple
  wakeups happening from the semaphore code (as many as the number of
  active postgresql processes).  It is easy to instrument
  sleepq_broadcast() and log them when they happen.
 
 There are certainly cases where Postgres will wake up a number of
 processes in quick succession, but that should happen from a separate
 semop() kernel call, on a different semaphore, for each such process.
 If there's really multiple processes being released by the same semop()
 then there's a bug we need to look into (or maybe it's a kernel bug?).
 Anyway I'd be interested to know what the test case is, and which PG
 version you were testing.

I used 8.2 (and some older version when I first noticed it a year ago)
and either sysbench or supersmack will show it - presumably anything
that makes simultaneous queries.  Just instrument sleepq_broadcast()
to e.g. log a KTR event when it wakes more than 1 process and you'll
see it happening.

Kris


pgptMLonITGtT.pgp
Description: PGP signature


Re: [HACKERS] [EMAIL PROTECTED]: Re: Anyone interested in improving postgresql scaling?]

2007-04-10 Thread Tom Lane
Kris Kennaway [EMAIL PROTECTED] writes:
 On Wed, Apr 11, 2007 at 01:03:50AM -0400, Tom Lane wrote:
 Well, the thing is, we've pretty much had it handed to us that
 current-command indicators that aren't up to date are not very useful.
 So rate-limited updates strike me as a useless compromise.

 I don't get your argument - ps auxww is never going to be 100%
 up-to-date because during the time the command is running the status
 may change.

Of course.  But we have already done the update-once-every-half-second
bit --- that was how pg_stat_activity used to work --- and our users
made clear that it's not good enough.  So I don't see us expending
significant effort to convert the setproctitle code path to that
approach.  The clear way of the future for expensive-setproctitle
platforms is just to turn it off entirely and rely on the new
pg_stat_activity implementation.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings