Re: [HACKERS] NOT NULL violation error handling in file_fdw

2012-03-13 Thread Shigeru HANADA
(2012/03/12 19:21), Etsuro Fujita wrote:
 According to the following documentation on IterateForeignScan() in
 50.2. Foreign Data Wrapper Callback Routines, I have created a patch to
 support the error handling in file_fdw.  Please find attached a patch.
 
  Note that PostgreSQL's executor doesn't care whether the rows
  returned violate the NOT NULL constraints which were defined
  on the foreign table columns - but the planner does care, and
  may optimize queries incorrectly if NULL values are present
  in a column declared not to contain them. If a NULL value is
  encountered when the user has declared that none should be
  present, it may be appropriate to raise an error (just as you
  would need to do in the case of a data type mismatch).

Interesting.  The patch applies cleanly, and it catches the first record
that violates a NOT NULL constraint.  I have some comments on the patch.

I worry about the performance degradation caused by checking NOT NULL
constraints for every row, though such overhead might be hidden by disk I/O.
Do you have any performance test results?  Users might want to disable NOT
NULL checking for already-validated files.

In addition to the performance issue, IMHO exporting
ExecBuildSlotValueDescription needs more consideration.  Have you
considered calling ExecConstraints instead of copying the NOT NULL check
code?  It requires a fully-built ResultRelInfo, and it also checks CHECK
constraints, which are not supported on foreign tables, but it seems to be
the standard way to apply constraints to a tuple.  If you don't want to
check CHECK constraints, another possible idea is to add a new exported
function ExecNotNull (or something), move the NOT NULL checking code out of
ExecConstraints into it, and call it from both fileIterateForeignScan and
ExecConstraints.
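
For illustration only, a rough sketch of what such an ExecNotNull-style
helper could look like; the name and the exact attribute access are
assumptions for discussion, not code from the patch, and it simply mirrors
the per-attribute NOT NULL loop that ExecConstraints already performs:

static void
ExecNotNull(Relation rel, TupleTableSlot *slot)
{
    TupleDesc   tupdesc = RelationGetDescr(rel);
    int         attnum;

    for (attnum = 1; attnum <= tupdesc->natts; attnum++)
    {
        Form_pg_attribute att = tupdesc->attrs[attnum - 1];

        /* error out on the first NULL found in a NOT NULL column */
        if (att->attnotnull && slot_attisnull(slot, attnum))
            ereport(ERROR,
                    (errcode(ERRCODE_NOT_NULL_VIOLATION),
                     errmsg("null value in column \"%s\" violates not-null constraint",
                            NameStr(att->attname))));
    }
}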

Anyway, please add this patch to Commit Fest App for tracking.
https://commitfest.postgresql.org/action/commitfest_view?id=14

-- 
Shigeru Hanada



Re: [HACKERS] SPGiST versus hot standby - question about conflict resolution rules

2012-03-13 Thread Simon Riggs
On Tue, Mar 13, 2012 at 2:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Info appreciated.

Email seen, will reply when I can later today.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] subselect in the column list

2012-03-13 Thread amit sehas
If we have a query of the form:

Select *, (Select * FROM T2 WHERE p2 = T1.p1) FROM T1 ORDER
BY 1 WHERE p3 = 75

In SQL, if a subselect is present in the column list, is the
result set of this subselect considered to be part of a
single tuple returned from the outer query, or does the result
set of the whole query look like a cross product of the results
of the outer and inner queries?

thanks
-Amit

PS: sorry i tried to send this email earlier and accidentally pressed send 
before completing the email ...




Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Daniel Farina
On Mon, Mar 12, 2012 at 8:10 PM, Bruce Momjian br...@momjian.us wrote:
 To answer your specific question, I think clearing the last analyzed
 fields should cause autovacuum to run on analyze those tables.  What I
 don't know is whether not clearing the last vacuum datetime will cause
 the table not to be analyzed.

Thank you very much for this reference.  I will look into it.

-- 
fdr



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Daniel Farina
On Mon, Mar 12, 2012 at 9:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Bruce Momjian br...@momjian.us writes:
 Copying the statistics from the old server is on the pg_upgrade TODO
 list.  I have avoided it because it will add an additional requirement
 that will make pg_upgrade more fragile in case of major version changes.

 Does anyone have a sense of how often we change the statistics data
 between major versions?

 I don't think pg_statistic is inherently any more stable than any other
 system catalog.

Agreed, but it would appear that in practice a fair amount of it
carries forward.  If someone ripped up the statistics system and redid
it in such a way that the old fields had no meaning for future costing
metrics, that'd probably be reasonable cause for a caveat involving a
full-blown re-analyze...still, that doesn't seem to happen every year.

 We've whacked it around significantly just last week,
 which might color my perception a bit, but there are other changes on
 the to-do list.  (For one example, see nearby complaints about
 estimating TOAST-related costs, which we could not fix without adding
 more stats data.)

Is accruing additional statistics likely to be a big problem?  I
noticed the addition of the new anyarray (presumably for
array-selectivity) features; would planning with an empty assumption
be disastrous vs. the old behavior, which had no concept of those at
all?

I don't think it's necessary to make statistics porting a feature of
pg_upgrade in all circumstances, but it would be nice when possible.
That having been said, perhaps there are other ways for pg_upgrade to
be better invested in, or, best of all and somewhat unrelatedly,
full-blown logical replication.

Although this conversation has focused on how to move stats forward,
I am about as interested in how to run statements (like ANALYZE) more
'nicely'.  The same general problem pervades many background-task
issues, including autovacuum and large physical reorganizations of
data.

-- 
fdr



Re: [HACKERS] initdb and fsync

2012-03-13 Thread Andres Freund
On Tuesday, March 13, 2012 04:49:40 AM Jeff Davis wrote:
 On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote:
  I meant primarily to illustrate the need to be comprehensive, not comment
  on which executable should fsync a particular file.  Bootstrap-mode
  backends do not sync anything during an initdb run on my system.  With
  your patch, we'll fsync a small handful of files and leave nearly
  everything else vulnerable.
 
 Thank you for pointing that out. With that in mind, I have a new version
 of the patch which just recursively fsync's the whole directory
 (attached).
 
 I also introduced a new option --nosync (-N) to disable this behavior.
 
 The bad news is that it introduces a lot more time to initdb -- it goes
 from about 1s to about 10s on my machine. I tried fsync'ing the whole
 directory twice just to make sure that the second was a no-op, and
 indeed it didn't make much difference (still about 10s). 
I suggest you try making it two loops:

for recursively everything in dir:
   posix_fadvise(fd, POSIX_FADV_DONTNEED);

for recursively everything in dir:
   fsync(fd);

In my experience that gives far better performance, because it does not force
its own metadata/journal commit/transaction for every file but lets them be
batched.  copydir() has done the same for a few releases now...

Obviously it's not that nice to use _DONTNEED, but I haven't found anything
that works equally well.  You could try sync_file_range(fd, 0, 0,
SYNC_FILE_RANGE_WRITE) in the first loop, but my experience with that hasn't
been that good.
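
For concreteness, here is a minimal sketch of the two-pass idea for a single
file (error handling elided; the caller is assumed to walk the directory tree
twice, calling pre_sync_fname() on every file in the first pass and
fsync_fname() in the second):

#include <fcntl.h>
#include <unistd.h>

/* pass 1: nudge the kernel into starting writeback now, without waiting */
static void
pre_sync_fname(const char *fname)
{
    int     fd = open(fname, O_RDONLY);

    if (fd < 0)
        return;
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}

/* pass 2: by now most data is already on disk, so the fsyncs are cheaper */
static void
fsync_fname(const char *fname)
{
    int     fd = open(fname, O_RDONLY);

    if (fd < 0)
        return;
    (void) fsync(fd);
    close(fd);
}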

Andres



Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Andres Freund
On Tuesday, March 13, 2012 03:26:34 AM Robert Haas wrote:
 Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7
 machine.  32 clients, 1800 seconds, scale factor 300, synchronous
 commit off.
That graph makes me cringe because it's pretty representative of what I have
seen in practice.  Any chance we can get a ~10s average in there?  I find it
pretty hard to read anything from it other than 'spiky'.

What's the filesystem + mount options on this?

Andres



Re: [HACKERS] NOT NULL violation error handling in file_fdw

2012-03-13 Thread Etsuro Fujita
(2012/03/13 15:53), Shigeru HANADA wrote:
 (2012/03/12 19:21), Etsuro Fujita wrote:
 According to the following documentation on IterateForeignScan() in
 50.2. Foreign Data Wrapper Callback Routines, I have created a patch to
 support the error handling in file_fdw.  Please find attached a patch.

 Interesting.  The patch applies cleanly, and it catches the first record
 that violates a NOT NULL constraint.  I have some comments on the patch.

Thank you for the review.

 I worry about the performance degradation caused by checking NOT NULL
 constraints for every row, though such overhead might be hidden by disk I/O.
 Do you have any performance test results?  Users might want to disable NOT
 NULL checking for already-validated files.

I don't have any numbers for now.  OK I'll check it.

 In addition to the performance issue, IMHO exporting
 ExecBuildSlotValueDescription needs more consideration.  Have you
 considered calling ExecConstraints instead of copying the NOT NULL check
 code?  It requires a fully-built ResultRelInfo, and it also checks CHECK
 constraints, which are not supported on foreign tables, but it seems to be
 the standard way to apply constraints to a tuple.

Yes, I considered using ExecConstraints(), but I feel that it is
overkill.

 If you don't want to check CHECK constraints, another possible idea is to
 add a new exported function ExecNotNull (or something), move the NOT NULL
 checking code out of ExecConstraints into it, and call it from both
 fileIterateForeignScan and ExecConstraints.

I think that is a good idea.  I'll do it in the next version of the
patch.

 Anyway, please add this patch to Commit Fest App for tracking.
 https://commitfest.postgresql.org/action/commitfest_view?id=14

Done.

Best regards,
Etsuro Fujita



Re: [HACKERS] Command Triggers, patch v11

2012-03-13 Thread Andres Freund
Hi,

I did a short review of what I found after merging master
(b4af1c25bbc636379efc5d2ffb9d420765705b8a) into what I currently fetched from
your repo (d63df64580114de4d83cfe8eb45eb630724b8b6f).

- I still find it strange not to fire on cascading actions
- I dislike the missing locking leading to strange errors upon concurrent
changes.  But then that's just about how all the rest of commands/* handles
it...
 T1:
 BEGIN;
 ALTER COMMAND TRIGGER cmd_before SET DISABLE;

 T2:
 BEGIN;
 ALTER COMMAND TRIGGER cmd_before SET DISABLE;

 T1:
 COMMIT;

 T2:
 ERROR:  tuple concurrently updated

- I think list_command_triggers should do a heap_lock_tuple(LockTupleShared)
 on the command trigger tuple. But then again just about nothing else does :(

- ExecBeforeOrInsteadOfCommandTriggers is referenced in 
exec_command_triggers_internal comments

- InitCommandContext comments are outdated

- generally comments look a bit outdated

- shouldn't the command trigger stuff for ALTER TABLE be done inside
AlterTable instead of utility.c?

- you have repetitions of the following pattern:
void
ExecBeforeCommandTriggers(CommandContext cmd)
{
    /* that will execute under command trigger memory context */
    if (cmd != NULL && cmd->before != NIL)
        exec_command_triggers_internal(cmd, cmd->before, BEFORE);

    /* switch back to the command Memory Context now */
    MemoryContextSwitchTo(cmd->oldmctx);
}

1. Either cmd != NULL does not need to be checked or you need to check it
before the MemoryContextSwitchTo (see the sketch after this list)
2. the switch to cmd->oldmctx made me very wary at first because I wasn't
sure it's guaranteed to be non-NULL

- why is there a special CommandTriggerContext if it's not reset separately?
Should it be reset?  I have to say that I dislike the API around this.

- an AFTER .. ALTER AGGREGATE ... SET SCHEMA has the wrong schema.  Probably
the same problem exists elsewhere.  Or is that as designed?  It would be
inconsistent with the way object names are handled.

- what does that mean?
+   cmd.objectname = NULL;  /* composite object name */

- DropPropertyStmt seems to be an unused leftover?
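
As an illustration of point 1 above, one possible reshaping of the quoted
pattern; this is only a sketch and assumes that a NULL cmd implies no prior
memory context switch, which the patch would need to confirm:

void
ExecBeforeCommandTriggers(CommandContext cmd)
{
    if (cmd == NULL)
        return;                 /* nothing was set up, nothing to restore */

    /* this runs under the command trigger memory context */
    if (cmd->before != NIL)
        exec_command_triggers_internal(cmd, cmd->before, BEFORE);

    /* switch back to the caller's memory context now */
    MemoryContextSwitchTo(cmd->oldmctx);
}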



Andres



Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both
 machines) with fsync=off (as well as synchronous_commit=off still)
 might help clarify things.
 If it increases the TPS of Nate@16MB, but doesn't change the other 3
 situations much, then that suggests the IO system is driving it.
 Basically moving up to 32MB is partially innoculating against slow
 fsyncs upon log switch on that machine.

I had the idea of running pg_test_fsync on each machine.  Here, for
what it's worth, are the numbers.

Nate Boley's box:

2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 116.604 ops/sec
fdatasync 108.654 ops/sec
fsync  20.234 ops/sec
fsync_writethrough            n/a
open_sync  17.979 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync  39.833 ops/sec
fdatasync  58.072 ops/sec
fsync  19.756 ops/sec
fsync_writethrough            n/a
open_sync   8.425 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
 1 * 16kB open_sync write  17.408 ops/sec
 2 *  8kB open_sync writes  9.376 ops/sec
 4 *  4kB open_sync writes  4.912 ops/sec
 8 *  2kB open_sync writes  2.477 ops/sec
16 *  1kB open_sync writes  1.244 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close        19.818 ops/sec
write, close, fsync        19.086 ops/sec

Non-Sync'ed 8kB writes:
write   101176.089 ops/sec

IBM POWER7 server:

2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 167.009 ops/sec
fdatasync 167.091 ops/sec
fsync  47.321 ops/sec
fsync_writethrough            n/a
open_sync  53.735 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync  68.891 ops/sec
fdatasync 164.597 ops/sec
fsync  50.334 ops/sec
fsync_writethrough            n/a
open_sync  25.240 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
 1 * 16kB open_sync write  53.305 ops/sec
 2 *  8kB open_sync writes 23.863 ops/sec
 4 *  4kB open_sync writes 12.500 ops/sec
 8 *  2kB open_sync writes  4.734 ops/sec
16 *  1kB open_sync writes  3.063 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close        48.730 ops/sec
write, close, fsync        48.463 ops/sec

Non-Sync'ed 8kB writes:
write   186465.321 ops/sec

I can't rerun any more serious benchmarks on Nate Boley's box right
now due to other activity on the box.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Tom Lane
Joachim Wieland j...@mcknight.de writes:
 On Sat, Mar 10, 2012 at 9:51 AM, Robert Haas robertmh...@gmail.com wrote:
 -const char *owner, bool withOids,
 +const char *owner,
 +unsigned long int relpages, bool withOids,
 
 The new argument to ArchiveEntry() is unused.  Removing it would
 declutter things a good bit.

 How do you mean it's unused? pg_dump_sort.c uses relpages to dump the
 largest tables first. What you don't want to see in a parallel dump is
 a worker starting to dump a large table while everybody else is
 already idle...

Used or not, I think you could find a less ugly and less invasive way to
pass that around than this.  We should try to avoid adding arguments to
ArchiveEntry that apply to only one object type.

(I'm also unconvinced that sorting by relation size is a good idea
anyway.  Anything that makes the dump order less predictable gets
push-back, IME.)

regards, tom lane



Re: [HACKERS] Measuring relation free space

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 11:10 PM, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Mon, Mar 12, 2012 at 9:41 PM, Noah Misch n...@leadboat.com wrote:

 I created a CF entry for this and marked it Ready for Committer.

 i wasn't sure if creating an entry this late was a good idea or not...
 but now i feel better because it's less probable that it will fall
 through the cracks, thanks

 You left the
 bstrategy variable non-static, but that didn't seem important enough to
 justify another round trip.


 ah! i forgot that...

I committed this, but I didn't like the global variable, so I adjusted
it to pass bstrategy as a parameter where needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 (I'm also unconvinced that sorting by relation size is a good idea
 anyway.  Anything that makes the dump order less predictable gets
 push-back, IME.)
 
Given that people often use diff on files from pg_dump,
unpredictable ordering can be a bad thing.  On the other hand, that
is not something you would probably want to do with the output of a
*parallel* dump, so if it only affects that, it probably makes sense.
It seems like a reasonable heuristic to avoid having everything but
some big table done, and then having to wait for that one table while
the other processors sit idle.
 
-Kevin



Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Andres Freund
On Tuesday, March 13, 2012 02:48:11 PM Tom Lane wrote:
 (I'm also unconvinced that sorting by relation size is a good idea
 anyway.  Anything that makes the dump order less predictable gets
 push-back, IME.)
Why?  Especially in the directory format - which is a prerequisite for
parallel dump if I remember correctly - I don't really see a downside to a
slightly changing dump order, given it's not deterministic anyway.

Andres



[HACKERS] Explicitly specifying use of IN/OUT variable in PL/pgSQL functions

2012-03-13 Thread Joel Jacobson
The introduction of custom_variable_classes and #variable_conflict in
9.0 partly solves the problem with mixing IN/OUT variables with column
names.
In 8.4 and before, it defaulted to the IN/OUT variable, if it shared
the name of a column.
In 9.0 the behaviour was changed to raise an error if a variable
shared the same name as a column.
This was an important and great change, as it catches unintentional and
potentially very dangerous mix-ups of IN/OUT variables and column
names.

But it's still not possible to use the same names for IN/OUT variables
and column names, which is somewhat of a limitation; at the very least
it can be argued that it's ugly.

In situations where it's natural and makes sense to mix IN/OUT
variables and column names, it would be nice to be able to
explicitly specify that you are referring to the IN or OUT variable
with a specific name.
In lack of better ideas, I propose prefixing a conflicting variable with
IN or OUT, to allow using it even if there is a column with a
conflicting name.

Example:

CREATE OR REPLACE FUNCTION Get_UserID(OUT UserID integer, Username
text) RETURNS INTEGER AS $BODY$
#variable_conflict use_column
BEGIN
SELECT UserID INTO OUT.UserID FROM Users WHERE Username = IN.Username;
IF NOT FOUND THEN
INSERT INTO Users (Username) VALUES (IN.Username) RETURNING UserID
INTO STRICT OUT.UserID;
END IF;
RETURN;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;

While IN is a reserved word, OUT is not, so I guess that's a bit of a
problem with existing code.
Perhaps some other words or symbols can be used.



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 12:12:27AM -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Copying the statistics from the old server is on the pg_upgrade TODO
  list.  I have avoided it because it will add an additional requirement
  that will make pg_upgrade more fragile in case of major version changes.
 
  Does anyone have a sense of how often we change the statistics data
  between major versions?
 
 I don't think pg_statistic is inherently any more stable than any other
 system catalog.  We've whacked it around significantly just last week,
 which might color my perception a bit, but there are other changes on
 the to-do list.  (For one example, see nearby complaints about
 estimating TOAST-related costs, which we could not fix without adding
 more stats data.)

Yes, that was my reaction too.  pg_upgrade has worked hard to avoid
copying any system tables, relying on pg_dump to handle that.  

I just received a sobering blog comment stating that pg_upgrade took 5
minutes on a 0.5TB database, but analyze took over an hour:

http://momjian.us/main/blogs/pgblog/2012.html#March_12_2012

Is there some type of intermediate format we could use to dump/restore
the statistics?  Is there an "analyze light" mode we could support that
would run faster?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-03-13 Thread Kohei KaiGai
2012/3/12 Robert Haas robertmh...@gmail.com:
 On Mon, Mar 12, 2012 at 12:30 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Suppose that the connection starts out in context connection_pooler_t.
  Based on the identity of the user, we transition to foo_t, bar_t, or
 baz_t.  If it's possible, by any method, for one of those contexts to
 get back to connection_pooler_t, then we've got a problem.  We give a
 connection to user foo which is in foo_t; he transitions it back to
 connection_pooler_t, then to bar_t, and usurps user bar's privileges.
 Unless there's some way to prevent that, the only way to make this
 secure is to make the transition to foo_t irreversible.

 That is the reason why I advocate the idea of allowing sepgsql_setcon()
 inside of trusted procedures.

 The original use case of Joshua does not allow connection_pooler_t
 to execute any SQL commands except for the invocation of a particular
 trusted procedure; that takes a secret credential as an argument,
 then switches the client label to foo_t, bar_t or baz_t according to
 the supplied credential.
 These labels are allowed to switch back to the original
 connection_pooler_t, but it is not possible to switch to an arbitrary
 label without a suitable credential.

 Oh, I get it.

 Given that that's the intended use case, the current design does make
 sense, but it seems awfully special-purpose.  Not knowing that this is
 what you had in mind, I never would have guessed the reason for all
 this complexity.  I worry that this is too much of a purpose-built
 mechanism, and that nobody will ever be able to use it for much of
 anything beyond the extremely specific use case that you've laid out
 here.  I think that, at the very least, the comments and documentation
 need to make it clear that this is very deliberately intended to
 modify only the toplevel security context of the session, which may be
 different from the currently active context if a TP is in use; and
 also that the change will apply to future transactions only if the
 current transaction commits.

OK, I will try to update the documentation and test cases with the related
security policy, rather than the code base itself.

Please give me a few days to update them.
-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] Explicitly specifying use of IN/OUT variable in PL/pgSQL functions

2012-03-13 Thread Pavel Stehule
Hello

you can use the function name as a qualifier:

create or replace function fx(paramname type, ...)
returns ...
begin
  SELECT INTO fx.paramname, ...

Regards

Pavel Stehule

2012/3/13 Joel Jacobson j...@trustly.com:
 The introduction of custom_variable_classes and #variable_conflict in
 9.0 partly solves the problem with mixing IN/OUT variables with column
 names.
 In 8.4 and before, it defaulted to the IN/OUT variable, if it shared
 the name of a column.
 In 9.0 the behaviour was changed to raise an error if a variable
 shared the same name as a column.
 This was an important and great change, as it catches unintentional and
 potentially very dangerous mix-ups of IN/OUT variables and column
 names.

 But it's still not possible to use the same names for IN/OUT variables
 and column names, which is somewhat of a limitation; at the very least
 it can be argued that it's ugly.

 In situations where it's natural and makes sense to mix IN/OUT
 variables and column names, it would be nice to be able to
 explicitly specify that you are referring to the IN or OUT variable
 with a specific name.
 In lack of better ideas, I propose prefixing a conflicting variable with
 IN or OUT, to allow using it even if there is a column with a
 conflicting name.

 Example:

 CREATE OR REPLACE FUNCTION Get_UserID(OUT UserID integer, Username
 text) RETURNS INTEGER AS $BODY$
 #variable_conflict use_column
 BEGIN
 SELECT UserID INTO OUT.UserID FROM Users WHERE Username = IN.Username;
 IF NOT FOUND THEN
    INSERT INTO Users (Username) VALUES (IN.Username) RETURNING UserID
 INTO STRICT OUT.UserID;
 END IF;
 RETURN;
 END;
 $BODY$ LANGUAGE plpgsql VOLATILE;

 While IN is a reserved word, OUT is not, so I guess that's a bit of a
 problem with existing code.
 Perhaps some other words or symbols can be used.




Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 12:33:09AM -0700, Daniel Farina wrote:
 On Mon, Mar 12, 2012 at 8:10 PM, Bruce Momjian br...@momjian.us wrote:
  To answer your specific question, I think clearing the last analyzed
  fields should cause autovacuum to run on analyze those tables.  What I
  don't know is whether not clearing the last vacuum datetime will cause
  the table not to be analyzed.
 
 Thank you very much for this reference.  I will look into it.

I assume a missing last_analyze would trigger an auto-analyze, but I am
unclear if we assume a last_vacuum included an analyze;  I think you
need to look at autovacuum.c for the details;  let me know if you need
help.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] Explicitly specifying use of IN/OUT variable in PL/pgSQL functions

2012-03-13 Thread Tom Lane
Joel Jacobson j...@trustly.com writes:
 In situations when it's natural and makes sense to mix IN/OUT
 variables and columns names, it would be nice to being able to
 explicitly specifying you are referring to the IN or OUT variable with
 a specific name.

Can't you qualify them with the function name?

regards, tom lane



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 I just received a sobering blog comment stating that pg_upgrade
 took 5 minutes on a 0.5TB database, but analyze took over an hour:
 
Yeah, we have had similar experiences.  Even if this can't be done
for every release or for every data type, bringing over statistics
from the old release as a starting point would really help minimize
downtime on large databases.
 
Of course, release docs should indicate which statistics *won't* be
coming across, and should probably recommend a database ANALYZE or
VACUUM ANALYZE be done when possible.
 
-Kevin



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Alvaro Herrera

Excerpts from Bruce Momjian's message of mar mar 13 11:14:43 -0300 2012:
 On Tue, Mar 13, 2012 at 12:33:09AM -0700, Daniel Farina wrote:
  On Mon, Mar 12, 2012 at 8:10 PM, Bruce Momjian br...@momjian.us wrote:
   To answer your specific question, I think clearing the last analyzed
   fields should cause autovacuum to run on analyze those tables.  What I
   don't know is whether not clearing the last vacuum datetime will cause
   the table not to be analyzed.
  
  Thank you very much for this reference.  I will look into it.
 
 I assume a missing last_analyze would trigger an auto-analyze,

You're wrong.  Autovacuum does not consider time, only dead/live tuple
counts.  The formulas it uses are in the autovacuum docs; some details
(such as the fact that it skips tables that do not have stat entries)
might be missing.
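
For reference, the formula in question (from the autovacuum documentation)
compares the number of tuples inserted, updated, or deleted since the last
ANALYZE against:

    analyze threshold = autovacuum_analyze_threshold
                      + autovacuum_analyze_scale_factor * reltuples

So a freshly upgraded table with no stats entry, or with no recorded
changes, never crosses this threshold on its own.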

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 11:34:16AM -0300, Alvaro Herrera wrote:
 
 Excerpts from Bruce Momjian's message of mar mar 13 11:14:43 -0300 2012:
  On Tue, Mar 13, 2012 at 12:33:09AM -0700, Daniel Farina wrote:
   On Mon, Mar 12, 2012 at 8:10 PM, Bruce Momjian br...@momjian.us wrote:
To answer your specific question, I think clearing the last analyzed
fields should cause autovacuum to run on analyze those tables.  What I
don't know is whether not clearing the last vacuum datetime will cause
the table not to be analyzed.
   
   Thank you very much for this reference.  I will look into it.
  
  I assume a missing last_analyze would trigger an auto-analyze,
 
 You're wrong.  Autovacuum does not consider time, only dead/live tuple
 counts.  The formulas it uses are in the autovacuum docs; some details
 (such as the fact that it skips tables that do not have stat entries)
 might be missing.

Oh, yes.  Thank you for the correction;  not sure what I was thinking.

How would they trigger an autovacuum then?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Alvaro Herrera

Excerpts from Bruce Momjian's message of mar mar 13 11:49:26 -0300 2012:
 
 On Tue, Mar 13, 2012 at 11:34:16AM -0300, Alvaro Herrera wrote:
  
  Excerpts from Bruce Momjian's message of mar mar 13 11:14:43 -0300 2012:
   On Tue, Mar 13, 2012 at 12:33:09AM -0700, Daniel Farina wrote:
On Mon, Mar 12, 2012 at 8:10 PM, Bruce Momjian br...@momjian.us wrote:
 To answer your specific question, I think clearing the last analyzed
 fields should cause autovacuum to run on analyze those tables.  What I
 don't know is whether not clearing the last vacuum datetime will cause
 the table not to be analyzed.

Thank you very much for this reference.  I will look into it.
   
   I assume a missing last_analyze would trigger an auto-analyze,
  
  You're wrong.  Autovacuum does not consider time, only dead/live tuple
  counts.  The formulas it uses are in the autovacuum docs; some details
  (such as the fact that it skips tables that do not have stat entries)
  might be missing.
 
 Oh, yes.  Thank you for the correction;  not sure what I was thinking.
 
 How would they trigger an autovacuum then?

We don't have any mechanism to trigger it currently.  Maybe we could
inject fake messages to the stats collector so that it'd believe the
tables have lots of new tuples and an analyze is necessary.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] subselect in the column list

2012-03-13 Thread Kevin Grittner
amit sehas cu...@yahoo.com wrote:
 If we have a query of the form:
 
 Select *, (Select * FROM T2 WHERE p2 = T1.p1) FROM T1 ORDER
 BY 1 WHERE p3 = 75
 
 In SQL, if a subselect is present in the column list, is the
 result set of this subselect considered to be part of a
 single tuple returned from the outer query, or does the result
 set of the whole query look like a cross product of the results
 of the outer and inner queries?
 
I very much doubt that the above query conforms to the SQL standard.
It doesn't take much effort to find out what PostgreSQL does with
it, if that's what you want to know:
 
test=# create table t1 (id1 int, val1 text);
CREATE TABLE
test=# create table t2 (id2 int, val2 text);
CREATE TABLE
test=# insert into t1 values (1,'one'),(2,'two');
INSERT 0 2
test=# insert into t2 values (1,'einz'),(3,'drei');
INSERT 0 2
test=# select *, (select * from t2 where id2 = id1) from t1;
ERROR:  subquery must return only one column
LINE 1: select *, (select * from t2 where id2 = id1) from t1;
  ^
 
Now, if your goal is to get the value of the row from the subquery
as a RECORD, that can be done; every table has a RECORD definition
with the same name:
 
test=# select *, (select t2 from t2 where id2 = id1) from t1;
 id1 | val1 |    t2
-----+------+----------
   1 | one  | (1,einz)
   2 | two  |
(2 rows)
 
Not only does such a subquery need to be limited to a single column
(in this example a column of type RECORD), but it must only return
one row; otherwise you get a different error:
 
test=# insert into t2 values (1,'uno');
INSERT 0 1
test=# select *, (select t2 from t2 where id2 = id1) from t1;
ERROR:  more than one row returned by a subquery used as an
expression
 
The pgsql-hackers list is intended for discussing the development of
the PostgreSQL product, not for questions about how to use it.  The
question would probably have been more appropriate on pgsql-general.
 
-Kevin



Re: [HACKERS] Command Triggers, patch v11

2012-03-13 Thread Alvaro Herrera

Excerpts from Andres Freund's message of mar mar 13 08:22:26 -0300 2012:

 - I think list_command_triggers should do a heap_lock_tuple(LockTupleShared)
  on the command trigger tuple. But then again just about nothing else does :(

If you want to do something like that, I think it's probably more
appropriate to use LockDatabaseObject than heap_lock_tuple.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 09:28:58AM -0500, Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  I just received a sobering blog comment stating that pg_upgrade
  took 5 minutes on a 0.5TB database, but analyze took over an hour:
  
 Yeah, we have had similar experiences.  Even if this can't be done
 for every release or for every data type, bringing over statistics
 from the old release as a starting point would really help minimize
 downtime on large databases.
  
 Of course, release docs should indicate which statistics *won't* be
 coming across, and should probably recommend a database ANALYZE or
 VACUUM ANALYZE be done when possible.

Having a "works sometimes" behavior is really not good;  some users
aren't going to notice until it is too late that they need to run
analyze.  It is fine for hard-core folks like Kevin, but not for the
average user.

At best, pg_upgrade needs to copy over the statistics it can, and adjust
the system statistics to cause autoanalyze to run on those that can't be
migrated.  Frankly, as Tom stated, we have been adjusting the system
statistics collection so often that I have avoided hard-coding that
information into pg_upgrade --- it could potentially make pg_upgrade
less reliable, whereas vacuumdb --all --analyze always works.

We might decide we want a consistently slow process rather than one that
is sometimes fast and sometimes slow.

As you can see, I am at a loss in how to improve this.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 12:08:41PM -0300, Alvaro Herrera wrote:
   You're wrong.  Autovacuum does not consider time, only dead/live tuple
   counts.  The formulas it uses are in the autovacuum docs; some details
   (such as the fact that it skips tables that do not have stat entries)
   might be missing.
  
  Oh, yes.  Thank you for the correction;  not sure what I was thinking.
  
  How would they trigger an autovacuum then?
 
 We don't have any mechanism to trigger it currently.  Maybe we could
 inject fake messages to the stats collector so that it'd believe the
 tables have lots of new tuples and an analyze is necessary.

Eww!  Yes, I thought some more and realized these are system _views_,
meaning we can't just update them with UPDATE.  It sounds like something
pg_upgrade will have to do with a server-side function someday.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



[HACKERS] CREATE FOREIGN TABLE LACUNA

2012-03-13 Thread David Fetter
Folks,

This is for 9.3, of course.

I noticed that CREATE FOREIGN TABLE (LIKE some_table) doesn't work.  I
believe it should, as it would:

- Remove a POLA violation
- Make data loading into an extant table even easier, especially if
  there need to be filtering or other cleanup steps

Come to think of it, which CREATE TABLE options are inappropriate to
CREATE FOREIGN TABLE?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-13 Thread Bruce Momjian
On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:
 Here's a first attempt at a README illustrating this.  I intend this to
 be placed in src/backend/access/heap/README.tuplock; the first three
 paragraphs are stolen from the comment in heap_lock_tuple, so I'd remove
 those from there, directing people to this new file instead.  Is there
 something that you think should be covered more extensively (or at all)
 here?
...
 
 When there is a single locker in a tuple, we can just store the locking info
 in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
 setting hint bits specifying the locking strength.  There is one exception
 here: since hint bit space is limited, we do not provide a separate hint bit
 for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
 that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
 presumably more commonly used due to being the standards-mandated locking
 mechanism, or heavily used by the RI code, so we want to provide fast paths
 for those.)

Are those tuple bits actually hint bits?  They seem quite a bit more
powerful than a hint.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 9:24 PM, Noah Misch n...@leadboat.com wrote:
 When we lock an update-in-progress row, we walk the t_ctid chain and lock all
 descendant tuples.  They may all have uncommitted xmins.  This is essential to
 ensure that the final outcome of the updating transaction does not affect
 whether the locking transaction has its KEY SHARE lock.  Similarly, when we
 update a previously-locked tuple, we copy any locks (always KEY SHARE locks)
 to the new version.  That new tuple is both uncommitted and has locks, and we
 cannot easily sacrifice either property.  Do you see a way to extend your
 scheme to cover these needs?

No, I think that sinks it.  Good analysis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Website stylesheet for local docs

2012-03-13 Thread Bruce Momjian
On Mon, Feb 27, 2012 at 05:26:04PM +0100, Magnus Hagander wrote:
 On Mon, Feb 27, 2012 at 16:20, Tom Lane t...@sss.pgh.pa.us wrote:
  Magnus Hagander mag...@hagander.net writes:
  On Mon, Feb 27, 2012 at 04:37, Robert Haas robertmh...@gmail.com wrote:
  Why not change the default?  Does anyone really prefer the bare bones
  doc output?
 
  Yes, Peter made a point about preferring that back when we changed the
  developer docs to be on the main website (how it got worse but at
  least he could work on his local build).
 
  FWIW, I don't especially like the website style either --- it's too busy
  calling attention to itself with colored backgrounds etc.
 
 There we go, at least two people, and people who do a lot of builds
 and checks of the docs, like the current format. So I think that's a
 good argument to keep the current format the default, and just add a
 target like my suggestion as an *option* :-)

One other problem with the website style is that there is a zero left
margin, making it harder to read.  That might work fine when displayed
inside our existing website div blocks, but not stand-alone.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation

2012-03-13 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 I came up with a stack of context structures that gets pushed when a
 PL/Python starts being executed and popped when it returns. At first
 they contained just a scratch memory context used by PLyDict_FromTuple.
 Then under the premise of confirming the usefulness of introducing such
 contexts I removed the global PLy_curr_procedure variable and changed
 all users to get the current procedure from the context. It seems to
 have worked, so the total count of global variables is unchanged - hooray!

Applied with some adjustments --- mainly, I thought you were being
too incautious about ensuring that the stack got popped once it'd been
pushed.  The easiest way to fix that was to do the pushes after the
SPI_connect calls, which required decoupling the behavior from
CurrentMemoryContext, which seemed like a good idea anyway.

 While testing I found one more leak, this time caused by allocating a
 structure for caching array type I/O functions and never freeing it.
 Attached as separate patch.

Applied also, but surely if we're leaking memory from the input
descriptors then we should worry about the output ones too?
I made it do that, but if that's wrong, somebody explain why.

regards, tom lane



Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-13 Thread Alvaro Herrera

Excerpts from Bruce Momjian's message of mar mar 13 14:00:52 -0300 2012:
 
 On Tue, Mar 06, 2012 at 04:39:32PM -0300, Alvaro Herrera wrote:

  When there is a single locker in a tuple, we can just store the locking info
  in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
  setting hint bits specifying the locking strength.  There is one exception
  here: since hint bit space is limited, we do not provide a separate hint bit
  for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
  that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, 
  are
  presumably more commonly used due to being the standards-mandated locking
  mechanism, or heavily used by the RI code, so we want to provide fast paths
  for those.)
 
 Are those tuple bits actually hint bits?  They seem quite a bit more
 powerful than a hint.

I'm not sure what's your point.  We've had a hint bit for SELECT FOR
UPDATE for ages.  Even 8.2 had HEAP_XMAX_EXCL_LOCK and
HEAP_XMAX_SHARED_LOCK.  Maybe they are misnamed and aren't really
hints, but it's not the job of this patch to fix that problem.
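
For reference, the bits in question live in the tuple header's t_infomask
right next to the genuine hint bits, which is presumably how they picked up
the name (abridged from htup.h of this era; check the header for the
authoritative values):

#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */
#define HEAP_XMAX_SHARED_LOCK   0x0080  /* xmax is shared locker */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed (a true hint) */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed (a true hint) */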

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree with you that some worst case performance tests should be
 done. Could you please say what you think the worst cases would be, so
 those can be tested? That would avoid wasting time or getting anything
 backwards.

I've thought about this some and here's what I've come up with so far:

1. SELECT FOR SHARE on a large table on a system with no write cache.

2. A small parent table (say 30 rows or so) and a larger child table
with a many-to-one FK relationship to the parent (say 100 child rows
per parent row), with heavy update activity on the child table, on a
system where fsyncs are very slow.  This should generate lots of mxid
consumption, and every 1600 or so mxids (I think) we've got to fsync;
does that generate a noticeable performance hit?

3. It would be nice to test the impact of increased mxid lookups in
the parent, but I've realized that the visibility map will probably
mask a good chunk of that effect, which is a good thing.  Still, maybe
something like this: a fairly large parent table, say a million rows,
but narrow rows, so that many of them fit on a page, with frequent
reads and occasional updates (if there are only reads, autovacuum
might end with all the visibility map bits set); plus a child table
with one or a few rows per parent which is heavily updated.  In theory
this ought to be good for the patch, since the more fine-grained
locking will avoid blocking, but in this case the parent table is
large enough that you shouldn't get much blocking anyway, yet you'll
still pay the cost of mxid lookups because the occasional updates on
the parent will clear VM bits.  This might not be the exactly right
workload to measure this effect, but if it's not maybe someone can
devote a little time to thinking about what would be.

4. A plain old pgbench run or two, to see whether there's any
regression when none of this matters at all...

This isn't exactly a test case, but from Noah's previous comments I
gather that there is a theoretical risk of mxid consumption running
ahead of xid consumption.  We should try to think about whether there
are any realistic workloads where that might actually happen.  I'm
willing to believe that there aren't, but not just because somebody
asserts it.  The reason I'm concerned about this is because, if it
should happen, the result will be more frequent anti-wraparound
vacuums on every table in the cluster.  Those are already quite
painful for some users.

It would be nice if Noah or someone else who has reviewed this patch
in detail could comment further.  I am shooting from the hip here, a
bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Greg Stark
On Tue, Mar 13, 2012 at 1:38 AM, Daniel Farina dan...@heroku.com wrote:
 You probably are going to ask: why not just run ANALYZE and be done
 with it?

Uhm yes. If analyze takes a long time then something is broken. It's
only reading a sample which should be pretty much a fixed number of
pages per table. It shouldn't take much longer on your large database
than on your smaller databases.

Perhaps you're running vacuum analyze by mistake?

If ANALYZE is taking a long time then we're getting the worst of both
worlds.  The statistics are very poor for certain metrics (namely
ndistinct), and the main reason we don't do better is that we don't
want to do a full scan.
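
To put a number on that: the per-column sample size in analyze.c is derived
from the statistics target, not from the table size, roughly along these
lines (paraphrasing std_typanalyze(); treat the exact spelling as an
assumption):

    /* sample enough rows for the requested statistics target */
    stats->minrows = 300 * attr->attstattarget;

    /* with default_statistics_target = 100, that is ~30,000 sampled rows */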


-- 
greg



Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 11:35 PM, Joachim Wieland j...@mcknight.de wrote:
 How do you mean it's unused? pg_dump_sort.c uses relpages to dump the
 largest tables first. What you don't want to see in a parallel dump is
 a worker starting to dump a large table while everybody else is
 already idle...

What I mean is that the function ArchiveEntry() is defined in
pg_backup_archiver.c, and it takes an argument called relpages, and
the string relpages does not appear anywhere else in that file.

 The backend can have a wrapper function around this that calls ereport
 using the error_string and error_code, and any front-end code that
 wants to use this can do so directly.

 I tried this actually (patch attached) but then I wanted to test it
 and couldn't find anything that used pgpipe() on Windows.

 pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
 and the same is true for postmaster/syslogger.c. Am I missing
 something or has this Windows implementation become stale by now? I'll
 append the patch but haven't adapted the pg_dump patch yet to use it.
 Should we still go forward the way you proposed?

Dunno.  Can we get an opinion on that from one of the Windows guys?
Andrew, Magnus?

 +/*
 + * The parallel error handler is called for any die_horribly() in a
 child or master process.
 + * It then takes control over shutting down the rest of the gang.
 + */

 I think this needs to be revised to take control in exit_nicely(),
 maybe by using on_exit_nicely().  Trapping die_horribly() won't catch
 everything.

 It's actually not designed to catch everything. This whole error
 handler thing is only there to report a single error to the user which
 is hopefully the root cause of why everybody is shutting down. Assume
 for example that we cannot get a lock on one table in a worker. Then
 the worker would die_horribly() saying that it cannot get a lock. The
 master would receive that message and shut down. Shutting down for the
 master means killing all the other workers.

 The master terminates because a worker died. And all the other workers
 die because the master killed them. Yet the root cause for the
 termination was the fact that one of the workers couldn't get a lock,
 and this is the one and only message that the user should see.

 If a child terminates without leaving a message, the master will still
 detect it and just say a worker process died unexpectedly (this part
 was actually broken, but now it's fixed :-) )

All that may be true, but I still don't see why it's right for this to
apply in the cases where the worker thread says die_horribly(), but
not in the cases where the worker says exit_horribly().

 Or we change fmtQualifiedId to take an int and then we always pass the
 archive version instead of the Archive* ?

Hmm, I think that might make sense.

 +enum escrow_action { GET, SET };
 +static void
 +parallel_error_handler_escrow_data(enum escrow_action act,
 ParallelState *pstate)
 +{
 +       static ParallelState *s_pstate = NULL;
 +
 +       if (act == SET)
 +               s_pstate = pstate;
 +       else
 +               *pstate = *s_pstate;
 +}

 This seems like a mighty complicated way to implement a global variable.

 Well, we talked about that before, when you complained that you
 couldn't get rid of the global g_conn because of the exit handler.
 You're right that in fact it is an indirect global variable here but
 it's clearly limited to the use of the error handler and you can be
 sure that nobody other than this function writes to it or accesses it
 without calling this function.

Sure, but since all the function does is write to it or access it,
what good does that do me?
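
For comparison, the plain-global-variable version being argued for would be
just the following (the variable name here is made up for illustration):

/* written once when the error handler is installed, read only at exit */
static ParallelState *g_error_handler_pstate = NULL;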

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Andrew Dunstan



On 03/13/2012 01:53 PM, Robert Haas wrote:


I tried this actually (patch attached) but then I wanted to test it
and couldn't find anything that used pgpipe() on Windows.

pg_basebackup/pg_basebackup.c is using it but it's in an #ifndef WIN32
and the same is true for postmaster/syslogger.c. Am I missing
something or has this Windows implementation become stale by now? I'll
append the patch but haven't adapted the pg_dump patch yet to use it.
Should we still go forward the way you proposed?

Dunno.  Can we get an opinion on that from one of the Windows guys?
Andrew, Magnus?




I haven't had time to review this patch or even follow all the 
discussion as I was hoping. I'll try to review the whole thing shortly.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Greg Stark st...@mit.edu wrote:
 Daniel Farina dan...@heroku.com wrote:
 You probably are going to ask: why not just run ANALYZE and be
 done with it?
 
 Uhm yes. If analyze takes a long time then something is broken.
 It's only reading a sample which should be pretty much a fixed
 number of pages per table. It shouldn't take much longer on your
 large database than on your smaller databases.
 
On a small database:
 
cc=# analyze "CaseHist";
ANALYZE
Time: 255.107 ms
cc=# select relpages, reltuples from pg_class where relname =
'CaseHist';
 relpages | reltuples 
--+---
 1264 | 94426
(1 row)
 
Same table on a much larger database (and much more powerful
hardware):
 
cir=# analyze "CaseHist";
ANALYZE
Time: 143450.467 ms
cir=# select relpages, reltuples from pg_class where relname =
'CaseHist';
 relpages |  reltuples  
--+-
  3588659 | 2.12391e+08
(1 row)
 
Either way, there are about 500 tables in the database.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: CREATE OR REPLACE FUNCTION autocomplete

2012-03-13 Thread Peter Eisentraut
On sön, 2012-02-19 at 20:23 +0100, Pavel Stehule wrote:
 other very simple patch - enhance autocomplete to support CREATE OR
 REPLACE FUNCTION statement

I see two problems with this:

- We also have other CREATE OR REPLACE variants that this won't
complete.

- It won't complete something like CREATE OR R<tab>, only CREATE O<tab>.
That would be pretty weird.  I think you need to complete each word
separately for consistent behavior.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 05:46:06PM +, Greg Stark wrote:
 On Tue, Mar 13, 2012 at 1:38 AM, Daniel Farina dan...@heroku.com wrote:
  You probably are going to ask: why not just run ANALYZE and be done
  with it?
 
 Uhm yes. If analyze takes a long time then something is broken. It's
 only reading a sample which should be pretty much a fixed number of
 pages per table. It shouldn't take much longer on your large database
 than on your smaller databases.
 
 Perhaps you're running vacuum analyze by mistake?

pg_upgrade recommends running this command:

vacuumdb --all --analyze-only

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 01:18:58PM -0500, Kevin Grittner wrote:
 Greg Stark st...@mit.edu wrote:
  Daniel Farina dan...@heroku.com wrote:
  You probably are going to ask: why not just run ANALYZE and be
  done with it?
  
  Uhm yes. If analyze takes a long time then something is broken.
  It's only reading a sample which should be pretty much a fixed
  number of pages per table. It shouldn't take much longer on your
  large database than on your smaller databases.
  
 On a small database:
  
 cc=# analyze "CaseHist";
 ANALYZE
 Time: 255.107 ms
 cc=# select relpages, reltuples from pg_class where relname =
 'CaseHist';
  relpages | reltuples 
 --+---
  1264 | 94426
 (1 row)
  
 Same table on a much larger database (and much more powerful
 hardware):
  
 cir=# analyze "CaseHist";
 ANALYZE
 Time: 143450.467 ms
 cir=# select relpages, reltuples from pg_class where relname =
 'CaseHist';
  relpages |  reltuples  
 --+-
   3588659 | 2.12391e+08
 (1 row)
  
 Either way, there are about 500 tables in the database.

That is 2.5 minutes.  How large is that database?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIST OWNED BY...

2012-03-13 Thread Bruce Momjian
On Wed, Feb 29, 2012 at 01:27:43PM -0500, Robert Haas wrote:
 On Wed, Feb 29, 2012 at 12:20 PM, Thom Brown t...@linux.com wrote:
  On 29 February 2012 17:16, Tom Lane t...@sss.pgh.pa.us wrote:
  Thom Brown t...@linux.com writes:
  So could we introduce either a command to show which objects are owned
  by a particular role, or allow a dry-run of DROP OWNED BY?
 
  It's always been possible to do that:
 
         begin;
         drop owned by joe;
         rollback;
 
  I believe this is already the recommended approach if you're concerned
  about what DROP CASCADE will do.
 
  No, the cascade part is fine.  It's the objects which won't cause a
  cascade that are an issue.  Putting it in a transaction for rolling
  back doesn't help find out what it intends to drop.
 
  How can the user tell what the statement would drop (ignoring cascades)?
 
 It's certainly possible to write a query for this, but I think this
 gets back to the old argument about whether every client (and every
 end-user) should be required to reimplement this, or whether maybe we
 ought to provide some server functionality around it.

Is this a TODO?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation)

2012-03-13 Thread Peter Geoghegan
On 2 March 2012 20:10, Tom Lane t...@sss.pgh.pa.us wrote:
 I do intend to take this one up in due course

I probably should have exposed the query_id directly in the
pg_stat_statements view, perhaps as query_hash. The idea of that
would be to advertise the potential non-uniqueness of the value - a
collision is *extremely* unlikely (as I've previously calculated), but
we cannot preclude the possibility, and as such it isn't *really*
usable as a primary key. BTW, even if there is a collision, we at
least know that there can't be a situation where one user's query
entry gets spurious statistics from the execution of some other
user's, or one database gets statistics from another, since their
corresponding oid values separately form part of the dynahash key,
alongside query_id.
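
For illustration, the hashtable key being described has roughly the shape
below; the struct and field names are assumptions made for the sketch, not
the actual pg_stat_statements definitions:

#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid typedef */

/*
 * Sketch of a per-entry hash key: the query hash only needs to be unique
 * within one (user, database) pair, because both OIDs are part of the key,
 * so a hash collision can never mix statistics across users or databases.
 */
typedef struct QueryStatsKey
{
    Oid         userid;         /* OID of the user running the query */
    Oid         dbid;           /* OID of the database */
    uint32_t    query_hash;     /* hash of the canonicalized query tree */
} QueryStatsKey;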

The other reason why I'd like to do this is that I'd like to build on
this work for 9.3, and add a new column - plan_hash. When a new mode,
pg_stat_statements.plan_hash (or somesuch) is disabled (as it is by
default), this is always null, and we get the same 9.2 behaviour. When
it is enabled, however, all existing entries are invalidated, for a
clean slate. We then start hashing both the query tree *and* the query
plan. It's a whole lot less useful if we only hash the latter. Now,
entries within the view use the plan_hash as their key (or maybe a
composite of query_hash and plan_hash). This often results in entries
with duplicate query_hash values, as the planner generates different
plans for equivalent queries, but that doesn't matter; you can easily
write an aggregate query with a GROUP BY query_hash clause if that's
what you happen to want to see.

When this optional mode is enabled, at that point we'd probably also
separately instrument planning time, as recently proposed by Fujii.

Does that seem like an interesting idea?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 On Tue, Mar 13, 2012 at 01:18:58PM -0500, Kevin Grittner wrote:
 
 cir=# analyze "CaseHist";
 ANALYZE
 Time: 143450.467 ms
 cir=# select relpages, reltuples from pg_class where relname =
 'CaseHist';
  relpages |  reltuples  
 --+-
   3588659 | 2.12391e+08
 (1 row)
  
 Either way, there are about 500 tables in the database.
 
 That is 2.5 minutes.  How large is that database?
 
cir=# select pg_size_pretty(pg_database_size('cir'));
 pg_size_pretty 

 2563 GB
(1 row)
 
In case you meant How large is that table that took 2.5 minutes to
analyze?:
 
cir=# select pg_size_pretty(pg_total_relation_size('CaseHist'));
 pg_size_pretty 

 44 GB
(1 row)
 
I've started a database analyze, to see how long that takes.  Even
if each table took 1/4 second (like on the small database) with over
500 user tables, plus the system tables, it'd be 15 minutes.  I'm
guessing it'll run over an hour, but I haven't timed it lately, so
-- we'll see.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 9:59 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:

 (I'm also unconvinced that sorting by relation size is a good idea
 anyway.  Anything that makes the dump order less predictable gets
 push-back, IME.)

 Given that people often use diff on files from pg_dump,
 unpredictable ordering can be a bad thing.  On the other hand, that
 is not something you would probably want to do with the output of a
 *parallel* dump, so if it only affect that, it probably makes sense.
 It seems like a reasonable heuristic to avoid having all but some
 big table done, and having to wait for that while the other
 processors are sitting idle.

Yeah, I think it's a good heuristic.  Finishing the dump in the
minimum possible time is sufficiently similar to the knapsack problem
as to make me suspect that there is no easy way to be certain of
getting the optimal dump order (and we don't even have perfect
information, since we know little about the characteristics of the
underlying storage).  But dumping tables in descending order figures
to get many easy cases right - e.g. suppose there are 200 tables of
size X and 1 table of size 100*X, and we have 3 workers to play with.
If we dump the tables in an essentially random order (relative to
size) then the overall time will get longer the more little tables we
dump before we start the big one.

Now, if we have tables of sizes 10*X, 9*X, 8*X, 6*X, and 5*X and two
workers, then the first worker will get the 10*X table, the second
worker will get the 9*X table, then the second worker will start the
8*X table, then the first worker will get the 6*X and 5*X tables and,
assuming dump time is a uniform function of table size, we'll finish
after 21 time units.  Had we been smarter, we could have assigned the
9*X, 6*X, and 5*X tables to one worker and the 10*X and 8*X tables to
the other and finished in just 20 time units.  There's probably a way
to construct a more extreme example of this, but I think in practice
if there's any loss due to this kind of effect it will be small, and
descending-size order certainly seems more likely to be right than
leaving it to chance.
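
The 21-versus-20 example is easy to check mechanically. Here is a small
standalone C sketch -- an illustration only, not pg_dump code -- that
applies the descending-size greedy rule and then brute-forces the optimum
for the same instance:

/*
 * Illustration only, not pg_dump code: the descending-size greedy rule
 * ("give the next largest table to the worker that is least loaded")
 * applied to the example above, plus a brute-force search for the true
 * optimum of the same instance.
 */
#include <stdio.h>

#define NTABLES  5
#define NWORKERS 2

int
main(void)
{
    int     sizes[NTABLES] = {10, 9, 8, 6, 5};  /* already sorted descending */
    int     load[NWORKERS] = {0, 0};
    int     best = -1;
    int     i, mask;

    /* Greedy heuristic: next largest table goes to the least-loaded worker. */
    for (i = 0; i < NTABLES; i++)
    {
        int target = (load[0] <= load[1]) ? 0 : 1;

        load[target] += sizes[i];
    }
    printf("greedy makespan:  %d\n", load[0] > load[1] ? load[0] : load[1]);

    /* Brute force: try every subset for worker 0 to find the optimum. */
    for (mask = 0; mask < (1 << NTABLES); mask++)
    {
        int a = 0, b = 0, makespan;

        for (i = 0; i < NTABLES; i++)
        {
            if (mask & (1 << i))
                a += sizes[i];
            else
                b += sizes[i];
        }
        makespan = (a > b) ? a : b;
        if (best < 0 || makespan < best)
            best = makespan;
    }
    printf("optimal makespan: %d\n", best);
    return 0;
}

It prints a greedy makespan of 21 and an optimal makespan of 20, matching
the numbers above.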

A bigger problem is dumping relations A and B at the same time might
involve a lot more I/O contention than dumping relations A and C at
the same time if, say, A and B are on the same tablespace and C is
not.  I have no idea what to do about that in general, but for a first
version of this feature I think it's fine to punt.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LIST OWNED BY...

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 2:56 PM, Bruce Momjian br...@momjian.us wrote:
 Is this a TODO?

If you're going to create one, link to Álvaro's message.  But I'm not
sure we should without a better sense of what we actually want to do,
from the options he laid out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Bruce Momjian br...@momjian.us wrote:
 
 That is 2.5 minutes.  How large is that database?
 
I dug around a little and found that we had turned on vacuum cost
limits on the central databases, because otherwise the web team
complained about performance during maintenance windows.  On the
county database we generally don't have users working all night, so
we do maintenance during off hours, and run without cost-based
limits.
 
When the full run completes, I'll try analyze on that table again,
in a session with the limits off.
 
Maybe vacuumdb should have an option to disable the limits, and we
recommend that after pg_upgrade?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Potential reference miscounts and segfaults in plpython.c

2012-03-13 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 Here are the updated patches which use PLy_elog instead of plain elog.
 The difference is that they will get marked for translation and that the
 original Python exception will show up in the errdetail field.

Applied, thanks.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 02:07:14PM -0500, Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  On Tue, Mar 13, 2012 at 01:18:58PM -0500, Kevin Grittner wrote:
  
  cir=# analyze "CaseHist";
  ANALYZE
  Time: 143450.467 ms
  cir=# select relpages, reltuples from pg_class where relname =
  'CaseHist';
   relpages |  reltuples  
  --+-
3588659 | 2.12391e+08
  (1 row)
   
  Either way, there are about 500 tables in the database.
  
  That is 2.5 minutes.  How large is that database?
  
 cir=# select pg_size_pretty(pg_database_size('cir'));
  pg_size_pretty 
 
  2563 GB
 (1 row)
  
 In case you meant How large is that table that took 2.5 minutes to
 analyze?:
  
 cir=# select pg_size_pretty(pg_total_relation_size('CaseHist'));
  pg_size_pretty 
 
  44 GB
 (1 row)
  
 I've started a database analyze, to see how long that takes.  Even
 if each table took 1/4 second (like on the small database) with over
 500 user tables, plus the system tables, it'd be 15 minutes.  I'm
 guessing it'll run over an hour, but I haven't timed it lately, so
 -- we'll see.

OK, so a single 44GB table took 2.5 minutes to analyze;  that is not
good.  It would require 11 such tables to reach 500GB (0.5 TB), and
would take 27 minutes.  The report I had was twice as long, but still in
the ballpark of too long.  :-(

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EncodeDateTime() arguments

2012-03-13 Thread Peter Eisentraut
On lör, 2012-03-10 at 18:47 -0500, Tom Lane wrote:
  void EncodeDateTime(struct pg_tm * tm, fsec_t fsec, const int *tzp, const 
  char *tzn, int style, char *str)
 
 It appears to me that null-ness of tzp and tzn are used as a 3-way flag
 to identify the style of timezone output wanted (none, numeric, or alpha).
 It would probably be better yet if it went like
 
 enum tzstyle, int tzp, const char *tzn
 
 where tzp or tzn would be examined only if tzstyle said so. 

It's not quite a three-way flag, because it also depends on the style,
which time zone style is used.  But I liked the idea of making print
the time zone more explicit and getting rid of the funny pointers.  I
added a bit of documentation and code deduplication in the attached
patch, and it already looks much more understandable.
diff --git i/src/backend/utils/adt/date.c w/src/backend/utils/adt/date.c
index 85e8fd0..2da4e04 100644
--- i/src/backend/utils/adt/date.c
+++ w/src/backend/utils/adt/date.c
@@ -1131,7 +1131,7 @@ time_out(PG_FUNCTION_ARGS)
 	char		buf[MAXDATELEN + 1];
 
 	time2tm(time, tm, &fsec);
-	EncodeTimeOnly(tm, fsec, NULL, DateStyle, buf);
+	EncodeTimeOnly(tm, fsec, false, 0, DateStyle, buf);
 
 	result = pstrdup(buf);
 	PG_RETURN_CSTRING(result);
@@ -1918,7 +1918,7 @@ timetz_out(PG_FUNCTION_ARGS)
 	char		buf[MAXDATELEN + 1];
 
 	timetz2tm(time, tm, &fsec, &tz);
-	EncodeTimeOnly(tm, fsec, &tz, DateStyle, buf);
+	EncodeTimeOnly(tm, fsec, true, tz, DateStyle, buf);
 
 	result = pstrdup(buf);
 	PG_RETURN_CSTRING(result);
diff --git i/src/backend/utils/adt/datetime.c w/src/backend/utils/adt/datetime.c
index f495c3f..56515f1 100644
--- i/src/backend/utils/adt/datetime.c
+++ w/src/backend/utils/adt/datetime.c
@@ -3679,39 +3679,54 @@ EncodeDateOnly(struct pg_tm * tm, int style, char *str)
 
 /* EncodeTimeOnly()
  * Encode time fields only.
+ *
+ * tm and fsec are the value to encode, print_tz determines whether to include
+ * a time zone (the difference between time and timetz types), tz is the
+ * numeric time zone offset, style is the date style, str is where to write the
+ * output.
  */
 void
-EncodeTimeOnly(struct pg_tm * tm, fsec_t fsec, int *tzp, int style, char *str)
+EncodeTimeOnly(struct pg_tm * tm, fsec_t fsec, bool print_tz, int tz, int style, char *str)
 {
 	sprintf(str, "%02d:%02d:", tm->tm_hour, tm->tm_min);
 	str += strlen(str);
 
 	AppendSeconds(str, tm->tm_sec, fsec, MAX_TIME_PRECISION, true);
 
-	if (tzp != NULL)
-		EncodeTimezone(str, *tzp, style);
+	if (print_tz)
+		EncodeTimezone(str, tz, style);
 }
 
 
 /* EncodeDateTime()
  * Encode date and time interpreted as local time.
- * Support several date styles:
+ *
+ * tm and fsec are the value to encode, print_tz determines whether to include
+ * a time zone (the difference between timestamp and timestamptz types), tz is
+ * the numeric time zone offset, tzn is the textual time zone, which if
+ * specified will be used instead of tz by some styles, style is the date
+ * style, str is where to write the output.
+ *
+ * Supported date styles:
  *	Postgres - day mon hh:mm:ss yyyy tz
  *	SQL - mm/dd/yyyy hh:mm:ss.ss tz
  *	ISO - yyyy-mm-dd hh:mm:ss+/-tz
  *	German - dd.mm.yyyy hh:mm:ss tz
  *	XSD - yyyy-mm-ddThh:mm:ss.ss+/-tz
- * Variants (affects order of month and day for Postgres and SQL styles):
- *	US - mm/dd/
- *	European - dd/mm/
  */
 void
-EncodeDateTime(struct pg_tm * tm, fsec_t fsec, int *tzp, char **tzn, int style, char *str)
+EncodeDateTime(struct pg_tm * tm, fsec_t fsec, bool print_tz, int tz, const char *tzn, int style, char *str)
 {
 	int			day;
 
 	Assert(tm->tm_mon >= 1 && tm->tm_mon <= MONTHS_PER_YEAR);
 
+	/*
+	 * Negative tm_isdst means we have no valid time zone translation.
+	 */
+	if (tm->tm_isdst < 0)
+		print_tz = false;
+
 	switch (style)
 	{
 		case USE_ISO_DATES:
@@ -3729,14 +3744,8 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, int *tzp, char **tzn, int style,
 
 			AppendTimestampSeconds(str + strlen(str), tm, fsec);
 
-			/*
-			 * tzp == NULL indicates that we don't want *any* time zone info
-			 * in the output string. *tzn != NULL indicates that we have alpha
-			 * time zone info available. tm_isdst != -1 indicates that we have
-			 * a valid time zone translation.
-			 */
-			if (tzp != NULL && tm->tm_isdst >= 0)
-				EncodeTimezone(str, *tzp, style);
+			if (print_tz)
+				EncodeTimezone(str, tz, style);
 
 			if (tm->tm_year <= 0)
 				sprintf(str + strlen(str), " BC");
@@ -3762,12 +3771,12 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, int *tzp, char **tzn, int style,
 			 * TZ abbreviations in the Olson database are plain ASCII.
 			 */
 
-			if (tzp != NULL && tm->tm_isdst >= 0)
+			if (print_tz)
 			{
-				if (*tzn != NULL)
-					sprintf(str + strlen(str), " %.*s", MAXTZLEN, *tzn);
+				if (tzn)
+					sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);
 				else
-					EncodeTimezone(str, *tzp, style);
+					EncodeTimezone(str, tz, style);
 			}
 
 			if (tm->tm_year <= 0)
@@ -3785,12 +3794,12 @@ EncodeDateTime(struct pg_tm * tm, fsec_t fsec, 

Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Tue, Mar 13, 2012 at 1:38 AM, Daniel Farina dan...@heroku.com wrote:
 You probably are going to ask: why not just run ANALYZE and be done
 with it?

 Uhm yes. If analyze takes a long time then something is broken. It's
 only reading a sample which should be pretty much a fixed number of
 pages per table. It shouldn't take much longer on your large database
 than on your smaller databases.

The data collection work does scale according to the statistics target,
which is something that's crept up quite a lot since the code was
originally written.

I wonder whether it'd be worth recommending that people do an initial
ANALYZE with a low stats target, just to get some stats in place,
and then go back to analyze at whatever their normal setting is.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both
 machines) with fsync=off (as well as synchronous_commit=off still)
 might help clarify things.

I reran the 32-client benchmark on the IBM machine with fsync=off and got this:

32MB: tps = 26809.442903 (including connections establishing)
16MB: tps = 26651.320145 (including connections establishing)

That's a speedup of nearly a factor of two, so clearly fsync-related
stalls are a big problem here, even with wal_buffers cranked up
through the ceiling.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 cir=# analyze "CaseHist";
 ANALYZE
 Time: 143450.467 ms
 
 OK, so a single 44GB tables took 2.5 minutes to analyze;  that is
 not good.  It would require 11 such tables to reach 500GB (0.5
 TB), and would take 27 minutes.  The report I had was twice as
 long, but still in the ballpark of too long.  :-(
 
We have a sister machine to the one used for that benchmark -- same
hardware and database.  The cost limit didn't seem to make much
difference:
 
cir=# set vacuum_cost_delay = 0;
SET
cir=# \timing on 
Timing is on.
cir=# analyze "CaseHist" ;
ANALYZE
Time: 146169.728 ms
 
So it really does seem to take that long.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: pg_stat_statements normalisation without invasive changes to the parser (was: Next steps on pg_stat_statements normalisation)

2012-03-13 Thread Tom Lane
Peter Geoghegan pe...@2ndquadrant.com writes:
 I probably should have exposed the query_id directly in the
 pg_stat_statements view, perhaps as query_hash.

FWIW, I think that's a pretty bad idea; the hash seems to me to be
strictly an internal matter.  Given the sponginess of its definition
I don't really want it exposed to users.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers, patch v11

2012-03-13 Thread Dimitri Fontaine
Hi,

Andres Freund and...@anarazel.de writes:
 I did a short review of what I found after merging master

Thanks!

 - I still find it strange not to fire on cascading actions

We don't build statements for cascaded actions, so we don't fire command
triggers. The user's view is that there was no drop command on the
sub-objects, only on the main one.

I know it's not ideal, but that's a limit we have to bite for 9.2
unfortunately.

 - I dislike the missing locking leading to strange errors upon concurrent
 changes. But then that's just about how all the rest of commands/* is handling
 it...

 - I think list_command_triggers should do a heap_lock_tuple(LockTupleShared)
  on the command trigger tuple. But then again just about nothing else does :(

As you say, about nothing else does. I think that's a work for another
patch.

 - ExecBeforeOrInsteadOfCommandTriggers is referenced in
 exec_command_triggers_internal comments
 - InitCommandContext comments are outdated
 - generally comments look a bit outdated

Fixed.

 - shouldn't the command trigger stuff for ALTER TABLE be done in inside
 AlterTable instead of utility.c?

Right, done.

 - you have repetitions of the following pattern:
 void
 ExecBeforeCommandTriggers(CommandContext cmd)
 {
   /* that will execute under command trigger memory context */
   if (cmd != NULL && cmd->before != NIL)
   	exec_command_triggers_internal(cmd, cmd->before, BEFORE);

   /* switch back to the command Memory Context now */
   MemoryContextSwitchTo(cmd->oldmctx);
 }

 1. Either cmd != NULL does not need to be checked or you need to check it
 before the MemoryContextSwitchTo

I've fixed that code.

 2. the switch to cmd->oldmctx made me very wary at first because I wasn't sure
 its guaranteed to be non NULL

 - why is there a special CommandTriggerContext if its not reset separately?
 Should it be reset? I have to say that I dislike the api around this.

Some call sites need to be able to call those functions a dynamic number
of times. I could add a reset boolean parameter that would mostly be
true in all call sites and false in two of them (RemoveObjects,
RemoveRelations), and add a new function to just reset the memory
context then.

Or maybe you have a better idea about the ideal API here?

 - an AFTER .. ALTER AGGREGATE ... SET SCHEMA has the wrong schema. Probably the
 same problem exists elsewhere. Or is that as-designed? Would be inconsistent
 with the way object names are handled.

I'm surprised, here's an excerpt from the added regression tests:

alter function notfun(int) set schema cmd;
NOTICE:  snitch: BEFORE any ALTER FUNCTION
NOTICE:  snitch: BEFORE ALTER FUNCTION public notfun
NOTICE:  snitch: AFTER ALTER FUNCTION cmd notfun
NOTICE:  snitch: AFTER any ALTER FUNCTION

 - what does that mean?
 +   cmd.objectname = NULL;  /* composite object name */

User mapping and cast object names are composite, and I don't know how
to represent that in a single text structure.

 - DropPropertyStmt seems to be an unused leftover?

Seems so, cleaned out.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Peter Eisentraut
On tis, 2012-03-13 at 15:44 -0400, Tom Lane wrote:
 I wonder whether it'd be worth recommending that people do an initial
 ANALYZE with a low stats target, just to get some stats in place,
 and then go back to analyze at whatever their normal setting is.

Perhaps going even further, ANALYZE could have a quick-and-dirty mode,
where it just analyzes say 10 random pages per table and fills in as
much data as it can from that.  And then it does the real analyze.  (Or
it could even insert fake statistics to trigger autoanalyze.)  That way,
you could have a database back in business in less than a minute.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 OK, so a single 44GB tables took 2.5 minutes to analyze;  that is
 not good.  It would require 11 such tables to reach 500GB (0.5
 TB), and would take 27 minutes.  The report I had was twice as
 long, but still in the ballpark of too long.  :-(
 
But it's really 600 tables of different sizes, which wound up
actually taking:
 
cir=# analyze;
ANALYZE
Time: 3433794.609 ms
 
Just under one hour.
 
Now, if I remember right, the cluster was down for about three
minutes to run pg_upgrade.  Until there are some statistics for key
tables, though, it's not really usable.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 10:10:02PM +0200, Peter Eisentraut wrote:
 On tis, 2012-03-13 at 15:44 -0400, Tom Lane wrote:
  I wonder whether it'd be worth recommending that people do an initial
  ANALYZE with a low stats target, just to get some stats in place,
  and then go back to analyze at whatever their normal setting is.
 
 Perhaps going even further, ANALYZE could have a quick-and-dirty mode,
 where it just analyzes say 10 random pages per table and fills in as
 much data as it can from that.  And then it does the real analyze.  (Or
 it could even insert fake statistics to trigger autoanalyze.)  That way,
 you could have a database back in business in less than a minute.

OK, good idea.  Kevin, can you test this:

  PGOPTIONS='-c default_statistics_target=10' vacuumdb --all --analyze-only

Is it faster?  Thanks.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 OK, good idea.  Kevin, can you test this:
 
   PGOPTIONS='-c default_statistics_target=10' vacuumdb --all
 --analyze-only
 
 Is it faster?  Thanks.
 
Well, I just did something similar in psql -- I disabled the delays
by:
 
set vacuum_cost_delay = 0;
 
I checked for any statistics overrides, and found none (if I did
this right):
 
cir=# select attrelid::regclass, attname, attstattarget from
pg_attribute where attstattarget > 0;
 attrelid | attname | attstattarget 
--+-+---
(0 rows)
 
I went even lower than you suggested:
 
set default_statistics_target = 4;
 
And it was much faster, but still more time than the pg_upgrade run
itself:
 
cir=# analyze;
ANALYZE
Time: 474319.826 ms
 
A little under 8 minutes.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers, patch v11

2012-03-13 Thread Andres Freund
On Tuesday, March 13, 2012 09:07:32 PM Dimitri Fontaine wrote:
 Hi,
 
 Andres Freund and...@anarazel.de writes:
  I did a short review of what I found after merging master
 
 Thanks!
 
  - I still find it strange not to fire on cascading actions
 
 We don't build statement for cascading so we don't fire command
 triggers. The user view is that there was no drop command on the sub
 objects, only on the main one.
 
 I know it's not ideal, but that's a limit we have to bite for 9.2
 unfortunately.
Hm. Especially in partially replicated scenarios I think that will bite. But 
then: There will be a 9.3 at some point ;)

  - I dislike the missing locking leading to strange errors upon
  concurrent changes. But then that's just about how all the rest of commands/*
  is handling it...
  
  - I think list_command_triggers should do a
  heap_lock_tuple(LockTupleShared)
  
   on the command trigger tuple. But then again just about nothing else
   does :(
 
 As you say, about nothing else does. I think that's a work for another
 patch.
Not sure, that way the required work is getting bigger and bigger. But I can 
live with that... I think the command trigger work will make better 
concurrency safety of DDL necessary.

  2. the switch to cmd->oldmctx made me very wary at first because I wasn't
  sure its guaranteed to be non NULL
  
  - why is there a special CommandTriggerContext if its not reset
  separately? Should it be reset? I have to say that I dislike the api
  around this.
 
 Some call sites need to be able to call those functions a dynamic number
 of times. I could add a reset boolean parameter that would mostly be
 true in all call site and false in two of them (RemoveObjects,
 RemoveRelations), and add a new function to just reset the memory
 context then.
 Or maybe you have a better idea about the ideal API here?
I wonder if the answer is making the API more symmetric. Seems more future-
proof in combination to being cleaner.

//create a new memory context
InitCommandContext(cmd);

if(CommandFiresTriggers(cmd)){
//still in current memory context, after all not much memory should be allocated here
cmd.foo = bar;
//switches memory context during function execution, resets it afterwards
ExecBeforeCommandTriggers(cmd);
}

if(CommandFiresTriggers(cmd)){
cmd.zap = blub;
ExecAfterCommandTriggers(cmd);
}

//drop the memory context
CleanupCommandContext(cmd);

I find the changing of memory context in CommandFires[After]Trigger + switch 
back in Exec*CommandTrigger rather bad style and I don't really see the point 
anyway.

  - an AFTER .. ALTER AGGREGATE ... SET SCHEMA has the wrong schema.
  Probably the same problem exists elsewhere. Or is that as-designed?
  Would be inconsistent with the way object names are handled.
 
 I'm surprised, here's an excerpt from the added regression tests:
 
 alter function notfun(int) set schema cmd;
 NOTICE:  snitch: BEFORE any ALTER FUNCTION
 NOTICE:  snitch: BEFORE ALTER FUNCTION public notfun
 NOTICE:  snitch: AFTER ALTER FUNCTION cmd notfun
 NOTICE:  snitch: AFTER any ALTER FUNCTION
I was not looking at ALTER FUNCTION but ALTER AGGREGATE. And I looked wrongly. 
Sorry for that.

Generally, upon rereading, I have to say that I am not very happy with the 
decision that ANY triggers are fired from other places than the specific 
triggers. That seems to be a rather dangerous/confusing route to me. 
Especially because sometimes errors (permissions, duplicated names, etc) are 
raised differently in ANY than in specific triggers now:

postgres=# ALTER AGGREGATE bar.array_agg_union3(anyarray) SET SCHEMA public;
NOTICE:  when BEFORE, tag ALTER AGGREGATE, objectid NULL, schemaname NULL, 
objectname NULL
ERROR:  aggregate bar.array_agg_union3(anyarray) does not exist

postgres=# ALTER AGGREGATE public.array_agg_union3(anyarray) SET SCHEMA 
public;
NOTICE:  when BEFORE, tag ALTER AGGREGATE, objectid NULL, schemaname NULL, 
objectname NULL
ERROR:  function array_agg_union3(anyarray) is already in schema public

postgres=# ALTER AGGREGATE public.array_agg_union3(anyarray) SET SCHEMA bar;
NOTICE:  when BEFORE, tag ALTER AGGREGATE, objectid NULL, schemaname NULL, 
objectname NULL
NOTICE:  when BEFORE, tag ALTER AGGREGATE, objectid 16415, schemaname public, 
objectname array_agg_union3
NOTICE:  when AFTER, tag ALTER AGGREGATE, objectid 16415, schemaname bar, 
objectname array_agg_union3
NOTICE:  when AFTER, tag ALTER AGGREGATE, objectid NULL, schemaname NULL, 
objectname NULL


Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EncodeDateTime() arguments

2012-03-13 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On lör, 2012-03-10 at 18:47 -0500, Tom Lane wrote:
 It appears to me that null-ness of tzp and tzn are used as a 3-way flag
 to identify the style of timezone output wanted (none, numeric, or alpha).

 It's not quite a three-way flag, because it also depends on the style,
 which time zone style is used.  But I liked the idea of making print
 the time zone more explicit and getting rid of the funny pointers.  I
 added a bit of documentation and code deduplication in the attached
 patch, and it already looks much more understandable.

Yeah, looks nicer to me too.

Should we propagate this fix into ecpg's copy of the code as well?
I'm not sure how far the backend has diverged from that copy.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Greg Stark
On Tue, Mar 13, 2012 at 7:30 PM, Bruce Momjian br...@momjian.us wrote:
 OK, so a single 44GB tables took 2.5 minutes to analyze;  that is not
 good.  It would require 11 such tables to reach 500GB (0.5 TB), and
 would take 27 minutes.  The report I had was twice as long, but still in
 the ballpark of too long.  :-(

Well my claim was that it shouldn't scale that way.

Assuming you have the same statistics target the sample size is the
same regardless of the size of the table. When the table is small
there might be a significant number of tuples on the same pages so the
number of pages to read might be much smaller than the sample size. As
the table grows it would eventually reach the point where each page
has about 1 sampled tuple on it but from that point on increasing the
size of the table won't increase the amount of i/o Postgres does.

However on thinking about it further hard drives don't really work
that way. Each page that's read in causes a seek -- if the head is on
a different track. Reading two pages on the same track is about the
same speed as reading a single page on that track. So even though
Postgres is doing the same amount of i/o the hard drive is actually
doing more and more seeks up to the point where each sample is a
separate seek.

analyze.c samples 300 * statistics_target rows -- with the default
target of 100 that means 30,000 rows. So the point where each row is on a
separate page would be around 8kB*30,000 or 240MB. And if each row
causes a seek, and each seek takes 5ms then that should be about 150s
per table. Once your table is large enough to take 150s to analyze it
shouldn't really take any longer when it grows larger. I'm not sure at
what size that would happen but I'm guessing it would be at about 8MB
* 30,000 or about 240GB per table plus or minus an order of magnitude.

So in short, I guess the time to analyze does increase as the table
grows larger. Not because the number of rows being sampled grows but
because the rows are farther apart and that causes more iops.
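
As a back-of-the-envelope check of those numbers, here is a standalone C
sketch under the same assumptions (300 * statistics_target sampled rows,
8 kB pages, worst case of one sampled row per page, and ~5 ms per random
read); it is only the arithmetic from this mail, not what analyze.c does:

/*
 * Back-of-the-envelope check of the reasoning above.  Assumptions, as in
 * the mail: ANALYZE samples 300 * default_statistics_target rows, each
 * sampled row may sit on its own 8 kB page in a large table, and every
 * page fetched costs one ~5 ms seek.  This is the arithmetic only, not
 * what analyze.c actually does internally.
 */
#include <stdio.h>

int
main(void)
{
    const double page_size = 8192.0;    /* bytes per heap page */
    const double seek_ms = 5.0;         /* assumed cost of one random read */
    const int targets[] = {4, 10, 100};
    int     i;

    for (i = 0; i < 3; i++)
    {
        long    sample_rows = 300L * targets[i];
        double  worst_mb = sample_rows * page_size / (1024.0 * 1024.0);
        double  worst_s = sample_rows * seek_ms / 1000.0;

        printf("target %3d: %6ld sampled rows, ~%.0f MB touched, ~%.0f s of seeks\n",
               targets[i], sample_rows, worst_mb, worst_s);
    }
    return 0;
}

For the default target of 100 that works out to 30,000 rows, roughly 234 MB
touched, and about 150 seconds of seeks per table, which is in line with the
2.5-minute ANALYZE timings reported earlier in the thread.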

hmph. One thing that could speed up analyze on raid arrays would be
doing prefetching so more than one spindle can be busy. Sacrificing
statistical accuracy by reading a less random sample on contiguous
blocks of rows would also be faster but less accurate -- we've gone
around on that topic at least once in the past.

+1 to the idea that pg_upgrade could run analyze with a
statistics_target of 10 and then let autovacuum analyze it again later
with the normal targets at its leisure.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 09:28:33PM +, Greg Stark wrote:
 hmph. One thing that could speed up analyze on raid arrays would be
 doing prefetching so more than one spindle can be busy. Sacrificing
 statistical accuracy by reading a less random sample on contiguous
 blocks of rows would also be faster but less accurate -- we've gone
 around on that topic at least once in the past.
 
 +1 to the idea that pg_upgrade could run analyze with a
 statistics_target of 10 and then let autovacuum analyze it again later
 with the normal targets at its leisure.

Well, that's going to take quite a bit of infrastructure.  We can tell
users to use the PGOPTIONS setting when they run vacuumdb, but we are
going to need some way to cause autovacuum to run later with a higher
statistics target  ---  right now I don't think autovacuum cares how
large a statistics target the previous analyze run used.  It might be a
simple case of causing autovacuum to analyze if it has statistics that
don't match the current statistics target, but I don't think autovacuum
has easy access to those values.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 03:29:22PM -0500, Kevin Grittner wrote:
 I went even lower than you suggested:
  
 set default_statistics_target = 4;
  
 And it was much faster, but still more time than the pg_upgrade run
 itself:
  
 cir=# analyze;
 ANALYZE
 Time: 474319.826 ms
  
 A little under 8 minutes.

What is the target=10 duration?  I think 10 is as low as we can
acceptably recommend.  Should we recommend they run vacuumdb twice, once
with default_statistics_target = 4, and another with the default?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both
 machines) with fsync=off (as well as synchronous_commit=off still)
 might help clarify things.

 I reran the 32-client benchmark on the IBM machine with fsync=off and got 
 this:

 32MB: tps = 26809.442903 (including connections establishing)
 16MB: tps = 26651.320145 (including connections establishing)

 That's a speedup of nearly a factor of two, so clearly fsync-related
 stalls are a big problem here, even with wal_buffers cranked up
 through the ceiling.

And here's a tps plot with wal_buffers = 16MB, fsync = off.  The
performance still bounces up and down, so there's obviously some other
factor contributing to latency spikes, but equally clearly, needing to
wait for fsyncs makes it a lot worse.  I bet if we could understand
why that happens, we could improve things here a good deal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
attachment: tps-16MB.no-fsync.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 4:55 AM, Andres Freund and...@anarazel.de wrote:
 On Tuesday, March 13, 2012 03:26:34 AM Robert Haas wrote:
 Meanwhile, here are some TPS graphs at 16MB and 32MB on the IBM POWER7
 machine.  32 clients, 1800 seconds, scale factor 300, synchronous
 commit off.
 That graph makes me cringe because its pretty representative of what I have
 seen in practise. Any chance we can get a ~10s average in there? I find it
 pretty hard to read more than spikey.

I tried that but don't find it an improvement; however I'm attaching
an example for your consideration.

 Whats the filesystem + options on this?

ext4, default options.  Fedora 16, 3.2.6-3.fc16.ppc64.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
attachment: tps-16MB-10s.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 5:42 PM, Bruce Momjian br...@momjian.us wrote:
 What is the target=10 duration?  I think 10 is as low as we can
 acceptably recommend.  Should we recommend they run vacuumdb twice, once
 with default_statistics_target = 4, and another with the default?

I'm not sure why we're so glibly rejecting Dan's original proposal.
Sure, adjusting pg_upgrade when we whack around pg_statistic is work,
but who ever said that a workable in-place upgrade facility would be
maintenance-free?  We're operating under a number of restrictions
imposed by the need to be pg_upgrade-compatible, and this doesn't
strike me as a particularly severe one by comparison -- we can always
arrange to NOT migrate statistics between incompatible versions; that
doesn't mean that we shouldn't migrate them when they ARE compatible.
Also, unlike the alternatives thus far proposed, Dan's idea actually
fixes the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 What is the target=10 duration?  I think 10 is as low as we can
 acceptably recommend.  Should we recommend they run vacuumdb
 twice, once with default_statistics_target = 4, and another with
 the default?
 
Here are the results at various settings.
 
1   :  172198.892 ms
2   :  295536.814 ms
4   :  474319.826 ms
10  :  750458.312 ms
100 :  3433794.609 ms
 
I'm not sure what's best for a general approach to the problem.  For
my own part, I'd be inclined to cherry-pick tables if I were in a
hurry.
 
I hope we at least bring over relpages and reltuples, to give the
optimizer *some* clue what it's looking at.  I wouldn't think those
would be changing semantics or format very often.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] INHERIT vs INHERITS

2012-03-13 Thread Jaime Casanova
Hi,

Has anyone ever complained about the inconsistency (at least it seems
that way to me) of using INHERITS in CREATE TABLE and INHERIT in ALTER
TABLE?
Does anyone besides me think we should fix that? Maybe support both
versions in both commands, or choose one and use it everywhere?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 I'm not sure why we're so glibly rejecting Dan's original
 proposal.  Sure, adjusting pg_upgrade when we whack around
 pg_statistic is work, but who ever said that a workable in-place
 upgrade facility would be maintenance-free?  We're operating under
 a number of restrictions imposed by the need to be pg_upgrade-
 compatible, and this doesn't strike me as a particularly severe
 one by comparison -- we can always arrange to NOT migrate
 statistics between incompatible versions; that doesn't mean that
 we shouldn't migrate them when they ARE compatible.  Also, unlike
 the alternatives thus far proposed, Dan's idea actually fixes the
 problem.
 
In case it got lost with my various timings, I agree with Robert on
all of the above.  The three-minute downtime for pg_upgrade to
upgrade our multi-TB databases is *very* impressive; but I think we
lose bragging rights if we follow that up with -- oh, but the
database isn't really fully *usable* until you run a one-hour
analyze afterward.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Josh Berkus

 That's a speedup of nearly a factor of two, so clearly fsync-related
 stalls are a big problem here, even with wal_buffers cranked up
 through the ceiling.

H.   Do you have any ability to test on XFS?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INHERIT vs INHERITS

2012-03-13 Thread Kevin Grittner
Jaime Casanova ja...@2ndquadrant.com wrote:
 
 Has anyone ever complained about the inconsistency (at least it
 seems that to me) of using INHERITS in CREATE TABLE and INHERIT in
 ALTER TABLE?
 
They make sense to me as an English speaker.  CREATE TABLE more or
less has a *description* of the table to be created, while ALTER
TABLE contains *actions* to take.  ADD COLUMN, DROP COLUMN, INHERIT.
They are sub-commands.
 
In normal conversation you might say "Jane rides the bus to work." 
But if you want to change that, you might say "Jane, ride your
bicycle to work."
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Daniel Farina
On Tue, Mar 13, 2012 at 3:30 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 5:42 PM, Bruce Momjian br...@momjian.us wrote:
 What is the target=10 duration?  I think 10 is as low as we can
 acceptably recommend.  Should we recommend they run vacuumdb twice, once
 with default_statistics_target = 4, and another with the default?

 I'm not sure why we're so glibly rejecting Dan's original proposal.

 Dan's idea actually fixes the problem.

I appreciate your support, but I don't feel dismissed; I see the
obvious appeal of not having to port the catalog if a fast/good enough
regeneration technique can be found, so I'm glad people are trying
those out and measuring things.  I think the main problem that is hard
to work around lies in our inability to trigger autoanalyze one-shot.

I can't really speak on the behalf of a smaller operation (where
pg_upgrade taking on the task of hacking catalogs is clearly very
desirable -- worth serious consideration), but for the scale of our
operation having to do our own catalog hacking at the cost of
possible-terribleness that can result from a botched pg_statistic is
not hugely concerning.  Rather, the more insurmountable and general
problem we keep encountering is how we can trigger throttled
maintenance on a special basis.  It is definitely in my interest to --
some day -- be able to run VACUUM FULL and REINDEX (provided
incremental-self-defragmenting indexes don't get written first)
without any disastrous impact on the user at all, including when they
attempt to drop the table (in which we should yield the lock and let
it happen) or alter its column definition.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Andrew Dunstan



On 03/13/2012 06:30 PM, Robert Haas wrote:

On Tue, Mar 13, 2012 at 5:42 PM, Bruce Momjianbr...@momjian.us  wrote:

What is the target=10 duration?  I think 10 is as low as we can
acceptably recommend.  Should we recommend they run vacuumdb twice, once
with default_statistics_target = 4, and another with the default?

I'm not sure why we're so glibly rejecting Dan's original proposal.
Sure, adjusting pg_upgrade when we whack around pg_statistic is work,
but who ever said that a workable in-place upgrade facility would be
maintenance-free?  We're operating under a number of restrictions
imposed by the need to be pg_upgrade-compatible, and this doesn't
strike me as a particularly severe one by comparison -- we can always
arrange to NOT migrate statistics between incompatible versions; that
doesn't mean that we shouldn't migrate them when they ARE compatible.
Also, unlike the alternatives thus far proposed, Dan's idea actually
fixes the problem.


I agree.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Chronic performance issue with Replication Failover and FSM.

2012-03-13 Thread Josh Berkus
All,

I've discovered a built-in performance issue with replication failover
at one site, which I couldn't find searching the archives.  I don't
really see what we can do to fix it, so I'm posting it here in case
others might have clever ideas.

1. The Free Space Map is not replicated between servers.

2. Thus, when we fail over to a replica, it starts with a blank FSM.

3. I believe replica also starts with zero counters for autovacuum.

4. On a high-UPDATE workload, this means that the replica assumes tables
have no free space until it starts to build a new FSM or autovacuum
kicks in on some of the tables, much later on.

5. If your hosting is such that you fail over a lot (such as on AWS),
then this causes cumulative table bloat which can only be cured by a
VACUUM FULL.

I can't see any way around this which wouldn't also bog down
replication.  Clever ideas, anyone?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 05:33:29PM -0500, Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  What is the target=10 duration?  I think 10 is as low as we can
  acceptably recommend.  Should we recommend they run vacuumdb
  twice, once with default_statistics_target = 4, and another with
  the default?
  
 Here are the results at various settings.
  
 1   :  172198.892 ms
 2   :  295536.814 ms
 4   :  474319.826 ms
 10  :  750458.312 ms
 100 :  3433794.609 ms

Thanks, good numbers to know.

 I'm not sure what's best for a general approach to the problem.  For
 my own part, I'd be inclined to cherry-pick tables if I were in a
 hurry.
  
 I hope we at least bring over relpages and reltuples, to give the
 optimizer *some* clue what it's looking at.  I wouldn't thing those
 would be changing semantics or format very often.

True, but we don't migrate them either.

This is the exact same problem you would have restoring a pg_dump
backup.  The improvement needs to go into pg_dump, and then pg_upgrade
can make use of it.

Another idea is to just copy over pg_statistic like we copy
pg_largeobject now, and force autovacuum to run.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 Another idea is to just copy over pg_statistic like we copy over
 pg_largeobject now, and force autovacuum to run.

That would be an automatic crash in a 9.1 to 9.2 migration, as well as
any other release where we changed the column layout of pg_statistic.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 08:30:17PM -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  Another idea is to just copy over pg_statistic like we copy over
  pg_largeobject now, and force autovacuum to run.
 
 That would be an automatic crash in a 9.1 to 9.2 migration, as well as
 any other release where we changed the column layout of pg_statistic.

It might be a solution for cases where we don't modify it.  I frankly am
worried that if we copy over statistics even in ASCII that don't match
what the server expects, it might lead to a crash, which has me back to
wanting to speed up vacuumdb.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Bruce Momjian
On Tue, Mar 13, 2012 at 08:22:51PM -0400, Bruce Momjian wrote:
 On Tue, Mar 13, 2012 at 05:33:29PM -0500, Kevin Grittner wrote:
  Bruce Momjian br...@momjian.us wrote:
   
   What is the target=10 duration?  I think 10 is as low as we can
   acceptably recommend.  Should we recommend they run vacuumdb
   twice, once with default_statistics_target = 4, and another with
   the default?
   
  Here are the results at various settings.
   
  1   :  172198.892 ms
  2   :  295536.814 ms
  4   :  474319.826 ms
  10  :  750458.312 ms
  100 :  3433794.609 ms
 
 Thanks, good numbers to know.

OK, new crazy idea.  Kevin has shown that his database can get a single
bucket in 2.8 minutes.  What if we have pg_upgrade create a vacuumdb
script that generates increasingly accurate statistics, e.g. one that
runs with default_statistics_target values of 1, 10, and the default
(100)?  That would give basic statistics quickly (2.8 minutes), and full
statistics in an hour, for Kevin's database.

PGOPTIONS='-c default_statistics_target=1' vacuumdb --all --analyze-only
PGOPTIONS='-c default_statistics_target=10' vacuumdb --all --analyze-only
vacuumdb --all --analyze-only

The only problem I see is that users who have set non-default statistics
targets on their tables would not be affected by the increasing
default_statistics_target values.
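
For what it's worth, a rough catalog query (just a sketch) to spot the
columns that have an explicit target set would be:

-- attstattarget = -1 means "use default_statistics_target"
SELECT n.nspname, c.relname, a.attname, a.attstattarget
  FROM pg_attribute a
  JOIN pg_class c ON c.oid = a.attrelid
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE a.attstattarget >= 0
   AND a.attnum > 0
   AND NOT a.attisdropped
   AND c.relkind = 'r';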

The upside is this would work for all releases of Postgres.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Chronic performance issue with Replication Failover and FSM.

2012-03-13 Thread Daniel Farina
On Tue, Mar 13, 2012 at 4:53 PM, Josh Berkus j...@agliodbs.com wrote:
 All,

 I've discovered a built-in performance issue with replication failover
 at one site, which I couldn't find searching the archives.  I don't
 really see what we can do to fix it, so I'm posting it here in case
 others might have clever ideas.

 1. The Free Space Map is not replicated between servers.

 2. Thus, when we fail over to a replica, it starts with a blank FSM.

 3. I believe replica also starts with zero counters for autovacuum.

 4. On a high-UPDATE workload, this means that the replica assumes tables
 have no free space until it starts to build a new FSM or autovacuum
 kicks in on some of the tables, much later on.

 5. If your hosting is such that you fail over a lot (such as on AWS),
 then this causes cumulative table bloat which can only be cured by a
 VACUUM FULL.

 I can't see any way around this which wouldn't also bog down
 replication.  Clever ideas, anyone?

Would it bog it down by much?

(1 byte per 8kB page) * 2TB = 250MB.  Even if you doubled or tripled it for
pointer-overhead reasons it's pretty modest, whereas VACUUM traffic is
already pretty intense.  Still, it's clearly...work.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Fujii Masao
On Wed, Mar 14, 2012 at 7:20 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both
 machines) with fsync=off (as well as synchronous_commit=off still)
 might help clarify things.

 I reran the 32-client benchmark on the IBM machine with fsync=off and got 
 this:

 32MB: tps = 26809.442903 (including connections establishing)
 16MB: tps = 26651.320145 (including connections establishing)

 That's a speedup of nearly a factor of two, so clearly fsync-related
 stalls are a big problem here, even with wal_buffers cranked up
 through the ceiling.

 And here's a tps plot with wal_buffers = 16MB, fsync = off.  The
 performance still bounces up and down, so there's obviously some other
 factor contributing to latency spikes

Initialization of WAL files?  Do the latency spikes disappear if you start
the benchmark after preparing lots of recycled WAL files?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Chronic performance issue with Replication Failover and FSM.

2012-03-13 Thread Fujii Masao
On Wed, Mar 14, 2012 at 8:53 AM, Josh Berkus j...@agliodbs.com wrote:
 All,

 I've discovered a built-in performance issue with replication failover
 at one site, which I couldn't find searching the archives.  I don't
 really see what we can do to fix it, so I'm posting it here in case
 others might have clever ideas.

 1. The Free Space Map is not replicated between servers.

 2. Thus, when we fail over to a replica, it starts with a blank FSM.

 3. I believe replica also starts with zero counters for autovacuum.

 4. On a high-UPDATE workload, this means that the replica assumes tables
 have no free space until it starts to build a new FSM or autovacuum
 kicks in on some of the tables, much later on.

If it's really a high-UPDATE workload, wouldn't autovacuum start soon?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 10:02 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Mar 14, 2012 at 7:20 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 13, 2012 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Mar 12, 2012 at 4:45 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Rerunning all 4 benchmarks (both 16MB and 32MB wal_buffers on both
 machines) with fsync=off (as well as synchronous_commit=off still)
 might help clarify things.

 I reran the 32-client benchmark on the IBM machine with fsync=off and got 
 this:

 32MB: tps = 26809.442903 (including connections establishing)
 16MB: tps = 26651.320145 (including connections establishing)

 That's a speedup of nearly a factor of two, so clearly fsync-related
 stalls are a big problem here, even with wal_buffers cranked up
 through the ceiling.

 And here's a tps plot with wal_buffers = 16MB, fsync = off.  The
 performance still bounces up and down, so there's obviously some other
 factor contributing to latency spikes

 Initialization of WAL file? Do the latency spikes disappear if you start
 benchmark after you prepare lots of the recycled WAL files?

The latency spikes seem to correspond to checkpoints, so I don't think
that's it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Too many IO?

2012-03-13 Thread Tatsuo Ishii
I have created a 29GB test database using the standard pgbench -i -s
2000.  Then I executed:

explain (analyze, buffers) select * from pgbench_accounts where aid in
(select cast(random()*2 as int) from generate_series(1,500));

Nested Loop  (cost=30.00..6075.07 rows=1 width=97) (actual time=23.051..13570.739 rows=500 loops=1)
   Buffers: shared hit=1255 read=1250
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.474..0.723 rows=500 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.097..0.264 rows=500 loops=1)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..30.20 rows=1 width=97) (actual time=27.085..27.138 rows=1 loops=500)
         Index Cond: (aid = (((random() * 2::double precision))::integer))
         Buffers: shared hit=1255 read=1250
 Total runtime: 13571.020 ms

As you can see, this query generated 1255+1250 = 2505 block reads,
either from the buffer cache or from disk. In my understanding, each
lookup accesses an index tuple, which needs the root page, some number
of intermediate pages (I mean index pages that are neither the root nor
a leaf) and 1 leaf page, and then 1 heap block. So I expected the total
number of I/Os to be roughly:

500 index leaf pages + 500 heap blocks = 1000

However, I saw 1505 more accesses in total. My guess is that this
number mainly comes from those intermediate index page accesses, i.e.
we need about 3 index page accesses (to traverse the B-tree) before
reaching the leaf page on average. Am I correct, or is that number
excessive?
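
For what it's worth, I guess I could check the actual tree depth with
contrib/pgstattuple (just a sketch):

-- tree_level is the number of B-tree levels above the leaf level
CREATE EXTENSION pgstattuple;
SELECT tree_level, internal_pages, leaf_pages
  FROM pgstatindex('pgbench_accounts_pkey');

If tree_level comes out as 3, each lookup would touch about 4 index
pages plus 1 heap page, i.e. 500 * 5 = 2500, which is close to the
2505 above.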
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Chronic performance issue with Replication Failover and FSM.

2012-03-13 Thread Daniel Farina
On Tue, Mar 13, 2012 at 7:05 PM, Fujii Masao masao.fu...@gmail.com wrote:
 If it's really a high-UPDATE workload, wouldn't autovacuum start soon?

Also, while vacuum cleanup records are applied, could not the standby
also update its free space map, without having to send the actual FSM
updates?  I guess that's bogging down of another variety.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wal_buffers, redux

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 6:44 PM, Josh Berkus j...@agliodbs.com wrote:
 That's a speedup of nearly a factor of two, so clearly fsync-related
 stalls are a big problem here, even with wal_buffers cranked up
 through the ceiling.

 H.   Do you have any ability to test on XFS?

It seems I do.

XFS, with fsync = on:
tps = 14746.687499 (including connections establishing)
XFS, with fsync = off:
tps = 25121.876560 (including connections establishing)

No real dramatic difference there, maybe a bit slower.

On further thought, it may be that this is just a simple case of too
many checkpoints.  With fsync=off, we don't have to actually write all
that dirty data back to disk.  I'm going to try cranking up
checkpoint_segments and see what happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command Triggers, patch v11

2012-03-13 Thread Robert Haas
On Tue, Mar 13, 2012 at 5:06 PM, Andres Freund and...@anarazel.de wrote:
 Generally, upon rereading, I have to say that I am not very happy with the
 decision that ANY triggers are fired from other places than the specific
 triggers. That seems to be a rather dangerous/confusing route to me.

I agree. I think that's a complete non-starter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks, 2nd attempt

2012-03-13 Thread Noah Misch
On Tue, Mar 13, 2012 at 01:46:24PM -0400, Robert Haas wrote:
 On Mon, Mar 12, 2012 at 3:28 PM, Simon Riggs si...@2ndquadrant.com wrote:
  I agree with you that some worst case performance tests should be
  done. Could you please say what you think the worst cases would be, so
  those can be tested? That would avoid wasting time or getting anything
  backwards.
 
 I've thought about this some and here's what I've come up with so far:
 
 1. SELECT FOR SHARE on a large table on a system with no write cache.

Easy enough that we may as well check it.  Share-locking an entire large table
is impractical in a real application, so I would not worry if this shows a
substantial regression.

 2. A small parent table (say 30 rows or so) and a larger child table
 with a many-to-one FK relationship to the parent (say 100 child rows
 per parent row), with heavy update activity on the child table, on a
 system where fsyncs are very slow.  This should generate lots of mxid
 consumption, and every 1600 or so mxids (I think) we've got to fsync;
 does that generate a noticeable performance hit?

More often than that; each 2-member mxid takes 4 bytes in an offsets file and
10 bytes in a members file.  So, more like one fsync per ~580 mxids.  Note
that we already fsync the multixact SLRUs today, so any increase will arise
from the widening of member entries from 4 bytes to 5.  The realism of this
test is attractive.  Nearly-static parent tables are plenty common, and this
test will illustrate the impact on those designs.
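
For concreteness, a sketch of the kind of schema I would use for that
test (names made up):

-- ~30-row parent, ~100 child rows per parent, FK from child to parent
CREATE TABLE parent (id int PRIMARY KEY);
INSERT INTO parent SELECT generate_series(1, 30);

CREATE TABLE child (
    id        serial PRIMARY KEY,
    parent_id int NOT NULL REFERENCES parent (id),
    payload   int NOT NULL DEFAULT 0
);
INSERT INTO child (parent_id)
  SELECT (g % 30) + 1 FROM generate_series(1, 3000) AS g;

-- A pgbench custom script would then hammer the child rows, e.g.:
--   \setrandom cid 1 3000
--   UPDATE child SET payload = payload + 1 WHERE id = :cid;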

 3. It would be nice to test the impact of increased mxid lookups in
 the parent, but I've realized that the visibility map will probably
 mask a good chunk of that effect, which is a good thing.  Still, maybe
 something like this: a fairly large parent table, say a million rows,
 but narrow rows, so that many of them fit on a page, with frequent
 reads and occasional updates (if there are only reads, autovacuum
 might end with all the visibility map bits set); plus a child table
 with one or a few rows per parent which is heavily updated.  In theory
 this ought to be good for the patch, since the the more fine-grained
 locking will avoid blocking, but in this case the parent table is
 large enough that you shouldn't get much blocking anyway, yet you'll
 still pay the cost of mxid lookups because the occasional updates on
 the parent will clear VM bits.  This might not be the exactly right
 workload to measure this effect, but if it's not maybe someone can
 devote a little time to thinking about what would be.

You still have HEAP_XMAX_{INVALID,COMMITTED} to reduce the pressure on mxid
lookups, so I think something more sophisticated is needed to exercise that
cost.  Not sure what.

 4. A plain old pgbench run or two, to see whether there's any
 regression when none of this matters at all...

Might as well.

 This isn't exactly a test case, but from Noah's previous comments I
 gather that there is a theoretical risk of mxid consumption running
 ahead of xid consumption.  We should try to think about whether there
 are any realistic workloads where that might actually happen.  I'm
 willing to believe that there aren't, but not just because somebody
 asserts it.  The reason I'm concerned about this is because, if it
 should happen, the result will be more frequent anti-wraparound
 vacuums on every table in the cluster.  Those are already quite
 painful for some users.

Yes.  Pre-release, what can we really do here other than have more people
thinking about ways it might happen in practice?  Post-release, we could
suggest monitoring methods or perhaps have VACUUM emit a WARNING when a table
is using more mxid space than xid space.


Also consider a benchmark that does plenty of non-key updates on a parent
table with no activity on the child table.  We'll pay the overhead of
determining that the key column(s) have not changed, but it will never pay off
by preventing a lock wait.  Granted, this is barely representative of
application behavior.  Perhaps, too, we already have a good sense of this cost
from the HOT benchmarking efforts and have no cause to revisit it.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and statistics

2012-03-13 Thread Euler Taveira
On 13-03-2012 21:34, Bruce Momjian wrote:
 It might be a solution for cases where we don't modify it.  I frankly am
 worried that if we copy over statistics even in ASCII that don't match
 what the server expects, it might lead to a crash, which has me back to
 wanting to speed up vacuumdb.
 
That was discussed in another thread some time ago [1]. Adopting a hackish
solution is not the way to go. It could lead to bad consequences in the near
future.

For 9.2, we could advise users to divide the ANALYZE step into per-table
ANALYZE steps and run them all in parallel. The per-table split could be
accomplished with a simple approach like '... row_number() OVER (ORDER BY
pg_relation_size(oid)) ... WHERE row_number % n = x' (tip stolen from
Simon's book).
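
Something like this sketch would emit one bucket's worth of ANALYZE
commands (n = 4 parallel sessions, this one being bucket x = 0); feed
each bucket's output to its own psql session:

SELECT 'ANALYZE ' || quote_ident(nspname) || '.' || quote_ident(relname) || ';'
  FROM (SELECT n.nspname, c.relname,
               row_number() OVER (ORDER BY pg_relation_size(c.oid) DESC) AS rn
          FROM pg_class c
          JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE c.relkind = 'r'
           AND n.nspname NOT IN ('pg_catalog', 'information_schema')) t
 WHERE rn % 4 = 0;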


[1] http://archives.postgresql.org/message-id/4f10a728.7090...@agliodbs.com


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consulting, Development, 24x7 Support and Training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb and fsync

2012-03-13 Thread Jeff Davis
On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
 for recursively everything in dir:
posix_fadvise(fd, POSIX_FADV_DONTNEED);
 
 for recursively everything in dir:
fsync(fd);

Wow, that made a huge difference!

  no sync:       ~  1.0s
  sync:          ~ 10.0s
  fadvise+sync:  ~  1.3s

Patch attached.

Now I feel much better about it. Most people will either have fadvise, a
write cache (rightly or wrongly), or actually need the sync. Those that
have none of those can use -N.

Regards,
Jeff Davis


initdb-fsync-20120313.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-03-13 Thread Joachim Wieland
On Tue, Mar 13, 2012 at 1:53 PM, Robert Haas robertmh...@gmail.com wrote:
 What I mean is that the function ArchiveEntry() is defined in
 pg_backup_archiver.c, and it takes an argument called relpages, and
 the string relpages does not appear anywhere else in that file.

Uhm, that's kinda concerning, isn't it... fixed...


[...pgpipe...]
 Dunno.  Can we get an opinion on that from one of the Windows guys?
 Andrew, Magnus?

Waiting for the verdict here...


 If a child terminates without leaving a message, the master will still
 detect it and just say a worker process died unexpectedly (this part
 was actually broken, but now it's fixed :-) )

 All that may be true, but I still don't see why it's right for this to
 apply in the cases where the worker thread says die_horribly(), but
 not in the cases where the worker says exit_horribly().

Hm, I'm not calling the error handler from exit_horribly because it
doesn't have the AH. It looks like the code assumes that
die_horribly() is called whenever AH is available; if not,
exit_horribly() should be called, which eventually calls the
preregistered exit hooks via exit_nicely() to clean up the connection.

I think we should somehow unify both functions; the code is not very
consistent in this respect and also calls exit_horribly() when it does
have AH available. See for example pg_backup_tar.c.

Or is there another distinction between them? The question of how to
clean this up basically brings us back to the question of what to do
about global variables in general, and about error handlers in
particular.


 Or we change fmtQualifiedId to take an int and then we always pass the
 archive version instead of the Archive* ?

 Hmm, I think that might make sense.

Done.


 +enum escrow_action { GET, SET };
 +static void
 +parallel_error_handler_escrow_data(enum escrow_action act, ParallelState *pstate)
 +{
 +       static ParallelState *s_pstate = NULL;
 +
 +       if (act == SET)
 +               s_pstate = pstate;
 +       else
 +               *pstate = *s_pstate;
 +}

 This seems like a mighty complicated way to implement a global variable.

 Well, we talked about that before, when you complained that you
 couldn't get rid of the global g_conn because of the exit handler.
 You're right that in fact it is an indirect global variable here but
 it's clearly limited to the use of the error handler and you can be
 sure that nobody other than this function writes to it or accesses it
 without calling this function.

 Sure, but since all the function does is write to it or access it,
 what good does that do me?

It encapsulates the variable so that it can only be used for one
specific use case.

Attaching a new version.


parallel_pg_dump_4.diff.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers