Re: [HACKERS] PL/Perl Returned Array

2011-08-13 Thread Peter Eisentraut
On fre, 2011-08-12 at 20:09 -0700, Darren Duncan wrote:
 David E. Wheeler wrote:
  On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote:
  Anyway, the attached patch fixes it for me. That is when we don't have
  an array state, just return an empty array.  (Also adds some
  additional comments)
  
  Fix confirmed, thank you!
  
  +1 to getting this committed before the next beta/RC.
 
 Policy question.  If this problem hadn't been detected until after 9.1 was 
 declared production, would it have been considered a bug to fix in 9.1.x or 
 would it have been delayed to 9.2 since that fix might be considered an 
 incompatibility?

Surely this would be a bug fix.
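
For context, the case under discussion is roughly a PL/Perl function that
returns an empty Perl array. A hedged sketch with made-up names, not the
original reporter's function:

CREATE OR REPLACE FUNCTION plperl_empty_array() RETURNS integer[]
LANGUAGE plperl AS $$
    return [];    # an empty Perl array should map to the empty SQL array
$$;

SELECT plperl_empty_array();    -- expected result: {}

With the patch described above, the no-array-state case simply yields '{}'.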




Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-13 Thread Dean Rasheed
On 12 August 2011 23:19, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Triggers complicate this. I believe it is only safe to group tuples
 together like this if the table has no triggers. A BEFORE ROW trigger
 might run a SELECT on the table being copied to, and check if some of
 the tuples we're about to insert exist. If we run BEFORE ROW triggers
 for a bunch of tuples first, and only then insert them, none of the
 trigger invocations will see the other rows as inserted yet. Similarly,
 if we run AFTER ROW triggers after inserting a bunch of tuples, the
 trigger for each of the insertions would see all the inserted rows. So
 at least for now, the patch simply falls back to inserting one row at a
 time if there are any triggers on the table.

I guess DEFAULT values could also suffer from a similar problem to
BEFORE ROW triggers though (in theory a DEFAULT could be based on a
function that SELECTs from the table being copied to).
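
For illustration, a hypothetical DEFAULT of that kind (names invented for
the example):

CREATE TABLE items (
    id  serial PRIMARY KEY,
    val text,
    seq integer
);

-- A DEFAULT expression whose function SELECTs from the table being copied to.
CREATE FUNCTION next_seq() RETURNS integer LANGUAGE sql AS
$$ SELECT coalesce(max(seq), 0) + 1 FROM items $$;

ALTER TABLE items ALTER COLUMN seq SET DEFAULT next_seq();

COPYing into items would then evaluate next_seq() per row, so each row's
default depends on the rows already inserted; batching the inserts would
change what each call sees.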

Regards,
Dean



Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-13 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 With HOT, there is very little need to perform a VACUUM FULL on any
 shared catalog table. Look at the indexes...

This is not really a useful argument.  People do do VAC FULL on
catalogs, whether we think they should or not.  Also, it's not only
shared catalogs that are at stake.

We could avoid fixing the bug if we forbade both VAC FULL and CLUSTER
on all system catalogs, period, no exceptions.  That doesn't seem like
a workable answer though.  Some usage patterns do see bloat on the
catalogs, especially if you disable or dial down autovacuum.
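
For concreteness, the kind of manual catalog maintenance people run today
looks something like this (illustrative commands, arbitrary table choice):

-- Check how large a frequently-churned catalog has become, then compact it.
SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute'));
VACUUM FULL pg_catalog.pg_attribute;
-- CLUSTER on a catalog is the other affected case, e.g.:
-- CLUSTER pg_catalog.pg_attribute USING pg_attribute_relid_attnum_index;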

 In the unlikely event we do actually have to VACUUM FULL a shared
 catalog table, nuke any cache entry for the whole shared catalog.

You're still not getting the point.  We *do* nuke all cache entries
after a VAC FULL.  That does not avoid this bug.

regards, tom lane



Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-13 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 The patch is WIP, mainly because I didn't write the WAL replay routines 
 yet, but please let me know if you see any issues.

Why do you need new WAL replay routines?  Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation.  That will likely have
bad consequences for the performance of every other operation.

regards, tom lane



Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-13 Thread Tom Lane
I wrote:
 Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
 to invalidate catcaches instead of recording individual TIDs, I'm
 excited about doing that instead for 9.2 and beyond.  I believe that
 that could result in a significant reduction in sinval traffic, which
 would be a considerable performance win.

On closer inspection this idea doesn't seem workable.  I was imagining
that after a transaction commits, we could find obsoleted catcache
entries by looking for tuples with xmax equal to the transaction's XID.
But a catcache entry made before the transaction had done the update
wouldn't contain the correct xmax, so we'd fail to remove it.  The only
apparent way to fix that would be to go out to disk and consult the
current on-disk xmax, which would hardly be any cheaper than just
dropping the cache entry and then reloading it when/if needed.

All is not entirely lost, however: there's still some possible
performance benefit to be gained here, if we go to the scheme of
identifying victim catcache entries by hashvalue only.  Currently,
each heap_update in a cached catalog has to issue two sinval messages
(per cache!): one against the old TID and one against the new TID.
We'd be able to reduce that to one message in the common case where the
hashvalue remains the same because the cache key columns didn't change.

Another thing we could consider doing, if one-in-2^32 hash collisions
seems too high, is widening the hash values to 64 bits.  I'm not
terribly excited about that, because a lot of the caches are on OID
columns for which there'd be zero benefit.

regards, tom lane



Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-08-13 Thread Peter Eisentraut
On fre, 2011-08-12 at 16:14 -0400, Robert Haas wrote:
 On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut pete...@gmx.net wrote:
  A table is either a base table, a derived table, a transient table, or
  a viewed table. (SQL/MED adds foreign table.)
 
  Just FYI.
 
 Base table seems clear enough, and a transient table sounds like a
 temporary table, but what is a derived table?  Is a viewed table a
 view?

A base table is either a permanent base table or (one of various kinds
of) a temporary base table.  A derived table is the result of a table
expression, so this is more of a notional syntactic term.  A transient
table is, roughly speaking, OLD or NEW in a trigger.  A viewed table is
a view.
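
An informal mapping of those terms onto PostgreSQL syntax (example names are
made up):

CREATE TABLE t (a int);                          -- permanent base table
CREATE TEMP TABLE tt (a int);                    -- temporary base table
SELECT * FROM (SELECT a, a * 2 AS b FROM t) d;   -- "d" is a derived table
CREATE VIEW v AS SELECT a FROM t;                -- viewed table
-- transient tables correspond roughly to the OLD/NEW row sets seen by triggers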





Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-13 Thread Robert Haas
On Sat, Aug 13, 2011 at 12:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
 to invalidate catcaches instead of recording individual TIDs, I'm
 excited about doing that instead for 9.2 and beyond.  I believe that
 that could result in a significant reduction in sinval traffic, which
 would be a considerable performance win.

 On closer inspection this idea doesn't seem workable.  I was imagining
 that after a transaction commits, we could find obsoleted catcache
 entries by looking for tuples with xmax equal to the transaction's XID.
 But a catcache entry made before the transaction had done the update
 wouldn't contain the correct xmax, so we'd fail to remove it.  The only
 apparent way to fix that would be to go out to disk and consult the
 current on-disk xmax, which would hardly be any cheaper than just
 dropping the cache entry and then reloading it when/if needed.

 All is not entirely lost, however: there's still some possible
 performance benefit to be gained here, if we go to the scheme of
 identifying victim catcache entries by hashvalue only.  Currently,
 each heap_update in a cached catalog has to issue two sinval messages
 (per cache!): one against the old TID and one against the new TID.
 We'd be able to reduce that to one message in the common case where the
 hashvalue remains the same because the cache key columns didn't change.

Cool.

 Another thing we could consider doing, if one-in-2^32 hash collisions
 seems too high, is widening the hash values to 64 bits.  I'm not
 terribly excited about that, because a lot of the caches are on OID
 columns for which there'd be zero benefit.

Yeah, and I can't get excited about the increased memory usage, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] index-only scans

2011-08-13 Thread Robert Haas
On Fri, Aug 12, 2011 at 5:39 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 That's one of the points I asked for feedback on in my original
 email.  How should the costing be done?

 It seems pretty clear that there should be some cost adjustment.  If
 you can't get good numbers somehow on what fraction of the heap
 accesses will be needed, I would suggest using a magic number based
 on the assumption that half the heap access otherwise necessary will
 be done.  It wouldn't be the worst magic number in the planner.  Of
 course, real numbers are always better if you can get them.

It wouldn't be that difficult (I think) to make VACUUM and/or ANALYZE
gather some statistics; what I'm worried about is that we'd have
correlation problems.  Consider a wide table with an index on (id,
name), and the query:

SELECT name FROM tab WHERE id = 12345

Now, suppose that we know that 50% of the heap pages have their
visibility map bits set.  What's the chance that this query won't need
a heap fetch?  Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated.  However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?
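
For what it's worth, the naive costing under that uniform-access assumption
would be something like this (my formulation, not a settled proposal):

E[\text{heap fetches}] \approx \text{tuples\_fetched} \times (1 - f_{\text{all-visible}})

So with f = 0.5, half the index probes get charged a heap visit; the concern
above is precisely that recently modified (not-all-visible) rows are queried
far more often than f suggests.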

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-08-13 Thread Robert Haas
On Sat, Aug 13, 2011 at 1:56 PM, Peter Eisentraut pete...@gmx.net wrote:
 On fre, 2011-08-12 at 16:14 -0400, Robert Haas wrote:
 On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut pete...@gmx.net wrote:
  A table is either a base table, a derived table, a transient table, or
  a viewed table. (SQL/MED adds foreign table.)
 
  Just FYI.

 Base table seems clear enough, and a transient table sounds like a
 temporary table, but what is a derived table?  Is a viewed table a
 view?

 A base table is either a permanent base table or (one of various kinds
 of) a temporary base table.  A derived table is the result of a table
 expression, so this is more of a notional syntactic term.  A transient
 table is, roughly speaking, OLD or NEW in a trigger.  A viewed table is
 a view.

Ah.

Well, I guess I'm somewhat open to the idea of making all the places
where we mean, specifically, a table say base table.  Then we could
substitute the term table where we now say relation.

On the other hand, I am also not entirely sure such a change in
terminology would be a net improvement in clarity, even though it does
seem better in some cases.  For example, the CREATE TABLE command does
not create a viewed table; nor is there any CREATE VIEWED TABLE
command.  On the flip side, for something like CLUSTER, "ERROR: %s is
not a base table" does seem like it could be more clear than just
"ERROR: %s is not a table".

So I'm not sure what to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Robert Haas
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:
 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

 I don't disagree with the general thrust of your point, but I just
 wanted to point out one case where not using free buffers even though
 they're available might make sense:

 If you execute a large batch delete or update or even just set lots of
 hint bits you'll dirty a lot of buffers. The ring buffer forces the
 query that is actually dirtying all these buffers to also do the i/o
 to write them out. Otherwise you leave them behind to slow down other
 queries. This was one of the problems with the old vacuum code which
 the ring buffer replaced. It left behind lots of dirtied buffers for
 other queries to do i/o on.

Interesting point.

After thinking about this some more, I'm coming around to the idea
that we need to distinguish between:

1. Ensuring a sufficient supply of evictable buffers, and
2. Evicting a buffer.

The second obviously needs to be done only when needed, but the first
one should really be done as background work.  Currently, the clock
sweep serves both functions, and that's not good.  We shouldn't ever
let ourselves get to the point where there are no buffers at all with
reference count zero, so that the next guy who needs a buffer has to
spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be
done as a background task; and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-08-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On the other hand, I am also not entirely sure such a change in
 terminology would be a net improvement in clarity, even though it does
 seem better in some cases.  For example, the CREATE TABLE command does
 not create a viewed table; nor is there any CREATE VIEWED TABLE
 command.  On the flip side, for something like CLUSTER, "ERROR: %s is
 not a base table" does seem like it could be more clear than just
 "ERROR: %s is not a table".

 So I'm not sure what to do.

Yeah.  Even if the standard is consistent about using base table
versus just table when they mean a plain table, I do not believe that
that distinction is commonly understood among users.  I think people
would tend to think that base table means some subset of plain tables,
and would get confused.  One fairly likely interpretation would be that
base table means the parent table of an inheritance tree, for
instance.  If we have to stop and specify what we mean by base table
every other time we use the phrase, we have achieved nothing except
pedantry.

On the whole I prefer table for plain tables and relation for
everything table-ish.  We can quibble about whether indexes, say, are
relations within that meaning ... but the PG code and docs have long
used relation in the more general meaning, and I doubt that we'll get
far if we try to redefine its meaning now.

regards, tom lane



Re: [HACKERS] index-only scans

2011-08-13 Thread Kääriäinen Anssi

Now, suppose that we know that 50% of the heap pages have their
visibility map bits set.  What's the chance that this query won't need
a heap fetch?  Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated.  However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?


The example is much more realistic if the query is a fetch of N latest rows 
from a table. Very common use case, and the whole relation's visibility 
statistics are completely wrong for that query. Wouldn't it be great if there 
was something like pg_stat_statements that would know the statistics per query, 
derived from usage...

Even if the statistics are not available per query, the statistics could be 
calculated from the relation usage: the weighted visibility of the pages would 
be pages_visible_when_read / total_pages_read for the relation. That percentage 
would minimize the average cost of the plans much better than just the 
non-weighted visibility percentage.

For the above example, if the usage is 90% read the N latest rows and we assume 
they are never visible, the weighted visibility percentage would be 10% while 
the non-weighted visibility percentage could be 90%. Even if the visibility 
percentage would be incorrect for the queries reading old rows, by definition 
of the weighted visibility percentage there would not be too many of them.
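
Spelling out the arithmetic behind that example (same assumptions as above,
numbers mine):

p_{\text{weighted}} = \frac{\text{pages visible when read}}{\text{total pages read}} \approx 0.9 \cdot 0 + 0.1 \cdot 1 = 0.10

while the unweighted fraction of all-visible pages could still be around 0.90.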

The same idea could of course be used to calculate the effective cache hit 
ratio for each table. Cache hit ratio would have the problem of feedback loops, 
though.

Of course, keeping such statistic could be more expensive than the benefit it 
gives. On the other hand, page hit percentage is already available...

 - Anssi


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Greg Stark
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:

 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.

It does kind of seem like your numbers indicate we're missing part of
the picture though. The idea with the clock sweep algorithm is that
you keep approximately 1/nth of the buffers with each of the n values.
If we're allowing nearly all the buffers to reach a reference count of
5 then you're right that we've lost any information about which
buffers have been referenced most recently.

I wonder if we're supposed to be doing something like advancing the
clock hand each time we increment a reference count as well?


-- 
greg



Re: [HACKERS] index-only scans

2011-08-13 Thread Heikki Linnakangas

On 13.08.2011 23:35, Kääriäinen Anssi wrote:


Now, suppose that we know that 50% of the heap pages have their
visibility map bits set.  What's the chance that this query won't need
a heap fetch?  Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated.  However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?


The example is much more realistic if the query is a fetch of N latest rows 
from a table. Very common use case, and the whole relation's visibility 
statistics are completely wrong for that query.


That is somewhat compensated by the fact that tuples that are accessed 
more often are also more likely to be in cache. Fetching the heap tuple 
to check visibility is very cheap when the tuple is in cache.


I'm not sure how far that compensates it, though. I'm sure there's 
typically nevertheless a fairly wide range of pages that have been 
modified since the last vacuum, but not in cache anymore.



Wouldn't it be great if there was something like pg_stat_statements that would 
know the statistics per query, derived from usage...

Even if the statistics are not available per query, the statistics could be 
calculated from the relation usage: the weighted visibility of the pages would 
be pages_visible_when_read / total_pages_read for the relation. That percentage 
would minimize the average cost of the plans much better than just the 
non-weighted visibility percentage.

For the above example, if the usage is 90% read the N latest rows and we assume 
they are never visible, the weighted visibility percentage would be 10% while 
the non-weighted visibility percentage could be 90%. Even if the visibility 
percentage would be incorrect for the queries reading old rows, by definition 
of the weighted visibility percentage there would not be too many of them.

The same idea could of course be used to calculate the effective cache hit 
ratio for each table. Cache hit ratio would have the problem of feedback loops, 
though.


Yeah, I'm not excited about making the planner and statistics more 
dynamic. Feedback loops and plan instability are not fun. I think we 
should rather be more aggressive in setting the visibility map bits.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Further news on Clang - spurious warnings

2011-08-13 Thread Greg Stark
I realize it's a bit late to jump in here with the patch already having been
committed. But I think there's a point that was missed in the discussion.
One reason to do the test as Tom recommended is that the warning probably
indicates that the test as written was just going to be optimized away as
dead code. I think the cast to unsigned is the least likely idiom to be
optimized away whereas any of the formulations based on comparing the enum
with enum labels is quite likely to be.


[HACKERS] VACUUM FULL versus TOAST

2011-08-13 Thread Tom Lane
So I've gotten things fixed to the point where the regression tests seem
to not fall over when contending with concurrent vacuum full pg_class,
and now expanded the scope of the testing to all the system catalogs.
What's failing for me now is this chunk in opr_sanity:

*** 209,219 
  NOT p1.proisagg AND NOT p2.proisagg AND
  (p1.proargtypes[3] < p2.proargtypes[3])
  ORDER BY 1, 2;
!  proargtypes | proargtypes 
! -------------+-------------
!         1114 |        1184
! (1 row)
! 
  SELECT DISTINCT p1.proargtypes[4], p2.proargtypes[4]
  FROM pg_proc AS p1, pg_proc AS p2
  WHERE p1.oid != p2.oid AND
--- 209,215 
  NOT p1.proisagg AND NOT p2.proisagg AND
  (p1.proargtypes[3] < p2.proargtypes[3])
  ORDER BY 1, 2;
! ERROR:  missing chunk number 0 for toast value 23902886 in pg_toast_2619
  SELECT DISTINCT p1.proargtypes[4], p2.proargtypes[4]
  FROM pg_proc AS p1, pg_proc AS p2
  WHERE p1.oid != p2.oid AND

On investigation, this turns out to occur when the planner is trying to
fetch the value of a toasted attribute in a cached pg_statistic tuple,
and a concurrent vacuum full pg_statistic has just finished.  The
problem of course is that vacuum full reassigned all the toast item OIDs
in pg_statistic, so the one we have our hands on is no longer correct.
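
A hedged sketch of the kind of concurrent workload that can trip over this
(illustrative only; object names and sizes are made up):

-- Session 1: give pg_statistic a toasted entry, then keep planning against it.
CREATE TABLE stats_victim AS
    SELECT g AS id, repeat(md5(g::text), 100) AS payload
    FROM generate_series(1, 10000) g;
ANALYZE stats_victim;
-- run repeatedly; the planner fetches the (toasted) histogram for "payload":
EXPLAIN SELECT * FROM stats_victim WHERE payload > 'a';

-- Session 2: meanwhile, rewrite pg_statistic, reassigning its toast item OIDs.
VACUUM FULL pg_statistic;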

In general, *any* access to a potentially toasted attribute value in a
catcache entry is at risk here.  I don't think it's going to be
feasible, either from a notational or efficiency standpoint, to insist
that callers always re-lock the source catalog before fetching a
catcache entry from which we might wish to extract a potentially toasted
attribute.

I am thinking that the most reasonable solution is instead to fix VACUUM
FULL/CLUSTER so that they don't change existing toast item OIDs when
vacuuming a system catalog.  They already do some pretty ugly things to
avoid changing the toast table's OID in this case, and locking down the
item OIDs too doesn't seem that much harder.  (Though I've not actually
looked at the code yet...)

The main potential drawback here is that if any varlena items that had
not previously been toasted got toasted, they would require additional
OIDs to be assigned, possibly leading to a duplicate-OID failure.  This
should not happen unless somebody decides to play with the attstorage
properties of a system catalog, and I don't feel too bad about a small
possibility of VAC FULL failing after that.  (Note it should eventually
succeed if you keep trying, since the generated OIDs would keep
changing.)

Thoughts?

regards, tom lane



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Robert Haas
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about.  You just
put each guy into the linked list when his reference count goes down
to 0.  When you want to allocate a buffer, you pop somebody off.  If
his reference count is still 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim.

The problem with just running the clock sweep every time you need a
victim buffer is that you may have to pass over a large number of
unevictable buffers before you find someone you can actually kick out.
 That's work that should really be getting done in the background, not
at the absolute last moment when we discover, hey, we need a buffer.

 It does kind of seem like your numbers indicate we're missing part of
 the picture though. The idea with the clock sweep algorithm is that
 you keep approximately 1/nth of the buffers with each of the n values.
 If we're allowing nearly all the buffers to reach a reference count of
 5 then you're right that we've lost any information about which
 buffers have been referenced most recently.

It doesn't seem right to have 1/nth of the buffers at each of n values
because that might not match the actual workload.  For example, you
might have lightning-hot buffers that fill 50% of your available
cache.  Trying to artificially force some of those down to usage count
4 or 3 doesn't actually get you anywhere.  In fact, AFAICS, there's no
direct reason to care about how many buffers have usage count 1
vs. usage count 5.  What IS important for performance is that there
are enough with usage count 0.  Any other distinction is only useful
to the extent that it allows us to make a better decision about which
buffers we should push down to 0 next.

 I wonder if we're supposed to be doing something like advancing the
 clock hand each time we increment a reference count as well?

That precise thing seems far too expensive, but I agree that
something's missing.  Right now we decrease usage counts when the
clock hand moves (which is driven by buffer allocation), and we
increase them when buffers are pinned (which is driven by access to
already-resident buffers).  I feel like that's comparing apples and
oranges.  If some subset of shared_buffers is very hot relative to the
uncached pages, then everything's going to converge to 5 (or whatever
arbitrary maximum we care to set in lieu of 5).
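
As a concrete way to watch that convergence, the pg_buffercache contrib
module exposes the per-buffer usage counts (illustrative query):

-- Distribution of clock-sweep usage counts across shared_buffers;
-- a pile-up at the maximum (5) is the behavior described above.
SELECT usagecount, count(*)
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;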

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] VACUUM FULL versus TOAST

2011-08-13 Thread Heikki Linnakangas

On 14.08.2011 01:13, Tom Lane wrote:

On investigation, this turns out to occur when the planner is trying to
fetch the value of a toasted attribute in a cached pg_statistic tuple,
and a concurrent vacuum full pg_statistic has just finished.  The
problem of course is that vacuum full reassigned all the toast item OIDs
in pg_statistic, so the one we have our hands on is no longer correct.

In general, *any* access to a potentially toasted attribute value in a
catcache entry is at risk here.  I don't think it's going to be
feasible, either from a notational or efficiency standpoint, to insist
that callers always re-lock the source catalog before fetching a
catcache entry from which we might wish to extract a potentially toasted
attribute.

I am thinking that the most reasonable solution is instead to fix VACUUM
FULL/CLUSTER so that they don't change existing toast item OIDs when
vacuuming a system catalog.  They already do some pretty ugly things to
avoid changing the toast table's OID in this case, and locking down the
item OIDs too doesn't seem that much harder.  (Though I've not actually
looked at the code yet...)


How about detoasting all datums before caching them? It's surprising 
that a datum that is supposedly in a catalog cache actually needs disk 
access to use.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Ignore lost+found when checking if a directory is empty

2011-08-13 Thread Bruce Momjian
Tom Lane wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Jeff Davis's message of mar ago 09 16:03:26 -0400 2011:
  I think I agree with Peter here that it's not a very good idea, and I
  don't see a big upside. With tablespaces it seems to make a little bit
  more sense, but I'd still lean away from that idea.
 
  What if the init script tries to start postmaster before the filesystems
  are mounted?  ISTM requiring a subdir is a good sanity check that the
  system is ready to run.  Not creating stuff directly on the mountpoint
  ensures consistency.
 
 I went looking in the archives for previous discussions of this idea.
 Most of them seem to focus on tablespaces rather than the primary data
 directory, but the objections to doing it are pretty much the same

FYI, the 9.0+ code will create a subdirectory under the tablespace
directory named after the catversion number, and it doesn't check that
the directory is empty, particularly so pg_upgrade can do its magic.
So, I believe lost+found would work in such a case, but again, the
security issues are real.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] vacuum scheduling (was: index-only scans)

2011-08-13 Thread Robert Haas
On Sat, Aug 13, 2011 at 5:31 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Yeah, I'm not excited about making the planner and statistics more dynamic.
 Feedback loops and plan instability are not fun. I think we should rather be
 more aggressive in setting the visibility map bits.

As far as I can see, the only way we're really going to make any
headway setting visibility map bits more aggressively is to get
autovacuum to do it for us.  I mean, we could consider doing it when
we perform a HOT cleanup; and we could also consider having the
background writer do it when it evicts buffers.  But frankly, those
sorts of things are a drop in the bucket.  They're just not going to
occur frequently enough to really make a dent.  Maybe you could make
it work for the case where the working set fits in shared_buffers,
because then the background writer isn't doing anything anyway and so,
hey, why not set visibility map bits?  But you're not going to get
much of a performance win out of index-only scans on that workload
anyway.  If the working set does NOT fit in shared_buffers, you're
going to be evicting pages like crazy.  I can't really see us deciding
to dirty all of those pages and write and flush XLOG for them instead
of just booting them out.  Even if you have the background writer try
to do some of that, other backends that are just throwing things over
the fence to get foreground work done are going to run circles around
it, so it's going to take a long, long time to make much headway by
this route.

Assuming you believe the above analysis (which you may not, so feel
free to argue if you see something I'm missing), that leaves
autovacuum.  There are many things to like about running autovacuum
more aggressively, and even on insert-only tables.  Setting visibility
map bits isn't only useful for index-only scans: it also significantly
speeds up sequential scans that are not I/O-bound, because it enables
the visibility checks on the offending pages to be skipped.  On the
flip side, setting visibility map bits (and hint bits) has to be
considered a lower priority than vacuuming tables that contain dead
tuples, because bloat is a far, far bigger problem than a few unset
visibility map bits.  There are already situations where vacuum can't
keep up, especially with exciting values of vacuum_cost_delay, and I
don't think we can afford to stir any more work into that pot without
prioritizing it.

We already have this parameter called vacuum_cost_delay that is
intensively problematic.  If you set it too low, vacuum can consume
too much I/O bandwidth and you suffer a performance drop.  On the
other hand, if you set it too high (which is easy to do), autovacuum
can't keep up and you get bloat.  In the worst case, this sends the
system into a tailspin: as all the tables get bigger, it gets harder
and harder for vacuum to keep up.  What would be really nice is to
make this parameter self-tuning, at least to some degree.  We used to
run checkpoints at full speed, but we now have this nifty parameter
called checkpoint_completion_target that makes them run slowly, but
not so slowly that they don't complete fast enough.  It would be nice
to do the same thing for autovacuum.  If we can estimate that a table
will next need vacuuming in 37 minutes and that at the current
vacuum_cost_delay setting the currently-in-progress vacuum will finish
in 83 minutes (or 83 hours), we should realize that we have a problem
and vacuum faster.  Similarly, if we notice that tables are coming due
for vacuuming faster than autovacuum workers are processing them, we
should try to speed up.
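
One way to formalize that heuristic (my sketch, not part of any patch): with
T_due the estimated time until the table next needs vacuuming and T_finish
the projected completion time at the current cost delay,

\text{new\_cost\_delay} = \text{vacuum\_cost\_delay} \times \min\left(1, \frac{T_{\text{due}}}{T_{\text{finish}}}\right)

so the 37-minute/83-minute case above would scale the delay by about 0.45,
and the 83-hour case would effectively disable it.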

You can imagine trying to solve this problem by teaching autovacuum to
estimate, for every table in the cluster, the time remaining until the
next vacuum needs to start; and the time at which we will be able to
start the next few vacuums based on when we expect to have enough
autovacuum workers available.  We could then detect whether or not
we're keeping up, and adjust accordingly.  This would allow us, in
turn, to run vacuum with a higher cost delay by default, and let it
decide for itself when it needs to speed up.  And if we want vacuum to
do lower-priority tasks, such as set visibility map bits, we could
teach it to do so only when it's keeping up.

This all seems like a lot of work, but at the moment I'm not seeing
another way to deliver on the suggested goal of setting visibility
map bits more aggressively.  Does anyone else?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

2011-08-13 Thread Robert Haas
On Fri, Aug 12, 2011 at 7:19 PM, daveg da...@sonic.net wrote:
 This seems to be bug month for my client. Now they are seeing periods
 where all new connections fail immediately with the error:

   FATAL:  lock AccessShareLock on object 0/1260/0 is already held

 This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
 been up for months. It started happening sporadically a few days ago. It will
 do this for a period of several minutes to an hour and then go back to
 normal for hours or days.

 One complete failing session out of several hundred around that time:
 -
 2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection 
 received: host=op05.xxx port=34067
 2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: 
 user=apps database=c77
 2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on 
 object 0/1260/0 is already held
 --

 What can I do to help track this down?

I've seen that error (though not that exact fact pattern) caused by
bad RAM.  It's unclear to me what else could cause it.

In terms of debugging, it seems like it might be sensible to start by
injecting some debugging code that dumps out the contents of the LOCK
and LOCALLOCK structures at the point the error occurs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

2011-08-13 Thread daveg
On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
 On Fri, Aug 12, 2011 at 7:19 PM, daveg da...@sonic.net wrote:
  This seems to be bug month for my client. Now they are seeing periods
  where all new connections fail immediately with the error:
 
    FATAL:  lock AccessShareLock on object 0/1260/0 is already held
 
  This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
  been up for months. It started happening sporadically a few days ago. It will
  do this for a period of several minutes to an hour and then go back to
  normal for hours or days.
 
  One complete failing session out of several hundred around that time:
  -
  2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection 
  received: host=op05.xxx port=34067
  2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: 
  user=apps database=c77
  2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on 
  object 0/1260/0 is already held
  --
 
  What can I do to help track this down?
 
 I've seen that error (though not that exact fact pattern) caused by
 bad RAM.  It's unclear to me what else could cause it.

I'll look into that. I think it is only happening on one host, so that might
make sense. On the other hand, these are pretty fancy hosts, all ECC and that,
so I'd hope they would have squeaked about bad RAM.
 
 In terms of debugging, it seems like it might be sensible to start by
 injecting some debugging code that dumps out the contents of the LOCK
 and LOCALLOCK structures at the point the error occurs.

Hmm, we will update to 9.0 next week on these hosts, so I'll try to hold off
on this part at least until then.

-dg
 
-- 
David Gould   da...@sonic.net   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.
