Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Heikki Linnakangas

On 11.08.2011 23:30, Alexander Korotkov wrote:

On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


On 10.08.2011 22:44, Alexander Korotkov wrote:


Manual and readme updates.



Thanks, I'm reviewing these now.

Do we want to expose the level-step and buffersize parameters to users?
They've been useful during testing, but I'm thinking we should be able to
guess good enough values for them automatically, and just remove the
options. It's pretty much impossible for a user to tune them correctly, it
would require deep knowledge of the buffering algorithm.

I'm thinking that even when you explicitly turn buffering on, we should
still process the first 1 or so tuples with simple inserts. That way we
always have a sample of tuples to calculate the average tuple size from.
It's plausible that if the input data is ordered, looking at the first N
tuples will give a skewed sample, but I don't think there's much danger of
that in practice. Even if the data is ordered, the length of GiST tuples
shouldn't vary much.

What happens if we get the levelstep and pagesPerBuffer estimates wrong?
How sensitive is the algorithm to that? Or will we run out of memory? Would
it be feasible to adjust those in the middle of the index build, if we e.g.
exceed the estimated memory usage greatly?



I see the following risks.

For levelstep:
Too small: the IO benefit is not as great as it could be.
Too large:
   1) If the subtree doesn't fit in effective_cache_size, much more IO than
there should be (because of cache misses during buffer emptying)
   2) If the last pages of the buffers don't fit in maintenance_work_mem,
possible OOM


Hmm, we could avoid running out of memory if we used an LRU cache 
replacement policy on the buffer pages, instead of explicitly unloading 
the buffers. 1) would still apply, though.



For buffersize:
Too small: less IO benefit, because the buffer size is relatively small in
comparison with the sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) than is
possible with the same IO benefit.

Thereby I propose the following.
1) A too-large levelstep is the greatest risk. Let's use a pessimistic estimate
for it. The pessimistic estimate has the following logic:
the largest possible sub-tree follows from the maximal number of tuples per
page, which follows from the minimal tuple size.
Thereby, by always using the minimal tuple size in the levelstep calculation,
we exclude the greatest risks (see the sketch after this list).
2) The risks for buffersize are comparable and not too critical. That's why I
propose to use the size of the first 1 tuples for the estimate.
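
To make the pessimistic estimate concrete, here is a standalone sketch (the
names, the cache-fitting rule, and the example numbers are illustrative
assumptions, not the patch's actual code): the minimal tuple size gives the
maximal fan-out per page, and we keep the largest levelstep whose sub-tree
still fits in effective_cache_size.

#include <stdio.h>

/*
 * Sketch only: pessimistic levelstep estimate.  Using the minimal tuple
 * size yields the maximal fan-out, hence the largest possible sub-tree
 * for a given levelstep; we keep the largest levelstep whose sub-tree
 * still fits in the cache.
 */
static int
pessimistic_levelstep(double cache_pages, double page_size, double min_tuple_size)
{
    double  max_fanout = page_size / min_tuple_size;    /* max tuples per page */
    double  subtree_pages = 1.0;                        /* sub-tree of depth 0 */
    int     levelstep = 0;

    while (subtree_pages * max_fanout + 1.0 <= cache_pages)
    {
        subtree_pages = subtree_pages * max_fanout + 1.0;
        levelstep++;
    }
    return levelstep > 0 ? levelstep : 1;
}

int
main(void)
{
    /* e.g. 4 GB of effective cache, 8 kB pages, 40-byte minimal tuples */
    printf("levelstep = %d\n",
           pessimistic_levelstep(4.0 * 1024 * 1024 / 8.0, 8192.0, 40.0));
    return 0;
}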


Yep, sounds reasonable.

I think it would also be fairly simple to decrease levelstep and/or 
adjust buffersize on-the-fly. The trick would be in figuring out the 
heuristics on when to do that.


Another thing occurred to me while looking at the buffer emptying 
process: At the moment, we stop emptying after we've flushed 1/2 buffer 
size worth of tuples. The point of that is to avoid overfilling a 
lower-level buffer, in the case that the tuples we emptied all landed on 
the same lower-level buffer. Wouldn't it be fairly simple to detect that 
case explicitly, and stop the emptying process only if one of the 
lower-level buffers really fills up? That should be more efficient, as 
you would have to swap between different subtrees less often.
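
A minimal, self-contained sketch of that alternative stop condition (the
structures and the child-selection callback are placeholders, not the patch's
API): keep pushing tuples down and stop only when a receiving lower-level
buffer reports that it has filled.

#include <stdbool.h>

/* Placeholder buffer type, for illustration only. */
typedef struct ToyNodeBuffer
{
    int     ntuples;
    int     capacity;
} ToyNodeBuffer;

/* Push one tuple down; report whether the child buffer just filled up. */
static bool
toy_push_down(ToyNodeBuffer *child)
{
    child->ntuples++;
    return child->ntuples >= child->capacity;
}

/*
 * Empty a parent buffer into its children, stopping only when some
 * lower-level buffer really fills, rather than after a fixed
 * half-buffer quota.
 */
static void
toy_empty_buffer(ToyNodeBuffer *parent, ToyNodeBuffer **children,
                 int (*choose_child)(void))
{
    while (parent->ntuples > 0)
    {
        ToyNodeBuffer *child = children[choose_child()];

        parent->ntuples--;
        if (toy_push_down(child))
            break;              /* a lower-level buffer is now full */
    }
}

static int
toy_choose_child(void)
{
    return 0;                   /* stands in for the penalty-based choice */
}

int
main(void)
{
    ToyNodeBuffer parent = { 100, 100 };
    ToyNodeBuffer child = { 0, 40 };
    ToyNodeBuffer *children[] = { &child };

    toy_empty_buffer(&parent, children, toy_choose_child);
    return parent.ntuples == 60 ? 0 : 1;    /* stopped when the child filled */
}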


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 On
 the other hand, the buffer manager has *no problem at all* trashing
 the buffer arena if we're faulting in pages for an index scan rather
 than a sequential scan.  If you manage to get all of sample_data into
 memory (by running many copies of the above query in parallel, you can
 get each one to allocate its own ring buffer, and eventually pull in
 all the pages), and then run some query that probes an index which is
 too large to fit in shared_buffers, it cheerfully blows the whole
 sample_data table out without a second thought.  Had you sequentially
 scanned a big table, of course, there would be some protection, but an
 index scan can stomp all over everything with complete impunity.

That's a good observation and I think we should do this

* Make an IndexScan use a ring buffer once it has used 32 blocks. The
vast majority won't do that, so we avoid overhead on the common path.

* Make a BitmapIndexScan use a ring buffer when we know that the
index is larger than 32 blocks. (Ignore upper parts of the tree for that
calculation.) A sketch of the IndexScan case follows.
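
A rough sketch of what the IndexScan side could look like (the switch-over
bookkeeping and the struct are hypothetical; GetAccessStrategy() and
ReadBufferExtended() are the existing buffer-manager entry points, and
BAS_BULKREAD stands in for whatever ring type we would actually pick):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

#define INDEXSCAN_RING_THRESHOLD 32

/* Hypothetical per-scan bookkeeping; not an existing struct. */
typedef struct IndexScanRingState
{
    int                   blocks_used;   /* blocks touched by this scan so far */
    BufferAccessStrategy  strategy;      /* NULL until the threshold is crossed */
} IndexScanRingState;

/*
 * Read a block for an index scan, switching to a small ring buffer once
 * the scan has touched more than INDEXSCAN_RING_THRESHOLD blocks, so that
 * short scans (the vast majority) pay no extra cost.
 */
static Buffer
index_scan_read_block(Relation rel, BlockNumber blkno, IndexScanRingState *state)
{
    if (state->strategy == NULL &&
        ++state->blocks_used > INDEXSCAN_RING_THRESHOLD)
        state->strategy = GetAccessStrategy(BAS_BULKREAD);

    return ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                              state->strategy);
}

The BitmapIndexScan variant would just test the index size once, up front,
instead of counting blocks as it goes.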


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
 ------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
 (6 rows)

 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.

Also, your logic is a little off, since you're doing the scan on an
otherwise quiet machine soon after startup and there are some
completely unused buffers. That isn't the typical state of the buffer
cache and so spoiling the cache is not acceptable in the general case.
The above case doesn't suck because it carefully reserved parts of
shared buffers for other users when spoiling the cache would be of no
benefit to the user, so it worked great from my perspective.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpython crash

2011-08-12 Thread Jan Urbański
On 11/08/11 18:01, Jean-Baptiste Quenot wrote:
 Hi there,
 
 plpython crashes on me on various 64-bit Ubuntu hosts, see the gdb
 backtrace at: https://gist.github.com/1140005
 
 Do you believe there were recent bugfixes regarding PLyMapping_ToTuple()?
 
 This is PG 9.0.4 with HEAD of plpython taken in March 2011 and backported.
 
 Please tell me if you need more information.

Hi,

there were no changes to that area of plpython after March 2011.

Could you try to see if the error also appears if you run your app with
current PostgreSQL HEAD (both the server and plpython)?

Which Python version is that? You can get that info by running:

do $$ import sys; plpy.info(sys.version) $$ language plpythonu;

Could you try to extract a self-contained example of how to reproduce
it? If the bug only appears under your application's specific workload,
perhaps you could try running it with Postgres compiled with -O0,
because compiling with -O2 causes the gdb backtrace to show values as
optimised out and functions as inlined.

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Cédric Villemain
2011/8/12 Robert Haas robertmh...@gmail.com:
 On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 2011/8/11 Robert Haas robertmh...@gmail.com:
 Please find attached a patch implementing a basic version of
 index-only scans.  This patch is the work of my colleague Ibrar Ahmed
 and myself, and also incorporates some code from previous patches
 posted by Heikki Linnakangas.

 Great!

 Glad you like it...!

 Can this faux heap tuple be appended by the data from another index
 once it has been created ? ( if the query involves those 2 index)

 I don't see how to make that work.  In general, a query like SELECT
 a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we
 use a bitmap index scan on each followed by a bitmapand and then a
 bitmap heap scan.  However, this patch only touches the index-scan
 path, which only knows how to use one index for any given query.

I thought of something like that:  'select a,b from foo where a=1
order by b limit 100' (or: where a=1 and b now() )


 Actually, I can see a possible way to allow an index-only type
 optimization to be used for bitmap scans.  As you scan the index, any
 tuples that can be handled index-only get returned immediately; the
 rest are thrown into a bitmap.  Once you're done examining the index,
 you then do a bitmap heap scan to get the tuples that couldn't be
 handled index-only.  This seems like it might be our best hope for a
 fast count(*) type optimization, especially if you could combine it
 with some method of scanning the index in physical order rather than
 logical order.

IIRC some ideas around that have already been discussed, yes (optimizing bitmap scans).

Maybe a question that will tell me more about the feature's
limitations (if any):
will an index-only scan on a table with no visibility map bits set
cost more (in duration, IO, ...) than a normal index scan?


 But even that trick would only work with a single index.  I'm not sure
 there's any good way to assemble pieces of tuples from two different
 indexes, at least not efficiently.

okay.


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company




-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Development, Expertise and Training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 The general problem here is that we are not very smart about handling
 workloads with weak locality - i.e. the working set is larger than
 shared buffers.  If the working set fits in shared_buffers, we will
 keep it there, and it will be strongly protected against eviction.
 But if the working set *doesn't* fit in shared buffers, then we fall
 back on some not-very-smart heuristics to decide what to do: keep
 pages involved in index scans, ditch those involved in sequential
 scans.

 It seems to me that we need a smarter algorithm.  It seems that NetBSD
 has adopted the use of Clock-Pro, and there are some discussions of it
 being used in Linux as well, though I'm not sure whether that's
 actually (still?) the case.  Paper is here, although I find the
 description of the algorithm to be approximately clear as mud:

 http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

 One of the important notions which many of the algorithms in the
 literature seem to hit on in one way or another is that you mustn't go
 and decide that all of your buffers - or nearly all your buffers - are
 hot.  That's really equivalent to saying that none of your buffers are
 hot; and it's also pretty easy to see that such a scenario would be
 disastrous for our current algorithm, which - if everything in
 shared_buffers has usage-count 5 all the time - will have to clock
 sweep a long way each time it needs to allocate a buffer.  In fact,
 the more of your buffers want to be hot (and thus hard to evict), the
 fewer of them should actually be allowed to acquire that status.

This is something I've been actively working on. I was on the trail of
possible reasons that increasing the size of shared buffers would slow
down PostgreSQL.

The worst case behaviour of the current freelist code is that it can
take up to 5 * shared_buffers checks before identifying a victim
buffer. That occurs when we have a working set exactly matching size
of shared buffers. There are problems when large portions of shared
buffers are very popular, so that the free-ish buffers are a small
proportion of the total buffers - this case gets worse if the
distribution of buffer allocations is non-random so you get say 10
free-ish blocks together, followed by N-10 very popular ones. That
makes every 11th freelist request take ~O(shared_buffers). These
problems will show themselves as sporadic BufFreelistLock contention.
Sporadic is the problem here since it may make the contention hard to
measure in aggregate, yet with real impact on performance.

For us to benefit from increased shared_buffers we need to have an
algorithm that is O(k) in its worst case behaviour and O(1) in its
best case behaviour - the latter aspect is provided by a correctly
functioning bgwriter. At the moment, we do nothing to enforce O(k)
behaviour.

The following patch implements O(k) behaviour using a heuristic limit.
My first attempt was a fixed heuristic limit, but that was less
suitable because there are actually 2 requirements. The value of k
must be small to have a measurable impact on the smoothness of
StrategyGetBuffer(), but k must also be fairly large to ensure the
choice of victim is a relatively good one. So the way to balance those
conflicting objectives is to have a progressively increasing search
window. And just to repeat, k is not a % of shared buffers, which would
still give O(N) behaviour.
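
Roughly, that could behave like the following toy model of the clock sweep
(self-contained; the window-growth rule and the numbers are purely
illustrative, not the patch): cap the number of buffers inspected per
allocation, and widen the cap, while lowering the victim standard, only when
the bounded sweep comes up empty.

#include <stdio.h>

#define NBUFFERS 1024

static int usage_count[NBUFFERS];
static int next_victim = 0;

/*
 * Toy model of a clock sweep with a bounded, progressively growing search
 * window: start with a small window k, and if no victim is found within it,
 * double the window and accept higher usage counts.  Worst case is O(k_max),
 * not O(5 * NBUFFERS).
 */
static int
toy_get_victim(void)
{
    int window = 32;            /* initial k: cheap, smooth common case */
    int max_usage = 0;          /* first look only for "free-ish" buffers */

    for (;;)
    {
        int inspected;

        for (inspected = 0; inspected < window; inspected++)
        {
            int buf = next_victim;

            next_victim = (next_victim + 1) % NBUFFERS;
            if (usage_count[buf] <= max_usage)
                return buf;
            usage_count[buf]--;             /* normal clock-sweep decay */
        }
        window *= 2;                        /* progressively widen the window */
        if (max_usage < 5)
            max_usage++;                    /* ...and accept worse victims */
    }
}

int
main(void)
{
    int i;

    for (i = 0; i < NBUFFERS; i++)
        usage_count[i] = 5;                 /* pathological "everything is hot" case */
    printf("victim = %d\n", toy_get_victim());
    return 0;
}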

I've not been reading the literature, given the problems we had in
2004/5 regarding patents in this area. I also think that since we rely
on the underlying filesystem for caching that we don't have exactly
the same problem as other systems.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


freelist_ok.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Alexander Korotkov
On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 I think it would also be fairly simple to decrease levelstep and/or adjust
 buffersize on-the-fly. The trick would be in figuring out the heuristics on
 when to do that.

It would be simple to decrease levelstep to one of its divisors. It seems quite
hard to decrease it, for example, from 3 to 2. Also, it's pretty hard to
detect that a sub-tree actually doesn't fit in the cache. I don't see much
difficulty in tuning buffersize at runtime.


 Another thing occurred to me while looking at the buffer emptying process:
 At the moment, we stop emptying after we've flushed 1/2 buffer size worth of
 tuples. The point of that is to avoid overfilling a lower-level buffer, in
 the case that the tuples we emptied all landed on the same lower-level
 buffer. Wouldn't it be fairly simple to detect that case explicitly, and
 stop the emptying process only if one of the lower-level buffers really
 fills up? That should be more efficient, as you would have to swap between
 different subtrees less often.

Yes, it seems reasonable to me.

--
With best regards,
Alexander Korotkov.


[HACKERS] bgwriter and checkpoints

2011-08-12 Thread Simon Riggs
The bgwriter has been responsible for two main activities: incremental
page cleaning and checkpointing.

We've worked out the code to smooth checkpointing, but that now means
we have periods where we do both page cleaning and checkpointing, and
other times when we do just page cleaning. That means the processing
loops are complex and that makes latch based processing more difficult
to introduce.

Moreover, when we perform fsyncs at the end of a checkpoint we stop
doing page cleaning.

I propose to introduce a new process dedicated to checkpointing,
leaving the bgwriter process to perform page cleaning, if that is
requested. Note that on smaller systems the bgwriter process would not
even be required to exist, if page cleaning was not used.

This will simplify the code, so that checkpoint happens fairly cleanly
- and can then be extended in other ways. The checkpointer would be
the important process from a shutdown perspective.

It will also simplify the bgwriter, potentially allowing us to
consider that it performs other tasks, such as HOT vacuuming or hint
bit setting. Nothing added in first version, but this clears the way
to allow these things to be considered and prototyped.

So the likelihood is that this change will improve performance by itself on
large systems, but the main benefit is code simplicity, which is
required for a range of potential future work.

Are there objections to this work beginning?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter and checkpoints

2011-08-12 Thread Cédric Villemain
2011/8/12 Simon Riggs si...@2ndquadrant.com:
 The bgwriter has been responsible for two main activities: incremental
 page cleaning and checkpointing.

 We've worked out the code to smooth checkpointing, but that now means
 we have periods where we do both page cleaning and checkpointing, and
 other times when we do just page cleaning. That means the processing
 loops are complex and that makes latch based processing more difficult
 to introduce.

 Moreover, when we perform fsyncs at the end of a checkpoint we stop
 doing page cleaning.

 I propose to introduce a new process dedicated to checkpointing,
 leaving the bgwriter process to perform page cleaning, if that is
 requested. Note that on smaller systems the bgwriter process would not
 even be required to exist, if page cleaning was not used.

 This will simplify the code, so that checkpoint happens fairly cleanly
 - and can then be extended in other ways. The checkpointer would be
 the important process from a shutdown perspective.

 It will also simplify the bgwriter, potentially allowing us to
 consider that it performs other tasks, such as HOT vacuuming or hint
 bit setting. Nothing added in first version, but this clears the way
 to allow these things to be considered and prototyped.

 So the likelihood is that this change will improve performance by itself on
 large systems, but the main benefit is code simplicity, which is
 required for a range of potential future work.

 Are there objections to this work beginning?

No objections. (+100 for doing that)

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Development, Expertise and Training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpython crash

2011-08-12 Thread Jean-Baptiste Quenot
Here is the same with -O0:

https://gist.github.com/1140005

sys.version reports this:

INFO:  2.6.6 (r266:84292, Sep 15 2010, 16:41:53)
[GCC 4.4.5]
-- 
Jean-Baptiste Quenot

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs si...@2ndquadrant.com wrote:

 I've not been reading the literature, given the problems we had in
 2004/5 regarding patents in this area. I also think that since we rely
 on the underlying filesystem for caching that we don't have exactly
 the same problem as other systems.

Reading the link you gave, I'm unimpressed by ClockPro. The
measurements made are all in low numbers of MB and the results show
that Clock is roughly the same or better for large caches, but one might
read them multiple ways.

The zone of interest for us is above 1GB and the issue we care about
is contention, so even if we think ClockPro has a chance of improving
things we really don't have any measurements that support the cases we
care about.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sha1, sha2 functions into core?

2011-08-12 Thread Marko Kreen
On Thu, Aug 11, 2011 at 5:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Wed, Aug 10, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... which this approach would create, because digest() isn't restricted
 to just those algorithms.  I think it'd be better to just invent two
 new functions, which also avoids issues for applications that currently
 expect the digest functions to be installed in pgcrypto's schema.

 I would suggest digest() with a fixed list of algorithms: md5, sha1, sha2.

 The uncommon/obsolete algorithms that can be used
 from digest() if compiled with openssl are not something we
 need to worry about.  In fact we have never supported them,
 as no testing has been done.

 Hmm ... they may be untested by us, but I feel sure that if we remove
 that functionality from pgcrypto, *somebody* is gonna complain.

Well, if you are worried about that, you can duplicate current
pgcrypto behaviour - if postgres is compiled against openssl,
you get whatever algorithms are available in that particular
version of openssl.

My point was that giving such an open-ended list of algorithms
was a bad idea, but there is no problem keeping the old behaviour.

 I don't see anything much wrong with sha1(bytea/text) -> bytea.
 There's no law that says it has to work exactly like md5() does.

The problem is that the list of must-have algorithms is getting
quite long: md5, sha1, sha224, sha256, sha384, sha512,
plus at least 4 from the upcoming sha3.

Another problem is that generic hashes are a bad way to store passwords
- identical passwords will look identical, and it's easy to brute-force
passwords as the algorithms are very fast.

So the question is: is there an actual *good* reason to add each algorithm
separately, without a uniform API, to the core functions?

If the user requests are about storing passwords, and we want to make
that easier, then we should import crypt() as well, as that is the secure
way to do password storage.  Then md5(), md5_b() plus a bunch of
sha-s will look silly.
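
To make the contrast with plain digests concrete at the C level (glibc or
libxcrypt crypt(), link with -lcrypt; the salts below are made up), the same
password hashed with two different salts yields two different strings, which
is exactly what a bare md5()/sha1() of the password cannot give you:

#include <crypt.h>
#include <stdio.h>

int
main(void)
{
    const char *h;

    /*
     * Same password, two different (made-up) salts, two different hashes;
     * a bare digest such as md5('secret') is identical for every user
     * who picked that password.  crypt() reuses static storage, so print
     * each result before making the next call.
     */
    h = crypt("secret", "$1$aaaaaaaa");
    if (h)
        printf("%s\n", h);
    h = crypt("secret", "$1$bbbbbbbb");
    if (h)
        printf("%s\n", h);
    return 0;
}

pgcrypto's crypt()/gen_salt() expose the same scheme at the SQL level.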

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

I hadn't thought of that, but I think that's only true if the relation
won't fit in shared_buffers (or whatever portion of shared_buffers is
reasonably available, given the other activity on the machine).  In
this particular case, it's almost 20% faster if the relation is all in
shared_buffers; I tested it.  I think what's going on here is that
initscan() has a heuristic that tries to use a BufferAccessStrategy if
the relation is larger than 1/4 of shared_buffers.  That's probably a
pretty good heuristic in general, but in this case I have a relation
which just so happens to be 27.9% of shared_buffers but will still
fit.  As you say below, that may not be typical in real life, although
there are probably data warehousing systems where it's normal to have
only one big query running at a time.
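
For reference, that heuristic boils down to a single threshold test; here it
is restated as a self-contained toy (the function name and the standalone
framing are mine, and the numbers are just the 400MB-shared_buffers /
14286-block case from this thread):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy restatement of the initscan() heuristic: a bulk-read ring buffer is
 * used once the relation exceeds a quarter of shared_buffers.
 */
static bool
wants_ring_buffer(unsigned rel_blocks, unsigned shared_buffer_blocks)
{
    return rel_blocks > shared_buffer_blocks / 4;
}

int
main(void)
{
    unsigned shared_buffer_blocks = 400 * 1024 / 8;     /* 400MB of 8kB buffers */
    unsigned rel_blocks = 14286;                        /* ~27.9% of that */

    /* Prints "yes": the ring buffer kicks in even though the table would fit. */
    printf("ring buffer: %s\n",
           wants_ring_buffer(rel_blocks, shared_buffer_blocks) ? "yes" : "no");
    return 0;
}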

 Also, your logic is a little off, since you're doing the scan on an
 otherwise quiet machine soon after startup and there are some
 completely unused buffers. That isn't the typical state of the buffer
 cache and so spoiling the cache is not acceptable in the general case.
 The above case doesn't suck because it carefully reserved parts of
 shared buffers for other users when spoiling the cache would be of no
 benefit to the user, so it worked great from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 On
 the other hand, the buffer manager has *no problem at all* trashing
 the buffer arena if we're faulting in pages for an index scan rather
 than a sequential scan.  If you manage to get all of sample_data into
 memory (by running many copies of the above query in parallel, you can
 get each one to allocate its own ring buffer, and eventually pull in
 all the pages), and then run some query that probes an index which is
 too large to fit in shared_buffers, it cheerfully blows the whole
 sample_data table out without a second thought.  Had you sequentially
 scanned a big table, of course, there would be some protection, but an
 index scan can stomp all over everything with complete impunity.

 That's a good observation and I think we should do this

 * Make an IndexScan use a ring buffer once it has used 32 blocks. The
 vast majority won't do that, so we avoid overhead on the common path.

 * Make a BitmapIndexScan use a ring buffer when we know that the
 index is larger than 32 blocks. (Ignore upper parts of tree for that
 calc).

We'd need to think about what happens to the internal pages of the
btree, the leaf pages, and then the heap pages from the underlying
relation; those probably shouldn't all be treated the same.  Also, I
think the tricky part is figuring out when to apply the optimization
in the first place.  Once we decide we need a ring buffer, a very
small one (like 32 blocks) is probably the way to go.  But it will be
a loser to apply the optimization to data sets that would otherwise
have fit in shared_buffers.  This is a classic case of the LRU/MRU
problem.  You want to evict buffers in LRU fashion until the working
set gets larger than you can cache; and then you want to switch to MRU
to avoid uselessly caching pages that you'll never manage to revisit
before they're evicted.

The point of algorithms like Clock-Pro is to try to have the system
work that out on the fly, based on the actual workload, rather than
using heuristics.  I agree with you there's no convincing evidence
that Clock-Pro would be better for us; I mostly thought it was
interesting because it seems that the NetBSD and Linux guys find it
interesting, and they're managing much larger caches than the ones
we're dealing with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

 I hadn't thought of that, but I think that's only true if the relation
 won't fit in shared_buffers (or whatever portion of shared_buffers is
 reasonably available, given the other activity on the machine).  In
 this particular case, it's almost 20% faster if the relation is all in
 shared_buffers; I tested it.  I think what's going on here is that
 initscan() has a heuristic that tries to use a BufferAccessStrategy if
 the relation is larger than 1/4 of shared_buffers.  That's probably a
 pretty good heuristic in general, but in this case I have a relation
 which just so happens to be 27.9% of shared_buffers but will still
 fit.  As you say below, that may not be typical in real life, although
 there are probably data warehousing systems where it's normal to have
 only one big query running at a time.

I think there are reasonable arguments to make

* prefer_cache = off (default) | on, a table-level storage parameter;
=on will disable the use of BufferAccessStrategy

* make cache_spoil_threshold a parameter, with default 0.25

Considering the world of very large RAMs in which we now live, some
control of the above makes sense.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The worst case behaviour of the current freelist code is that it can
 take up to 5 * shared_buffers checks before identifying a victim
 buffer. That occurs when we have a working set exactly matching size
 of shared buffers. There are problems when large portions of shared
 buffers are very popular, so that the free-ish buffers are a small
 proportion of the total buffers - this case gets worse if the
 distribution of buffer allocations is non-random so you get say 10
 free-ish blocks together, followed by N-10 very popular ones. That
 makes every 11th freelist request take ~O(shared_buffers). These
 problems will show themselves as sporadic BufFreelistLock contention.
 Sporadic is the problem here since it may make the contention hard to
 measure in aggregate, yet with real impact on performance.

I've been thinking about this exact same set of problems.

 For us to benefit from increased shared_buffers we need to have an
 algorithm that is O(k) in its worst case behaviour and O(1) in its
 best case behaviour - the latter aspect is provided by a correctly
 functioning bgwriter. At the moment, we do nothing to enforce O(k)
 behaviour.

Check.

 The following patch implements O(k) behaviour using a heuristic limit.
 My first attempt was a fixed heuristic limit, but that was less
 suitable because there are actually 2 requirements. The value of k
 must be small to have a measurable impact on the smoothness of
 StrategyGetBuffer(), but k must also be fairly large to ensure the
 choice of victim is a relatively good one. So the way to balance those
 conflicting objectives is to have a progressively increasing search
 window. And just to repeat, k is not a % of shared buffers, which would
 still give O(N) behaviour.

This is a very interesting idea, and I think it has a lot of potential.

One additional thought I had is that right now I think it's far too
easy for buffers to get pushed all the way up to usage count 5,
because we bump the reference count every time the buffer is pinned,
which can easily happen many times per clock sweep.  I think we would
be better off having a used flag on each buffer.  Each pin sets the
used bit, but does nothing if it is already set.  The usage count is
only changed by the clock sweep, which does the following on each
buffer:

- If the used flag is set, then the flag is cleared and the usage
count increases by one to a maximum of 5.
- If the used flag is not set, then the usage count decreases by one.

That way, buffers can't get hot because of a fast flurry of access
followed by nothing - to get up to usage_count 5, they've actually got
to stay interesting for a period of time.  As a side effect, I believe
we could get rid of the spinlock that we currently take and release
for every buffer the clock sweep hits, because we could regard the
usage_count as protected by the BufFreelistLock rather than by the
buffer header spinlock; and the used flag doesn't require perfect
synchronization.  We'd only pin the buffer we actually plan to evict
(and could recheck the used flag before doing so, in case a memory
ordering effect made us miss the fact that it had been recently set).
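
Spelled out as a self-contained toy (arrays instead of buffer headers, and
all locking/memory-ordering detail deliberately omitted), those rules look
roughly like this:

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS    16
#define MAX_USAGE   5

static bool used[NBUFFERS];         /* set on every pin, cleared by the sweep */
static int  usage_count[NBUFFERS];  /* only ever changed by the sweep */
static int  sweep_hand = 0;

/* Pinning just sets the flag; a burst of pins no longer inflates the count. */
static void
pin_buffer(int buf)
{
    used[buf] = true;
}

/* The proposed sweep rules; returns a victim buffer. */
static int
clock_sweep(void)
{
    for (;;)
    {
        int     buf = sweep_hand;

        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        if (used[buf])
        {
            used[buf] = false;
            if (usage_count[buf] < MAX_USAGE)
                usage_count[buf]++;     /* stayed interesting across sweeps */
        }
        else if (usage_count[buf] > 0)
            usage_count[buf]--;         /* cooling off */
        else
            return buf;                 /* cold and unused: evict this one */
    }
}

int
main(void)
{
    int     i;
    int     victim;

    for (i = 0; i < 1000; i++)
        pin_buffer(0);                  /* a flurry of pins on one buffer... */
    victim = clock_sweep();
    /* ...still bumps its usage count only once per sweep pass */
    printf("victim = %d, usage_count[0] = %d\n", victim, usage_count[0]);
    return 0;
}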

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote:

  But it will be
 a loser to apply the optimization to data sets that would otherwise
 have fit in shared_buffers.

Spoiling the cache is a bad plan, even if it makes the current query faster.

I think we should make the optimisation stronger still and use
POSIX_FADV_DONTNEED on blocks that fall out of the ring buffer. Spoiling the
OS cache is a problem as well.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

 I hadn't thought of that, but I think that's only true if the relation
 won't fit in shared_buffers (or whatever portion of shared_buffers is
 reasonably available, given the other activity on the machine).  In
 this particular case, it's almost 20% faster if the relation is all in
 shared_buffers; I tested it.  I think what's going on here is that
 initscan() has a heuristic that tries to use a BufferAccessStrategy if
 the relation is larger than 1/4 of shared_buffers.  That's probably a
 pretty good heuristic in general, but in this case I have a relation
 which just so happens to be 27.9% of shared_buffers but will still
 fit.  As you say below, that may not be typical in real life, although
 there are probably data warehousing systems where it's normal to have
 only one big query running at a time.

 I think there are reasonable arguments to make

 * prefer_cache = off (default) | on a table level storage parameter,
 =on will disable the use of BufferAccessStrategy

 * make cache_spoil_threshold a parameter, with default 0.25

Yeah, something like that might make sense.  Of course, a completely
self-tuning system would be better, but I'm not sure we're going to
find one of those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote:
  But it will be
 a loser to apply the optimization to data sets that would otherwise
 have fit in shared_buffers.

 Spoiling the cache is a bad plan, even if it makes the current query faster.

 I think we should make the optimisation stronger still and use
 POSIX_FADV_DONTNEED on blocks that fall out of the ring buffer. Spoiling the
 OS cache is a problem as well.

That would probably be better for really big tables, but it will be
worse for those where the entire table fits in the OS cache.

Cache spoiling means you're putting things into the cache which
won't still be there the next time they're needed.  If the data in
question fits in cache (and I don't think we can regard that as
uncommon, particularly for the OS cache, which can be half a terabyte)
then you don't want the anti-spoiling logic to kick in.

One thing we could consider for large sequential scans is doing
POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more
time if the scan wraps).  That's basically throwing the problem of
whether or not to go LRU or MRU back on the OS, but the OS may well
have a better idea whether there's enough cache available to hold the
whole thing than we do.
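
For concreteness, the hint in question is just one call against the scan's
file descriptor (bare system interface shown below, not a proposed patch;
POSIX_FADV_DONTNEED is the analogous hint for the drop-what-falls-out-of-the-ring
idea upthread):

#define _XOPEN_SOURCE 600       /* for posix_fadvise() */

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("/etc/hosts", O_RDONLY);  /* any readable file will do */

    if (fd < 0)
        return 1;

    /*
     * Offset 0, length 0 means "to end of file": tell the kernel the whole
     * file will be read sequentially, and let it make its own LRU-vs-MRU
     * decision about how much of it to keep cached.
     */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) != 0)
        fprintf(stderr, "posix_fadvise failed\n");

    close(fd);
    return 0;
}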

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Oleg Bartunov

Robert,

I imagine we could store positional information in the GIN index and return
tuples in relevance order - instant full-text search!
Great work, guys!


Oleg

On Thu, 11 Aug 2011, Robert Haas wrote:


Please find attached a patch implementing a basic version of
index-only scans.  This patch is the work of my colleague Ibrar Ahmed
and myself, and also incorporates some code from previous patches
posted by Heikki Linnakangas.

I'm able to demonstrate a significant performance benefit from this
patch, even in a situation where it doesn't save any disk I/O, due to
reduced thrashing of shared_buffers.  Configuration settings:

max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB

Test setup:

pgbench -i -s 50
create table sample_data as select (random()*500)::int as aid,
repeat('a', 1000) as filler from generate_series(1,10);

Test queries:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

On my laptop, the first query executes in about 555 ms, while the
second one takes about 1125 ms.  Inspection via pg_buffercache reveals
that the second one thrashes shared_buffers something fierce, while
the first one does not.  You can actually get the time for the first
query down to about 450 ms if you can persuade PostgreSQL to cache the
entire sample_data table - which is difficult, due to the
BufferAccessStrategy stuff - and as soon as you run the second query,
it blows the table out of cache, so practically speaking you're not
going to get that faster time very often.  I expect that you could get
an even larger benefit from this type of query if you could avoid
actual disk I/O, rather than just buffer cache thrashing, but I
haven't come up with a suitable test case for that yet (volunteers?).

There are several things about this patch that probably need further
thought and work, or at least some discussion.

1. The way that nodeIndexscan.c builds up the faux heap tuple is
perhaps susceptible to improvement.  I thought about building a
virtual tuple, but then what do I do with an OID column, if I have
one?  Or maybe this should be done some other way altogether.

2. Suppose we scan one tuple on a not-all-visible page followed by 99
tuples on all-visible pages.  The code as written will hold the pin on
the first heap page for the entire scan.  As soon as we hit the end of
the scan or another tuple where we have to actually visit the page,
the old pin will be released, but until then we hold onto it.  This
isn't totally stupid: the next tuple that's on a not-all-visible page
could easily be on the same not-all-visible page we already have
pinned.  And in 99% of cases holding the pin for slightly longer is
probably completely harmless.  On the flip side, it could increase the
chances of interfering with VACUUM.  Then again, a non-index-only scan
would still interfere with the same VACUUM, so maybe we don't care.

3. The code in create_index_path() builds up a bitmapset of heap
attributes that get used for any purpose anywhere in the query, and
hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
index under consideration.  However, if it were somehow possible to
have the rel involved without using any attributes at all, we'd
rebuild the cache over and over, since it would never become non-NULL.
I don't think that can happen right now, but future planner changes
might make it possible.

4. There are a couple of cases that use index-only scans even though
the EXPLAIN output sort of makes it look like they shouldn't.  For
example, in the above queries, an index-only scan is chosen even
though the query does SELECT * from the table being scanned.  Even
though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
that the target list of an EXISTS query is in fact discarded, e.g.:

create or replace function error() returns int as $$begin select 1/0;
end$$ language plpgsql;
select * from pgbench_accounts a where exists (select error() from
pgbench_branches b where b.bid = a.aid); -- doesn't error out!

Along similar lines, COUNT(*) does not preclude an index-only scan,
because the * is apparently just window dressing.  You'll still get
just a seq-scan unless you have an indexable qual in the query
somewhere, because...

5. We haven't made any planner changes at all, not even for cost
estimation.  It is not clear to me what the right way to do cost
estimation here is.  It seems that it would be useful to know what
percentage of the table is all-visible at planning time, but even if
we did, there could (and probably often will) be correlation between
the pages the query is 

Re: [HACKERS] index-only scans

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 Can this faux heap tuple be appended by the data from another index
 once it has been created ? ( if the query involves those 2 index)

 I don't see how to make that work.  In general, a query like SELECT
 a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we
 use a bitmap index scan on each followed by a bitmapand and then a
 bitmap heap scan.  However, this patch only touches the index-scan
 path, which only knows how to use one index for any given query.

 I thought of something like that:  'select a,b from foo where a=1
 order by b limit 100' (or: where a=1 and b now() )

Well... PostgreSQL can only use the index on a or the index on b, not
both.  This patch doesn't change that.  I'm not trying to use indexes
in some completely new way; I'm just trying to make them faster by
optimizing away the heap access.

 Actually, I can see a possible way to allow an index-only type
 optimization to be used for bitmap scans.  As you scan the index, any
 tuples that can be handled index-only get returned immediately; the
 rest are thrown into a bitmap.  Once you're done examining the index,
 you then do a bitmap heap scan to get the tuples that couldn't be
 handled index-only.  This seems like it might be our best hope for a
 fast count(*) type optimization, especially if you could combine it
 with some method of scanning the index in physical order rather than
 logical order.

 IIRC we expose some ideas around that, yes. (optimizing bitmap)

 Maybe a question that will explain me more about the feature
 limitation (if any):
 Does an index-only scan used when the table has no vismap set will
 cost (in duration, IO, ...) more than a normal Index scan ?

Yeah, it'll do a bit of extra work - the btree AM will cough up the
tuple uselessly, and we'll check the visibility map, also uselessly.
Then we'll end up doing it the regular way anyhow.  I haven't measured
that effect yet; hopefully it's fairly small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter and checkpoints

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 7:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The bgwriter has been responsible for two main activities: incremental
 page cleaning and checkpointing.

 We've worked out the code to smooth checkpointing, but that now means
 we have periods where we do both page cleaning and checkpointing, and
 other times when we do just page cleaning. That means the processing
 loops are complex and that makes latch based processing more difficult
 to introduce.

I've thought about this before, and I think it's probably a good idea.
 At the very least that code needs some refactoring, and breaking it
into two completely separate processes may be the way to go.

One possible trouble spot is that this will have the effect of
increasing traffic on BgWriterCommLock.  Right now, page-cleaning
activity doesn't need to do that, because the bgwriter is both the
source of all the fsync requests and the guy who eventually executes
them, so it can just update its local state directly.  If the two
functions are separated, that gets a bit more complicated: the
checkpoint process, even when not checkpointing, will need to absorb
fsync requests every so often (maybe that could be driven off a latch,
so we wake it up when the request queue is 25% full, or something like
that?).  Or else the cleaning process will have to be in charge of
issuing the fsyncs and the checkpoint process will have to request
that it do so at the appropriate time.  Or maybe there's some third
solution I'm not thinking of.  Anyway, it may not be a big problem in
practice, just something I was worrying about...
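
The 25%-full nudge itself would be tiny; something like the following
fragment (the queue structure and field names are invented for illustration,
only SetLatch() mirrors the real latch API):

#include "postgres.h"

#include "storage/latch.h"

/* Invented structure, for illustration only. */
typedef struct ToyFsyncRequestQueue
{
    int     num_requests;       /* entries currently in the shared queue */
    int     max_requests;       /* queue capacity */
    Latch  *checkpointer_latch; /* owned by the checkpoint process */
} ToyFsyncRequestQueue;

/*
 * Called by a backend (or the cleaning process) after it has queued an
 * fsync request: wake the checkpointer once the queue is a quarter full,
 * so requests get absorbed long before the queue can fill up.
 */
static void
toy_note_fsync_request(ToyFsyncRequestQueue *queue)
{
    queue->num_requests++;
    if (queue->num_requests >= queue->max_requests / 4)
        SetLatch(queue->checkpointer_latch);
}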

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Cédric Villemain
2011/8/12 Robert Haas robertmh...@gmail.com:
 On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain
 cedric.villemain.deb...@gmail.com wrote:
 Can this faux heap tuple be appended by the data from another index
 once it has been created ? ( if the query involves those 2 index)

 I don't see how to make that work.  In general, a query like SELECT
 a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we
 use a bitmap index scan on each followed by a bitmapand and then a
 bitmap heap scan.  However, this patch only touches the index-scan
 path, which only knows how to use one index for any given query.

 I thought of something like that:  'select a,b from foo where a=1
 order by b limit 100' (or: where a=1 and b now() )

 Well... PostgreSQL can only use the index on a or the index on b, not
 both.  This patch doesn't change that.  I'm not trying to use indexes
 in some completely new way; I'm just trying to make them faster by
 optimizing away the heap access.

For this kind of plan :
Bitmap Heap Scan
  Recheck Cond
   BitmapAnd
  Bitmap Index Scan
  Bitmap Index Scan

It could prevent useless heap fetches during the Bitmap Heap Scan, couldn't it?


 Actually, I can see a possible way to allow an index-only type
 optimization to be used for bitmap scans.  As you scan the index, any
 tuples that can be handled index-only get returned immediately; the
 rest are thrown into a bitmap.  Once you're done examining the index,
 you then do a bitmap heap scan to get the tuples that couldn't be
 handled index-only.  This seems like it might be our best hope for a
 fast count(*) type optimization, especially if you could combine it
 with some method of scanning the index in physical order rather than
 logical order.

 IIRC we expose some ideas around that, yes. (optimizing bitmap)

 Maybe a question that will explain me more about the feature
 limitation (if any):
 Does an index-only scan used when the table has no vismap set will
 cost (in duration, IO, ...) more than a normal Index scan ?

 Yeah, it'll do a bit of extra work - the btree AM will cough up the
 tuple uselessly, and we'll check the visibility map, also uselessly.
 Then we'll end up doing it the regular way anyhow.  I haven't measured
 that effect yet; hopefully it's fairly small.

If it is small, or if we can reduce it until it is nearly absent,
then... why do we need to distinguish an Index Scan at all? (or an
Index-*Only* Scan, which isn't 100% 'only', btw)
It would then just be an internal optimisation of how we access/use an
index (a really good and promising one).

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Development, Expertise and Training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter and checkpoints

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 2:19 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 7:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The bgwriter has been responsible for two main activities: incremental
 page cleaning and checkpointing.

 We've worked out the code to smooth checkpointing, but that now means
 we have periods where we do both page cleaning and checkpointing, and
 other times when we do just page cleaning. That means the processing
 loops are complex and that makes latch based processing more difficult
 to introduce.

 I've thought about this before, and I think it's probably a good idea.
  At the very least that code needs some refactoring, and breaking it
 into two completely separate processes may be the way to go.

 One possible trouble spot is that this will have the effect of
 increasing traffic on BgWriterCommLock.  Right now, page-cleaning
 activity doesn't need to do that, because the bgwriter is both the
 source of all the fsync requests and the guy who eventually executes
 them, so it can just update its local state directly.  If the two
 functions are separated, that gets a bit more complicated: the
 checkpoint process, even when not checkpointing, will need to absorb
 fsync requests every so often (maybe that could be driven off a latch,
 so we wake it up when the request queue is 25% full, or something like
 that?).  Or else the cleaning process will have to be in charge of
 issuing the fsyncs and the checkpoint process will have to request
 that it do so at the appropriate time.  Or maybe there's some third
 solution I'm not thinking of.  Anyway, it may not be a big problem in
 practice, just something I was worrying about...

Yes, they would still need to talk. But the good news is that they
only actually need to talk once per checkpoint cycle so we can buffer
them to a certain extent in shared memory to remove the worst part of
such contention.

Checkpointing needs a little more time in its diary to receive those
messages than it has right now, so there's no easy route.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter and checkpoints

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 9:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Yes, they would still need to talk. But the good news is that they
 only actually need to talk once per checkpoint cycle so we can buffer
 them to a certain extent in shared memory to remove the worst part of
 such contention.

Yeah, some kind of special-purpose communication method between the
cleaning scan and the checkpoint process might help, if the lock
contention turns out to be a problem in practice.  Then again, maybe
I'm overthinking things: there's zero sign in any profiling I've done
that BgWriterCommLock is even mildly contended, so even worrying about
it at this point might be a waste of time.

 Checkpointing needs a little more time in its diary to receive those
 messages than it has right now, so there's no easy route.

Yeah.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-12 Thread Robert Haas
On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 [ new patch ]

Some random comments:

- It appears that the noFollowFight flag is really supposed to be
called noFollowRight.

- In gist_private.h you've written halt-filled where you really mean
half-filled.

- It seems like you're using reloptions to set parameters that are
only going to do anything at index creation time.  IIUC, BUFFERING,
LEVELSTEP and BUFFERSIZE have no permanent meaning for that index;
they're just used ephemerally while constructing it.  If we're going
to expose such things as options, maybe they should be GUCs, not
reloptions.

- Function names should begin with gist or some other, appropriate
prefix, especially if they are non-static.  decreasePathRefcount(),
getNodeBuffer(), relocateBuildBuffersOnSplit(), and
getNodeBufferBusySize() violate this rule, and it might be good to
change the static functions to follow it, too, just for consistency,
and to avoid renaming things if something that's currently static
later needs to be made non-static.

- validateBufferOption needs to use ereport(), not elog().

- This needs a bit of attention:

+   /* TODO: Write the WAL record */
+   if (RelationNeedsWAL(state->r))
+       recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
+                              dist, oldrlink, oldnsn, InvalidBuffer, true);
+   else
+       recptr = GetXLogRecPtrForTemp();
+

I don't think the comment matches the code, since gistXLogSplit() does
in fact call XLogInsert().  Also, you should probably move the
RelationNeedsWAL() test inside gistXLogSplit().  Otherwise, every
single caller of gistXLogSplit() is going to need to do the same
dance.

- In gistBufferingPlaceToPage, you've got a series of loops that look like this:

+   for (ptr = dist; ptr; ptr = ptr->next)

The first loop allocates a bunch of buffers.  The second loop sets up
downlink pointers.   Then there's some other code.  Then there's a
third loop, which adds items to each page in turn and sets up right
links.  Then there's a fourth loop, which marks all those buffers
dirty.  Then you write XLOG.  Then there's a fifth loop, which sets
all the LSNs and TLIs, and a sixth loop, which does
UnlockReleaseBuffer() on each valid buffer in the list.  All of this
seems like it could be simplified.  In particular, the third and
fourth loops can certainly be merged - you should set the dirty bit at
the same time you're adding items to the page.  And the fifth and
sixth loops can also be merged.  You certainly don't need to set all
the LSNs and TLIs before releasing any of the buffer locks & pins.
I'm not sure if there's any more merging that can be done than that,
but you might want to have a look.

I'm also wondering how long this linked list can be.  It's not good to
have large numbers of buffers locked for a long period of time.  At
the very least, some comments are in order here.

Another general comment about this function is that it seems like it
is backwards.  The overall flow of the function is:

if (is_split)
{
/* complicated stuff */
}
else
{
/* simple case */
}

It seems like it might be better to flip that around and do this:

if (!is_split)
{
/* simple case */
return result;
}
/* complicated stuff */

It's easier to read and avoids having the indentation get too deep.

- As I look at this more, I see that a lot of the logic in
gistBufferingBuildPlaceToPage is copied from gistplacetopage().  It
would be nice to move the common bits to common subroutines that both
functions can call, instead of duplicating the code.

- On a related note, gistBufferingBuildPlaceToPage needs to do
START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
sequence, as gistplacetopage() does.

- gistFindCorrectParent() seems to rely heavily on the assumption that
there's no concurrent activity going on in this index.  Otherwise,
it's got to be unsafe to release the buffer lock before using the
answer the function computes.  Some kind of comment seems like it
would be a good idea.

- On a more algorithmic note, I don't really understand why we attach
buffers to all pages on a level or none of them.  If it isn't
necessary to have buffers on every internal page in the tree, why do
we have them on every other level or every third level rather than,
say, creating them on the fly in whatever parts of the tree end up
busy?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 9:31 AM, Cédric Villemain
cedric.villemain.deb...@gmail.com wrote:
 Well... PostgreSQL can only use the index on a or the index on b, not
 both.  This patch doesn't change that.  I'm not trying to use indexes
 in some completely new way; I'm just trying to make them faster by
 optimizing away the heap access.

 For this kind of plan :
 Bitmap Heap Scan
  Recheck Cond
   BitmapAnd
      Bitmap Index Scan
      Bitmap Index Scan

 It may prevent useless heap fetches during the Bitmap Heap Scan, can't it?

The input to the bitmap heap scan is just a TID bitmap.  It's too late
to decide we want the index tuples at that point; we've forgotten
where they are, and even if we remembered it, it wouldn't necessarily be
any cheaper than checking the heap.  We could optimize this to avoid a
heap fetch in the case where we don't need any of the tuple data at
all, but that's going to be somewhat rare, I think.

 Actually, I can see a possible way to allow an index-only type
 optimization to be used for bitmap scans.  As you scan the index, any
 tuples that can be handled index-only get returned immediately; the
 rest are thrown into a bitmap.  Once you're done examining the index,
 you then do a bitmap heap scan to get the tuples that couldn't be
 handled index-only.  This seems like it might be our best hope for a
 fast count(*) type optimization, especially if you could combine it
 with some method of scanning the index in physical order rather than
 logical order.

 IIRC we expose some ideas around that, yes. (optimizing bitmap)

 Maybe a question that will help me understand the feature's
 limitations (if any):
 Will an index-only scan on a table with no visibility map bits set
 cost (in duration, IO, ...) more than a normal index scan?

 Yeah, it'll do a bit of extra work - the btree AM will cough up the
 tuple uselessly, and we'll check the visibility map, also uselessly.
 Then we'll end up doing it the regular way anyhow.  I haven't measured
 that effect yet; hopefully it's fairly small.

 If it is small, or if we can reduce it to be nearly absent,
 then... why do we need to distinguish an Index Scan at all? (or an
 Index-*Only* Scan, which isn't 100% 'Only', btw)
 It is then just an internal optimisation of how we can access/use an
 index. (a really good and promising one)

Right, that's kind of what I was going for.  But the overhead isn't
going to be exactly zero, so I think it makes sense to disable it in
the cases where it clearly can't work.  The question is really more
whether we need to get more fine-grained than that, and actually do a
cost-based comparison of an index-only scan vs. a regular index scan.
I hope not, but I haven't tested it yet.

One other thing to think about is that the choice to use an index-scan
is not free of externalities.  The index-only scan is hopefully faster
than touching all the heap pages, but even if it weren't, not touching
all of the heap pages potentially means avoiding evicting things from
shared_buffers that some *other* query might need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible Bug in pg_upgrade

2011-08-12 Thread Dave Byrne

On 08/11/2011 12:02 AM, Peter Eisentraut wrote:

On ons, 2011-08-10 at 18:53 -0400, Tom Lane wrote:

Dave Byrnedby...@mdb.com  writes:

Attached is a patch that skips orphaned temporary relations in pg_upgrade if they 
are lingering around. It works for 9.0 -> 9.1 upgrades; however, I wasn't able 
to tell when pg_class.relistemp was added, so if it was unavailable in versions 
prior to 9.0 an additional check will have to be added.

I'm inclined to think the correct fix is to revert the assumption that
the old and new databases contain exactly the same number of tables ...
that seems to have a lot of potential failure modes besides this one.

It's basically checking whether pg_dump -s worked.  That doesn't seem
like a good use of time.

If anyone has a suggestion for a better approach I'm happy to work on it 
and amend the patch. It certainly is a corner case but it bit me when 
preparing for the upcoming 9.1 release. I would imagine others will hit 
it as well.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Reworking the writing of WAL

2011-08-12 Thread Simon Riggs
I present a number of connected proposals

1. Earlier, I suggested that the sync rep code would allow us to
redesign the way we write WAL, using ideas from group commit. My
proposal is that when a backend needs to flush WAL to local disk
it will be added to a SHMQUEUE exactly the same as when we flush WAL
to sync standby. The WALWriter will be woken by latch and then perform
the actual work. When complete WALWriter will wake the queue in order,
so there is a natural group commit effect. The WAL queue will be
protected by a new lock WALFlushRequestLock, which should be much less
heavily contended than the way we do things now. Notably this approach
will mean that all waiters get woken quickly, without having to wait
for the queue of WALWriteLock requests to drain down, so commit will
be marginally quicker. On almost idle systems this will give very
nearly the same response time as having each backend write WAL
directly. On busy systems this will give optimal efficiency by having
WALWriter working in a very tight loop to perform the I/O instead of
queuing itself to get the WALWriteLock with all the other backends. It
will also allow piggybacking of commits even when WALInsertLock is not
available.

2. A further idea is to use the same queue to reduce contention on
accessing the ProcArray and Clog at end of transaction also. That
would not be part of the initial work, but I'd want to bear in mind
that possibility in the design stage at least if there were any
choices to make.

3. In addition, we will send the WAL to standby servers as soon as it
has been written, not flushed. As part of the chunk header the
WALSender would include the known WAL flush ptr. So we would be
sending WAL data to the standby ahead of it being flushed, but then
only applying data up to the flush ptr. This would mean we don't flush
WAL fully and then send it, we partially overlap those operations to
give us the option of saying we don't want to fsync remotely for
additional speed (DRBD 'B' mode).

4. I'm tempted by the thought to make backends write their commit
records but not flush them, which fits in with the above.

5. And we would finally get rid of the group commit parameters.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpython crash

2011-08-12 Thread Jan Urbański
On 12/08/11 13:55, Jean-Baptiste Quenot wrote:
 Here is the same with -O0:
 
 https://gist.github.com/1140005
 
 sys.version reports this:
 
 INFO:  2.6.6 (r266:84292, Sep 15 2010, 16:41:53)
 [GCC 4.4.5]

I'm still at a loss. Did you reproduce it with git HEAD? I see that the
query being executed is "select * from myfunc()"; would it be possible to
share the code of myfunc?

Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sha1, sha2 functions into core?

2011-08-12 Thread David E. Wheeler
On Aug 12, 2011, at 5:02 AM, Marko Kreen wrote:

 My point was that giving such open-ended list of algorithms
 was bad idea, but there is no problem keeping old behaviour.
 
 I don't see anything much wrong with sha1(bytea/text) -> bytea.
 There's no law that says it has to work exactly like md5() does.
 
 The problem is that list of must-have algorithms is getting
 quite long: md5, sha1, sha224, sha256, sha384, sha512,
 + at least 4 from upcoming sha3.

+1

I think some sort of digest() function that takes a parameter naming the 
algorithm would be the way to go. That's not to say that the existing named 
functions couldn't continue to exist -- md5() in core and sha1() in pgcrypto. But 
it sure seems to me like we ought to have just one function for digests (or 2, 
if we also have hexdigest()).

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworking the writing of WAL

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
 1. Earlier, I suggested that the sync rep code would allow us to
 redesign the way we write WAL, using ideas from group commit. My
 proposal is that when a backend needs to flush WAL to local disk
 it will be added to a SHMQUEUE exactly the same as when we flush WAL
 to sync standby. The WALWriter will be woken by latch and then perform
 the actual work. When complete WALWriter will wake the queue in order,
 so there is a natural group commit effect. The WAL queue will be
 protected by a new lock WALFlushRequestLock, which should be much less
 heavily contended than the way we do things now. Notably this approach
 will mean that all waiters get woken quickly, without having to wait
 for the queue of WALWriteLock requests to drain down, so commit will
 be marginally quicker. On almost idle systems this will give very
 nearly the same response time as having each backend write WAL
 directly. On busy systems this will give optimal efficiency by having
 WALWriter working in a very tight loop to perform the I/O instead of
 queuing itself to get the WALWriteLock with all the other backends. It
 will also allow piggybacking of commits even when WALInsertLock is not
 available.

I like the idea of putting all the backends that are waiting for xlog
flush on a SHM_QUEUE, and having a single process do the flush and
then wake them all up.  That seems like a promising approach, and
should avoid quite a bit of context-switching and spinlocking that
would otherwise be necessary.  However, I think it's possible that the
overhead in the single-client case might be pretty significant, and
I'm wondering whether we might be able to set things up so that
backends can flush their own WAL in the uncontended case.

What I'm imagining is something like this:

struct {
slock_t mutex;
XLogRecPtr CurrentFlushLSN;
XLogRecPtr HighestFlushLSN;
SHM_QUEUE WaitersForCurrentFlush;
SHM_QUEUE WaitersForNextFlush;
};

To flush, you first acquire the mutex.  If the CurrentFlushLSN is not
InvalidXLogRecPtr, then there's a flush in progress, and you add
yourself to either WaitersForCurrentFlush or WaitersForNextFlush,
depending on whether your LSN is lower or higher than CurrentFlushLSN.
 If you queue on WaitersForNextFlush you advance HighestFlushLSN to
the LSN you need flushed.  You then release the spinlock and sleep on
your semaphore.

But if you get the mutex and find that CurrentFlushLSN is InvalidXLogRecPtr,
then you know that no flush is in progress.  In that case, you set
CurrentFlushLSN to the maximum of the LSN you need flushed and
HighestFlushLSN and move all WaitersForNextFlush over to
WaitersForCurrentFlush.  You then release the spinlock and perform the
flush.  After doing so, you reacquire the spinlock, remove everyone
from WaitersForCurrentFlush, note whether there are any
WaitersForNextFlush, and release the spinlock.  If there were any
WaitersForNextFlush, you set the WAL writer latch.  You then wake up
anyone you removed from WaitersForCurrentFlush.

Every time the WAL writer latch is set, the WAL writer wakes up and
performs any needed flush, unless there's already one in progress.

This allows processes to flush their own WAL when there's no
contention, but as more contention develops the work moves to the WAL
writer which will then run in a tight loop, as in your proposal.
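
To make that state machine concrete, here is a toy, self-contained C
model of the protocol -- only a sketch, with simplifications: a pthread
mutex and condition variable stand in for the spinlock and the
per-backend semaphores, flush_to() stands in for the actual WAL
write+fsync, and there is no separate WAL writer, so whichever backend
finds no flush in progress does the I/O itself.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef uint64_t LSN;                   /* stand-in for XLogRecPtr */

static struct
{
    pthread_mutex_t mutex;              /* the "spinlock" */
    pthread_cond_t  flushed;            /* wakes waiters after each flush */
    LSN             CurrentFlushLSN;    /* 0 = no flush in progress */
    LSN             HighestFlushLSN;    /* highest LSN anyone asked for */
    LSN             FlushedUpTo;        /* everything <= this is on disk */
} walq = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 };

static void
flush_to(LSN lsn)
{
    (void) lsn;
    usleep(2000);                       /* pretend to write+fsync WAL */
}

static void
flush_wal(LSN lsn)                      /* what a backend's XLogFlush does */
{
    pthread_mutex_lock(&walq.mutex);
    while (walq.FlushedUpTo < lsn)
    {
        if (walq.CurrentFlushLSN == 0)
        {
            /* No flush in progress: become the flusher, covering every
             * LSN that queued waiters have asked for so far. */
            LSN target = (lsn > walq.HighestFlushLSN) ? lsn : walq.HighestFlushLSN;

            walq.CurrentFlushLSN = target;
            pthread_mutex_unlock(&walq.mutex);

            flush_to(target);           /* the I/O, without holding the lock */

            pthread_mutex_lock(&walq.mutex);
            walq.FlushedUpTo = target;
            walq.CurrentFlushLSN = 0;
            pthread_cond_broadcast(&walq.flushed);      /* group "commit" */
        }
        else
        {
            /* A flush is in progress: register for this or the next round. */
            if (lsn > walq.HighestFlushLSN)
                walq.HighestFlushLSN = lsn;
            pthread_cond_wait(&walq.flushed, &walq.mutex);
        }
    }
    pthread_mutex_unlock(&walq.mutex);
}

static void *
backend(void *arg)
{
    LSN lsn = (LSN) (uintptr_t) arg;

    flush_wal(lsn);
    printf("commit with lsn %llu is durable\n", (unsigned long long) lsn);
    return NULL;
}

int
main(void)
{
    pthread_t tid[8];
    int       i;

    for (i = 0; i < 8; i++)
        pthread_create(&tid[i], NULL, backend, (void *) (uintptr_t) (i + 1));
    for (i = 0; i < 8; i++)
        pthread_join(tid[i], NULL);
    return 0;
}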

 5. And we would finally get rid of the group commit parameters.

That would be great, and I think the performance will be quite a bit
better, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Tom Lane
I've been testing the problem reported by Dave Gould by running make
installcheck-parallel together with a tight loop of vacuum full
pg_class:

while true; do psql -c "vacuum full pg_class" regression; usleep 10; done

Even after fixing the cache-reset-recovery-order problem I described
yesterday, there are still occasional bizarre failures, for example
in this bit of truncate.sql:

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE trunc_f;
SELECT * FROM trunc_f;
ROLLBACK;

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE ONLY trunc_f;
ERROR:  attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've
identified this bug at least.  In the first transaction above, the
TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode
value.  Say the update invalidates the tuple at tid A and inserts the
updated version at tid B.  Then vacuum full gets control and rewrites
pg_class.  It correctly keeps both versions of the trunc_f tuple, but
now they will be at new tid locations, say C and D.  The original
transaction meanwhile was blocked trying to re-open pg_class to reload
trunc_f's pg_class tuple into its catcaches.  Now it successfully does
that, and the tuple it reads has tid D.  It finishes out the
transaction, and rolls back as per script.  In the rollback, we know
that we have to flush catcache entries for tuple versions inserted by
the failed transaction ... but what inval.c has stored is an inval event
against pg_class tid B.  There is no such entry in the catcaches, so
nothing happens, and the entry with tid D stays valid.  Ooops.
The relcache entry for trunc_f does get rebuilt correctly (without
consulting the bogus catcache entry), so the next SELECT goes through
all right.  But when the second TRUNCATE wants to again update the
pg_class tuple, it finds it in the catcaches ... and what it finds has
tid D, so it attempts to heap_update that TID instead of the actually
live tuple at tid C.

In this particular case, the incorrectly targeted tuple is the
just-invalidated version of trunc_f, so heap_update sees it as
XMIN_INVALID and fails with "attempted to update invisible tuple".

The potential consequences are hugely worse than that, though: with a
bit more time between the two operations, it'd be possible for someone
else to reclaim the dead tuple and replace it with something else.
As long as the TID is live when we get to it, heap_update will blindly
replace it, whether or not it has anything to do with the tuple we think
we're replacing.  I suspect that some of the other bizarre cases I'm
seeing are artifacts of that type of outcome, but they're hard to
reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0,
since catcache inval has always been TID-based.  It turns out we were
saved by the fact that the old VACUUM FULL implementation would chicken
out and not do any tuple moving if it found any INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples.  The pre-9.0 versions of CLUSTER reject such
tuples too, so I don't think we need to do anything in those branches.

But in 9.0 and up, we have a problem.  So far I've thought of two possible
avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
FULL or CLUSTER just finished on a system catalog), enter that message
into the transaction's local inval list as well as executing it
immediately.  On transaction abort, redo the resulting cache flushes a
second time.  This would ensure we flushed any bad entries even though
they were logged with obsolete TIDs elsewhere in the inval list.
(We might need to do this on commit too, though right now I don't think
so.  Also, a cache reset event would a fortiori have to queue a similar
entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the
key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the
disadvantage that cache invals would become more likely to blow away
entries unnecessarily (because of chance hashvalue collisions), even
without any VACUUM FULL being done.  If we could make approach #1 work
reliably, it would result in more overhead during VACUUM FULL but less at
other times --- or at least we could hope so.  In an environment where
lots of sinval overflows and consequent resets happen, we might come out
behind due to doubling the number of catcache flushes forced by a reset
event.

Right at the moment I'm leaning to approach #2.  I wonder if anyone
sees it differently, or has an idea for a third approach?

BTW, going forward it might be interesting to think about invalidating
the catcaches based on xmin/xmax rather than specific TIDs.  That would
reduce the sinval traffic to one message per transaction that had
affected the catalogs, instead of one per TID.  But that clearly is not
going to lead to something we'd dare back-patch.

Re: [HACKERS] Reworking the writing of WAL

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs si...@2ndquadrant.com wrote:
 1. Earlier, I suggested that the sync rep code would allow us to
 redesign the way we write WAL, using ideas from group commit. My
 proposal is that when a backend needs to flush WAL to local disk
 it will be added to a SHMQUEUE exactly the same as when we flush WAL
 to sync standby. The WALWriter will be woken by latch and then perform
 the actual work. When complete WALWriter will wake the queue in order,
 so there is a natural group commit effect. The WAL queue will be
 protected by a new lock WALFlushRequestLock, which should be much less
 heavily contended than the way we do things now. Notably this approach
 will mean that all waiters get woken quickly, without having to wait
 for the queue of WALWriteLock requests to drain down, so commit will
 be marginally quicker. On almost idle systems this will give very
 nearly the same response time as having each backend write WAL
 directly. On busy systems this will give optimal efficiency by having
 WALWriter working in a very tight loop to perform the I/O instead of
 queuing itself to get the WALWriteLock with all the other backends. It
 will also allow piggybacking of commits even when WALInsertLock is not
 available.

 I like the idea of putting all the backends that are waiting for xlog
 flush on a SHM_QUEUE, and having a single process do the flush and
 then wake them all up.  That seems like a promising approach, and
 should avoid quite a bit of context-switching and spinlocking that
 would otherwise be necessary.

OK, good.

This work builds on Peter's latch-for-walwriter patch, so this won't
appear for a while yet, since it needs to happen after that. Hopefully
for Nov CF.

 However, I think it's possible that the
 overhead in the single-client case might be pretty significant, and
 I'm wondering whether we might be able to set things up so that
 backends can flush their own WAL in the uncontended case.

I think we should measure that and see. I don't think it will be that
bad, but I accept it might be wishful thinking on my part.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But in 9.0 and up, we have a problem.  So far I've thought of two possible
 avenues to fix it:

 1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
 FULL or CLUSTER just finished on a system catalog), enter that message
 into the transaction's local inval list as well as executing it
 immediately.  On transaction abort, redo the resulting cache flushes a
 second time.  This would ensure we flushed any bad entries even though
 they were logged with obsolete TIDs elsewhere in the inval list.
 (We might need to do this on commit too, though right now I don't think
 so.  Also, a cache reset event would a fortiori have to queue a similar
 entry to blow things away a second time at transaction end.)

 2. Forget about targeting catcache invals by TID, and instead just use the
 key hash value to determine which cache entries to drop.

 Approach #2 seems a lot less invasive and more trustworthy, but it has the
 disadvantage that cache invals would become more likely to blow away
 entries unnecessarily (because of chance hashvalue collisions), even
 without any VACUUM FULL being done.  If we could make approach #1 work
 reliably, it would result in more overhead during VACUUM FULL but less at
 other times --- or at least we could hope so.  In an environment where
 lots of sinval overflows and consequent resets happen, we might come out
 behind due to doubling the number of catcache flushes forced by a reset
 event.

 Right at the moment I'm leaning to approach #2.  I wonder if anyone
 sees it differently, or has an idea for a third approach?

I don't think it really matters whether we occasionally blow away an
entry unnecessarily due to a hash-value collision.  IIUC, we'd only
need to worry about hash-value collisions between rows in the same
catalog; and the number of entries that we have cached had better be
many orders of magnitude less than 2^32.  If the cache is large enough
that we're having hash value collisions more than once in a great
while, we probably should have flushed some entries out of it a whole
lot sooner and a whole lot more aggressively, because we're likely
eating memory like crazy.
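(For a rough sense of scale: with N cached rows of one catalog hashed
uniformly over 2^32 values, the expected number of colliding pairs is
about N*(N-1)/2^33, so even 10,000 cached entries would give only about
a 1% chance of a single collision anywhere in the cache.)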

On the other hand, having to duplicate sinval resets seems like a bad
thing.  So +1 for #2.

 BTW, going forward it might be interesting to think about invalidating
 the catcaches based on xmin/xmax rather than specific TIDs.  That would
 reduce the sinval traffic to one message per transaction that had
 affected the catalogs, instead of one per TID.  But that clearly is not
 going to lead to something we'd dare back-patch.

To be honest, I am not real keen on back-patching either of the
options you list above, either.  I think we should probably do
something about the problem Dave Gould reported (to wit: sinval resets
need to be smarter about the order they do things in), but I am not
sure it's a good idea to back-patch what seems like a fairly radical
overhaul of the sinval mechanism to fix a problem nobody's actually
complained about hitting in production, especially given there's no
certainty that this is the last bug.  Perhaps we should just fix this
one in master and consider back-patching it if and when we get some
plausibly related bug reports.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extra check in 9.0 exclusion constraint unintended consequences

2011-08-12 Thread Robert Haas
On Thu, Aug 11, 2011 at 2:25 PM, Jeff Davis pg...@j-davis.com wrote:
 On Thu, 2011-08-11 at 11:58 -0400, Robert Haas wrote:
 I'm OK with adding a note either to the 9.0 docs only (which means it
 might be missed by a 9.0 user who only looks at the current docs) or
 with adding a note to all versions mentioning the difference in
 behavior with 9.0, but I'm not really sure which way to go with it.
 Or we could just not do anything at all.  Anyone else have an opinion?

 It seems to be somewhat of a burden to carry a version-specific note
 indefinitely... more clutter than helpful. So I'd vote for just changing
 the 9.0 docs.

 Or, we could add the 9.0-specific note to 9.0 and 9.1 docs, but leave it
 out of 'master'. That way it sticks around for a while but we don't have
 to remember to remove it later.

Having thought about this a bit further, I'm coming around to the view
that if it isn't worth adding this in master, it's not worth adding at
all.  I just don't think it's going to get any visibility as a
back-branch only doc patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Change format of FDW options used in \d* commands (was: Re: [HACKERS] per-column FDW options, v5)

2011-08-12 Thread Robert Haas
2011/8/12 Shigeru Hanada shigeru.han...@gmail.com:
 (2011/08/12 1:05), Robert Haas wrote:
 On Thu, Aug 11, 2011 at 12:04 PM, Alvaro Herrera
 alvhe...@commandprompt.com  wrote:
 Excerpts from Robert Haas's message of jue ago 11 11:50:40 -0400 2011:
 2011/8/9 Shigeru Hanadashigeru.han...@gmail.com:
 I'd like to pick #3, and also change per-column options format.  In
 addition, I'd like to change options format for other FDW objects such
 as wrappers, servers and user mappings for consistency.  Of course, only
 if it's acceptable to break backward compatibility...

 I think it's fine to change the display format.  We haven't had these
 features for very long, so users hopefully shouldn't be expecting that
 everything is set in stone.  We have made far bigger changes to
 backslash commands that have been around for far longer (\df, I'm
 looking at you).

 We've never promised that backslash commands behave identically across
 releases.  I think they are more for human consumption than machine, so
 why would we care about changing one of them a bit?

 Yeah, I agree.

 Thanks for the comments.

 Attached patch changes various \d*+ commands to show FDW options in same
 format as OPTIONS clause.  IMHO the new format is easier to read.

 Example)
 old: {"delimiter=,","quote=\""}
 new: delimiter ',', quote '"'

 All regression tests including contrib's installcheck passed.

 I'll add this patch to CF app as new item.

IMHO, the new format should put parentheses around the options list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further news on Clang - spurious warnings

2011-08-12 Thread Peter Eisentraut
On fre, 2011-08-12 at 11:46 -0700, David E. Wheeler wrote:
 I figure there might be warnings you haven't seen if you haven't been
 building with bonjour, perl, openssl, pam, libxml, or ossp-uuid, so
 I've attached the output.

We have a build farm member (smew) that covers all of that except
bonjour, but I can confirm that there are no additional warnings on
account of that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further news on Clang - spurious warnings

2011-08-12 Thread David E. Wheeler
On Aug 12, 2011, at 12:09 PM, Peter Eisentraut wrote:

 I figure there might be warnings you haven't seen if you haven't been
 building with bonjour, perl, openssl, pam, libxml, or ossp-uuid, so
 I've attached the output.
 
 We have a build farm member (smew) that covers all of that except
 bonjour, but I can confirm that there are no additional warnings on
 account of that.

Ah, great, thanks!

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 2. Forget about targeting catcache invals by TID, and instead just use the
 key hash value to determine which cache entries to drop.

 Right at the moment I'm leaning to approach #2.  I wonder if anyone
 sees it differently, or has an idea for a third approach?

 To be honest, I am not real keen on back-patching either of the
 options you list above, either.  I think we should probably do
 something about the problem Dave Gould reported (to wit: sinval resets
 need to be smarter about the order they do things in), but I am not
 sure it's a good idea to back-patch what seems like a fairly radical
 overhaul of the sinval mechanism to fix a problem nobody's actually
 complained about hitting in production, especially given there's no
 certainty that this is the last bug.

What radical overhaul?  I'm talking about removing one if-test in
CatalogCacheIdInvalidate, viz

        nextelt = DLGetSucc(elt);

        if (hashValue != ct->hash_value)
            continue;           /* ignore non-matching hash values */

-       if (ct->negative ||
-           ItemPointerEquals(pointer, &ct->tuple.t_self))
        {
            if (ct->refcount > 0 ||
                (ct->c_list && ct->c_list->refcount > 0))
            {

In any case, it is now clear to me that this bug is capable of eating
peoples' databases, as in "what just happened to our most critical
table?  Uh, it's not there anymore, boss, but we seem to have duplicate
pg_class entries for this other table."  This may not have happened
yet in the field, but that's probably only because 9.0 hasn't been in
production that long.  Doing nothing about it in the released branches
is unacceptable IMO.  People that that happens to will not be satisfied
with a response like "you shouldn't have been using VACUUM FULL on
system catalogs, it's buggy."

For that matter, I'm not convinced that Gould hasn't seen some form of
this --- he's mentioned seeing bizarre errors above and beyond the
"could not find pg_class tuple for index 2662" one.

As for it not being the last bug, no, it's probably not.  That's a
pretty sucky excuse for not fixing it too.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sha1, sha2 functions into core?

2011-08-12 Thread Marko Kreen
On Thu, Aug 11, 2011 at 5:46 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Wed, Aug 10, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... which this approach would create, because digest() isn't restricted
 to just those algorithms.  I think it'd be better to just invent two
 new functions, which also avoids issues for applications that currently
 expect the digest functions to be installed in pgcrypto's schema.

 I would suggest digest() with fixed list of algorithms: md5, sha1, sha2.

 The uncommon/obsolete algorithms that can be used
 from digest() if compiled with openssl, are not something we
 need to worry over.  In fact we have never supported them,
 as no testing has been done.

 Hmm ... they may be untested by us, but I feel sure that if we remove
 that functionality from pgcrypto, *somebody* is gonna complain.

If you dont want to break digest() but do not want such behaviour in core,
we could go with hash(data, algo) that has fixed number of digests,
but also couple non-cryptographic hashes like crc32, lookup2/3.
This would also fix the problem of people using hashtext() in user code.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Perhaps we should just fix this one in master and consider
 back-patching it if and when we get some plausibly related bug
 reports.
 
I'm not completely clear on what one would do to be vulnerable to
hitting the bug, or what the impact of hitting it would be.  Tom
said:
 
 The potential consequences are hugely worse than that, though:
 with a bit more time between the two operations, it'd be possible
 for someone else to reclaim the dead tuple and replace it with
 something else.  As long as the TID is live when we get to it,
 heap_update will blindly replace it, whether or not it has
 anything to do with the tuple we think we're replacing.
 
What would a worst-case bug report based on hitting that look like?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Heikki Linnakangas
COPY is slow. Let's make it faster. One obvious optimization is to 
insert heap tuples in bigger chunks, instead of calling heap_insert() 
separately for every tuple. That saves the overhead of pinning and 
locking the buffer for every tuple, and you only need to write one WAL 
record for all the tuples written to the same page, instead of one for 
each tuple.


Attached is a WIP patch to do that. It adds a new function, 
heap_multi_insert, which does the same thing as heap_insert, but works 
in bulk. It takes an array of tuples as argument, and tries to cram as 
many of them into the chosen target page as it can, and only writes a 
single WAL record of the operation.
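
As a rough illustration of the intended calling pattern (a caller-side
sketch only; MAX_BUFFERED_TUPLES and the variable names are made up for
this example, they are not part of the patch):

HeapTuple   bufferedTuples[MAX_BUFFERED_TUPLES];
int         nBuffered = 0;

/* for each row read by COPY: */
bufferedTuples[nBuffered++] =
    heap_prepare_insert(rel, tuple, mycid, hi_options, bistate);

/* once the local array is full (or the input is exhausted): */
if (nBuffered == MAX_BUFFERED_TUPLES)
{
    int ndone = 0;

    while (ndone < nBuffered)
        ndone += heap_multi_insert(rel, bufferedTuples + ndone,
                                   nBuffered - ndone, hi_options, bistate);
    nBuffered = 0;
}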


This gives a significant speedup to COPY, particularly for narrow 
tables, with small tuples. Grouping multiple tuples into one WAL record 
reduces the WAL volume significantly, and the time spent in writing that 
WAL. The reduced overhead of repeatedly locking the buffer is also most 
noticeable on narrow tables. On wider tables, the effects are smaller. 
See copytest-results.txt, containing test results with three tables of 
different widths. The scripts used to get those numbers are also attached.


Triggers complicate this. I believe it is only safe to group tuples 
together like this if the table has no triggers. A BEFORE ROW trigger 
might run a SELECT on the table being copied to, and check if some of 
the tuples we're about to insert exist. If we run BEFORE ROW triggers 
for a bunch of tuples first, and only then insert them, none of the 
trigger invocations will see the other rows as inserted yet. Similarly, 
if we run AFTER ROW triggers after inserting a bunch of tuples, the 
trigger for each of the insertions would see all the inserted rows. So 
at least for now, the patch simply falls back to inserting one row at a 
time if there are any triggers on the table.


The patch is WIP, mainly because I didn't write the WAL replay routines 
yet, but please let me know if you see any issues.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9f1bcf1..0594332 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -24,6 +24,7 @@
  *		heap_getnext	- retrieve next tuple in scan
  *		heap_fetch		- retrieve tuple with given tid
  *		heap_insert		- insert tuple into a relation
+ *		heap_multi_insert - insert multiple tuples into a relation
  *		heap_delete		- delete a tuple from a relation
  *		heap_update		- replace a tuple in a relation with another tuple
  *		heap_markpos	- mark scan position
@@ -1860,11 +1861,39 @@ Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate)
 {
+	HeapTuple heaptup;
+
+	heaptup = heap_prepare_insert(relation, tup, cid, options, bistate);
+
+	heap_multi_insert(relation, &heaptup, 1, options, bistate);
+
+	/*
+	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
+	 * back to the caller's image, too.
+	 */
+	if (heaptup != tup)
+	{
+		tup->t_self = heaptup->t_self;
+		heap_freetuple(heaptup);
+	}
+
+	return HeapTupleGetOid(tup);
+}
+
+/*
+ * Prepare a tuple for insertion with heap_multi_insert. This sets the
+ * tuple header fields, assigns an OID, and toasts the tuple if necessary.
+ * Returns a toasted version of the tuple if it was toasted, or the original
+ * tuple if not.
+ *
+ * This needs to be called for each tuple before calling heap_multi_insert().
+ */
+HeapTuple
+heap_prepare_insert(Relation relation, HeapTuple tup, CommandId cid,
+	int options, BulkInsertState bistate)
+{
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple	heaptup;
-	Buffer		buffer;
-	Buffer		vmbuffer = InvalidBuffer;
-	bool		all_visible_cleared = false;
+	HeapTuple heaptup;
 
	if (relation->rd_rel->relhasoids)
 	{
@@ -1916,6 +1945,39 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	else
 		heaptup = tup;
 
+	return heaptup;
+}
+
+/*
+ * Inserts tuples to relation. This always inserts at least one tuple, and
+ * opportunistically more if the chosen target page happens to have room, up
+ * to ntuples. Returns the number of tuples inserted.
+ */
+int
+heap_multi_insert(Relation relation, HeapTuple *heaptuples, int ntuples,
+  int options, BulkInsertState bistate)
+{
+	HeapTuple	heaptup = heaptuples[0];
+	Buffer		buffer;
+	Buffer		vmbuffer = InvalidBuffer;
+	bool		all_visible_cleared = false;
+	int			i;
+	int			ndone;
+	char	   *scratch = NULL;
+	int			scratchused = 0;
+	Page		page;
+
+	/*
+	 * Allocate some memory to use for constructing the WAL record. Using
+	 * palloc() within a critical section is not safe, so we allocate this
+	 * beforehand.
+	 */
+	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+		scratch = palloc(BLCKSZ);
+
+	if (IsSystemRelation(relation))
+		Assert(ntuples == 1);
+
 	/*
 	 * Find buffer to insert this tuple into.  If the page is all visible,
 	 * this 

Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 In any case, it is now clear to me that this bug is capable of
 eating peoples' databases, as in what just happened to our most
 critical table?  Uh, it's not there anymore, boss, but we seem to
 have duplicate pg_class entries for this other table.
 
Based on this, I don't think we want to wait for bug reports.  One
would be too many.
 
+1 for backpatching a fix.
 
 As for it not being the last bug, no, it's probably not.  That's a
 pretty sucky excuse for not fixing it too.
 
I agree.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Heikki Linnakangas

On 12.08.2011 21:49, Robert Haas wrote:

On Fri, Aug 12, 2011 at 2:09 PM, Tom Lanet...@sss.pgh.pa.us  wrote:

2. Forget about targeting catcache invals by TID, and instead just use the
key hash value to determine which cache entries to drop.

Approach #2 seems a lot less invasive and more trustworthy, but it has the
disadvantage that cache invals would become more likely to blow away
entries unnecessarily (because of chance hashvalue collisions), even
without any VACUUM FULL being done.  If we could make approach #1 work
reliably, it would result in more overhead during VACUUM FULL but less at
other times --- or at least we could hope so.  In an environment where
lots of sinval overflows and consequent resets happen, we might come out
behind due to doubling the number of catcache flushes forced by a reset
event.

Right at the moment I'm leaning to approach #2.  I wonder if anyone
sees it differently, or has an idea for a third approach?


I don't think it really matters whether we occasionally blow away an
entry unnecessarily due to a hash-value collision.  IIUC, we'd only
need to worry about hash-value collisions between rows in the same
catalog; and the number of entries that we have cached had better be
many orders of magnitude less than 2^32.  If the cache is large enough
that we're having hash value collisions more than once in a great
while, we probably should have flushed some entries out of it a whole
lot sooner and a whole lot more aggressively, because we're likely
eating memory like crazy.


What would suck, though, is if you have an application that repeatedly 
creates and drops a temporary table, and the hash value for that happens 
to match some other table in the database. catcache invalidation would 
keep flushing the entry for that other table too, and you couldn't do 
anything about it except for renaming one of the tables.


Despite that, +1 for option #2. The risk of collision seems acceptable, 
and the consequence of a collision wouldn't be too bad in most 
applications anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Further news on Clang - spurious warnings

2011-08-12 Thread Peter Geoghegan
On 12 August 2011 20:10, David E. Wheeler da...@kineticode.com wrote:
 Ah, great, thanks!

Unfortunately, you'll need to build your own Clang to be up to speed
on the work I've done here, because the Clang developers fixed both
the spurious warning from assigning past what appears to be the end of
an array and a significant performance issue when building gram.c
and other grammar files. Building Clang is not difficult, but it's a
little time-consuming.

I imagine we'll have to wait until the release and wide availability
of Clang 3 before we start to see people using it for day-to-day
hacking on Postgres.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 12.08.2011 21:49, Robert Haas wrote:
 I don't think it really matters whether we occasionally blow away an
 entry unnecessarily due to a hash-value collision.  IIUC, we'd only
 need to worry about hash-value collisions between rows in the same
 catalog; and the number of entries that we have cached had better be
 many orders of magnitude less than 2^32.  If the cache is large enough
 that we're having hash value collisions more than once in a great
 while, we probably should have flushed some entries out of it a whole
 lot sooner and a whole lot more aggressively, because we're likely
 eating memory like crazy.

 What would suck, though, is if you have an application that repeatedly 
 creates and drops a temporary table, and the hash value for that happens 
 to match some other table in the database. catcache invalidation would 
 keep flushing the entry for that other table too, and you couldn't do 
 anything about it except for renaming one of the tables.

 Despite that, +1 for option #2. The risk of collision seems acceptable, 
 and the consequence of a collision wouldn't be too bad in most 
 applications anyway.

Yeah.  Also, to my mind this is only a fix that will be used in 9.0 and
9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
to invalidate catcaches instead of recording individual TIDs, I'm
excited about doing that instead for 9.2 and beyond.  I believe that
that could result in a significant reduction in sinval traffic, which
would be a considerable performance win.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Gurjeet Singh
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 COPY is slow.


No kidding!


 So at least for now, the patch simply falls back to inserting one row at a
 time if there are any triggers on the table.


Maybe we want to change that to fall back to old ways if there are any FOR
EACH ROW triggers, since FOR EACH STATEMENT triggers won't be bothered by
this optimization.

-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Florian Pflug
On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:
 Triggers complicate this. I believe it is only safe to group tuples together 
 like this if the table has no triggers. A BEFORE ROW trigger might run a 
 SELECT on the table being copied to, and check if some of the tuples we're 
 about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples 
 first, and only then insert them, none of the trigger invocations will see 
 the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after 
 inserting a bunch of tuples, the trigger for each of the insertions would see 
 all the inserted rows.

Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At 
least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not 
mistaken.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Heikki Linnakangas

On 11.08.2011 23:06, Robert Haas wrote:

Comments, testing, review appreciated...


I would've expected this to use an index-only scan:

postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id;
SELECT 10
postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
CREATE INDEX
postgres=# VACUUM ANALYZE foo;
VACUUM
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
   QUERY PLAN
-
 Index Scan using i_foo on foo  (cost=0.00..8.27 rows=1 width=4)
   Index Cond: (id = 10)
(2 rows)

If it's not a predicate index, then it works:

postgres=# DROP INDEX i_foo;
DROP INDEX
postgres=# CREATE INDEX i_foo2 ON foo (id);
CREATE INDEX
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
  QUERY PLAN
---
 Index Only Scan using i_foo2 on foo  (cost=0.00..8.28 rows=1 width=4)
   Index Cond: (id = 10)
(2 rows)

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 3:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In any case, it is now clear to me that this bug is capable of eating
 peoples' databases, as in what just happened to our most critical
 table?  Uh, it's not there anymore, boss, but we seem to have duplicate
 pg_class entries for this other table.

OK, I take your point.  That's certainly scary enough to worry about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:03 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 11.08.2011 23:06, Robert Haas wrote:

 Comments, testing, review appreciated...

 I would've expected this to use an index-only scan:

 postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id;

Ugh.  I think there's something wrong with this test:

+   if (index->amcanreturn && !index->predOK && !index->indexprs)

I think that should probably say something like (ind->indpred == NIL
|| ind->predOK).
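
That is, the test would become roughly (using the variable name from
the quoted line):

    if (index->amcanreturn &&
        (index->indpred == NIL || index->predOK) &&
        !index->indexprs)

so an index qualifies either when it has no predicate at all or when
its predicate is known to be implied by the query.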

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Heikki Linnakangas

On 12.08.2011 22:57, Florian Pflug wrote:

On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:

Triggers complicate this. I believe it is only safe to group tuples together 
like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT 
on the table being copied to, and check if some of the tuples we're about to 
insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and 
only then insert them, none of the trigger invocations will see the other rows 
as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a 
bunch of tuples, the trigger for each of the insertions would see all the 
inserted rows.


Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At 
least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not 
mistaken.


Um, yes, you're right. Now I feel silly. The above still applies to 
BEFORE ROW triggers, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-08-12 Thread Peter Eisentraut
On tor, 2011-08-04 at 14:59 -0400, Robert Haas wrote:
  Well, the facts are:  According to the SQL standard, table
 includes
  views and foreign tables.  According to scientific-ish database
  literature, a table is a relation and vice versa.
 
 So what are you supposed to call it if you mean, specifically, a
 table?

A table is either a base table, a derived table, a transient table, or
a viewed table. (SQL/MED adds foreign table.)

Just FYI.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut pete...@gmx.net wrote:
 On tor, 2011-08-04 at 14:59 -0400, Robert Haas wrote:
  Well, the facts are:  According to the SQL standard, table
 includes
  views and foreign tables.  According to scientific-ish database
  literature, a table is a relation and vice versa.

 So what are you supposed to call it if you mean, specifically, a
 table?

 A table is either a base table, a derived table, a transient table, or
 a viewed table. (SQL/MED adds foreign table.)

 Just FYI.

Base table seems clear enough, and a transient table sounds like a
temporary table, but what is a derived table?  Is a viewed table a
view?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread PostgreSQL - Hans-Jürgen Schönig
On Aug 12, 2011, at 10:03 PM, Heikki Linnakangas wrote:

 On 11.08.2011 23:06, Robert Haas wrote:
 Comments, testing, review appreciated...
 
 I would've expected this to use an index-only scan:
 
 postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id;
 SELECT 10
 postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
 CREATE INDEX
 postgres=# VACUUM ANALYZE foo;
 VACUUM
 postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
   QUERY PLAN
 -
 Index Scan using i_foo on foo  (cost=0.00..8.27 rows=1 width=4)
   Index Cond: (id = 10)
 (2 rows)
 
 If it's not a predicate index, then it works:
 
 postgres=# DROP INDEX i_foo;
 DROP INDEX
 postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
  QUERY PLAN
 ---
 Index Only Scan using i_foo2 on foo  (cost=0.00..8.28 rows=1 width=4)
   Index Cond: (id = 10)
 (2 rows)


Is there any plan to revise the cost estimate for index-only scans,
compared to what it is now?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extra check in 9.0 exclusion constraint unintended consequences

2011-08-12 Thread Jeff Davis
On Fri, 2011-08-12 at 14:58 -0400, Robert Haas wrote:
 Having thought about this a bit further, I'm coming around to the view
 that if it isn't worth adding this in master, it's not worth adding at
 all.  I just don't think it's going to get any visibility as a
 back-branch only doc patch.

Fine with me.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] New copyright program

2011-08-12 Thread David Fetter
Folks,

I noticed that src/tools/copyright looks like it can only be run on
Bruce's machine, so this translation to Perl is intended:

1.  To make the script idempotent, which allows its safe use in
automated tools that might run it many times.

2.  To get the script to run on any machine a PostgreSQL developer
would be using, as Perl is already required.

3.  To make the script more efficient.  As the copyright notice we
need to munge only appears once per file, it stops once it has made a
substitution.

Please find attached a patch implementing same.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
diff --git a/src/tools/copyright b/src/tools/copyright
deleted file mode 100755
index b0d61e1..000
--- a/src/tools/copyright
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/bin/sh
-
-# src/tools/copyright
-
-echo "Using current year:  `date '+%Y'`"
-
-rgrep -l 'Copyright.*PostgreSQL Global Development Group' | while read FILE
-do
-   pipe sed 's/^\(.*Copyright (c) [12][0-9][0-9][0-9]\) \?- \?[12][0-9][0-9][0-9] \?,\?\( PostgreSQL Global Development Group.*\)$/\1-'`date '+%Y'`',\2/' $FILE
-   # handle cases where only one year appears
-   pipe sed 's/^\(.*Copyright (c) [12][0-9][0-9][0-9]\) \?,\?\( PostgreSQL Global Development Group.*\)$/\1-'`date '+%Y'`',\2/' $FILE
-done
-
-echo "Manually update doc/src/sgml/legal.sgml and src/interfaces/libpq/libpq.rc.in too" 1>&2
diff --git a/src/tools/copyright.pl b/src/tools/copyright.pl
new file mode 100755
index 000..d981911
--- /dev/null
+++ b/src/tools/copyright.pl
@@ -0,0 +1,43 @@
+#!/usr/bin/perl 
+#
+# copyright.pl -- update copyright notices throughout the source tree, idempotently.
+#
+# Copyright (c) 2011, PostgreSQL Global Development Group
+#
+# src/tools/copyright.pl
+#
+
+use strict;
+use warnings;
+
+use File::Find;
+use Tie::File;
+
+my $pgdg = 'PostgreSQL Global Development Group';
+my $cc = 'Copyright (c) ';
+# year-1900 is what localtime(time) puts in element 5
+my $year = 1900+${[localtime(time)]}[5];
+
+print "Using current year:  $year\n";
+
+find({wanted => \&wanted, no_chdir => 1}, '.');
+
+sub wanted {
+return unless -f $File::Find::name;
+
+my @lines;
+tie @lines, 'Tie::File', $File::Find::name;
+
+foreach my $line (@lines) {
+# We only care about lines with a copyright notice.
+next unless $line =~ m/$cc.*$pgdg/;
+# We stop when we've done one substitution.  This is both for
+# efficiency and, at least in the case of this program, for
+# correctness.
+last if $line =~ m/$cc.*$year.*$pgdg/;
+last if $line =~ s/($cc\d{4})(, $pgdg)/$1-$year$2/;
+last if $line =~ s/($cc\d{4})-\d{4}(, $pgdg)/$1-$year$2/;
+}
+untie @lines;
+}
+print "Manually update doc/src/sgml/legal.sgml and src/interfaces/libpq/libpq.rc.in too\n";
+

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 7:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Right at the moment I'm leaning to approach #2.  I wonder if anyone
 sees it differently, or has an idea for a third approach?

You are trying to solve the problem directly, which seems overkill.

With HOT, there is very little need to perform a VACUUM FULL on any
shared catalog table. Look at the indexes...

I would suggest that VACUUM FULL perform only a normal VACUUM on
shared catalog tables, then perform an actual VACUUM FULL only in dire
need (some simple heuristic in size and density). This avoids doing a
VACUUM FULL unless it is actually necessary to do so. That has the
added advantage of not locking out essential tables, which is always a
concern.

In the unlikely event we do actually have to VACUUM FULL a shared
catalog table, nuke any cache entry for the whole shared catalog. That
way we absolutely and positively will never get any more bugs in this
area, ever again. Sounds harsh, but these events are only actually
needed very, very rarely and hygiene is more important than a few
minor points of performance.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 COPY is slow. Let's make it faster. One obvious optimization is to insert
 heap tuples in bigger chunks, instead of calling heap_insert() separately
 for every tuple. That saves the overhead of pinning and locking the buffer
 for every tuple, and you only need to write one WAL record for all the
 tuples written to the same page, instead of one for each tuple.

 Attached is a WIP patch to do that. It adds a new function,
 heap_multi_insert, which does the same thing as heap_insert, but works in
 bulk. It takes an array of tuples as argument, and tries to cram as many of
 them into the chosen target page as it can, and only writes a single WAL
 record of the operation.

 This gives a significant speedup to COPY, particularly for narrow tables,
 with small tuples. Grouping multiple tuples into one WAL record reduces the
 WAL volume significantly, and the time spent in writing that WAL. The
 reduced overhead of repeatedly locking the buffer is also most noticeable on
 narrow tables. On wider tables, the effects are smaller. See
 copytest-results.txt, containing test results with three tables of different
 widths. The scripts used to get those numbers are also attached.

 Triggers complicate this. I believe it is only safe to group tuples together
 like this if the table has no triggers. A BEFORE ROW trigger might run a
 SELECT on the table being copied to, and check if some of the tuples we're
 about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
 first, and only then insert them, none of the trigger invocations will see
 the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
 after inserting a bunch of tuples, the trigger for each of the insertions
 would see all the inserted rows. So at least for now, the patch simply falls
 back to inserting one row at a time if there are any triggers on the table.

 The patch is WIP, mainly because I didn't write the WAL replay routines yet,
 but please let me know if you see any issues.
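
To illustrate the batching described above, here is a toy model only, not the patch itself; the page capacity, the tuple sizes and the wal_log stand-in are made-up assumptions. The point is simply that tuples bound for the same page share one WAL record instead of one record each:

#include <stdio.h>

#define PAGE_FREE_SPACE 8000            /* assumed usable bytes per heap page */

/* stands in for emitting one WAL record covering a batch of tuples */
static void
wal_log(int first, int count)
{
    printf("one WAL record for tuples %d..%d\n", first, first + count - 1);
}

/* group tuples by target page: lock the page once, insert many, log once */
static void
copy_multi_insert(const int *tuple_sizes, int ntuples)
{
    int     i = 0;

    while (i < ntuples)
    {
        int     first = i;
        int     used = 0;

        /* cram as many tuples as fit onto the chosen target page */
        while (i < ntuples && used + tuple_sizes[i] <= PAGE_FREE_SPACE)
            used += tuple_sizes[i++];

        if (i == first)                 /* oversized tuple gets a page to itself */
            i++;

        wal_log(first, i - first);      /* a single record for the whole batch */
    }
}

int
main(void)
{
    int     sizes[] = {300, 300, 300, 7900, 300, 300};

    copy_multi_insert(sizes, 6);
    return 0;
}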

I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work.  I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Andrew Dunstan



On 08/12/2011 04:57 PM, Robert Haas wrote:

I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work.  I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.



Yeah, this looks very promising indeed. Well done. In fact, I'm asking 
myself how hard it will be to backport for one particular customer of 
ours, for whom the WAL load is a major hotspot.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enforcing that all WAL has been replayed after restoring from backup

2011-08-12 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Or add a signal
 handler in the pg_basebackup client emitting a warning about it?

  We don't have such a signal handler in pg_dump either. I don't think we should
 add it.

 Hmm. I guess an aborted pg_dump will also look ok but actually be
 corrupt (or incomplete). Good point.

What about having the signal handler corrupt the backup by adding some
garbage into it?  Now the failure case is obvious…

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 COPY is slow. Let's make it faster. One obvious optimization is to insert
 heap tuples in bigger chunks, instead of calling heap_insert() separately
 for every tuple. That saves the overhead of pinning and locking the buffer
 for every tuple, and you only need to write one WAL record for all the
 tuples written to the same page, instead of one for each tuple.

We don't pin the buffer for every tuple, that optimisation is already done...

When we discussed this before you said that it wasn't worth trying to
do this additional work - it was certainly a smaller gain than the one
we achieved by removing the pinning overhead.

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

So I'm a little surprised to see you working on this and I'm guessing
that the COPY improvement with indexes is barely noticeable. This
would be a nice improvement, but not until the bulk index inserts are
done.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Robert Haas
2011/8/12 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at:
 is there any plan to revise the cost for index only scans compared to what it 
 is now?

That's one of the points I asked for feedback on in my original email.
 How should the costing be done?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Merlin Moncure
On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 COPY is slow. Let's make it faster. One obvious optimization is to insert
 heap tuples in bigger chunks, instead of calling heap_insert() separately
 for every tuple. That saves the overhead of pinning and locking the buffer
 for every tuple, and you only need to write one WAL record for all the
 tuples written to the same page, instead of one for each tuple.

 Attached is a WIP patch to do that. It adds a new function,
 heap_multi_insert, which does the same thing as heap_insert, but works in
 bulk. It takes an array of tuples as argument, and tries to cram as many of
 them into the chosen target page as it can, and only writes a single WAL
 record of the operation.

 This gives a significant speedup to COPY, particularly for narrow tables,
 with small tuples. Grouping multiple tuples into one WAL record reduces the
 WAL volume significantly, and the time spent in writing that WAL. The
 reduced overhead of repeatedly locking the buffer is also most noticeable on
 narrow tables. On wider tables, the effects are smaller. See
 copytest-results.txt, containing test results with three tables of different
 widths. The scripts used to get those numbers are also attached.

 Triggers complicate this. I believe it is only safe to group tuples together
 like this if the table has no triggers. A BEFORE ROW trigger might run a
 SELECT on the table being copied to, and check if some of the tuples we're
 about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
 first, and only then insert them, none of the trigger invocations will see
 the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
 after inserting a bunch of tuples, the trigger for each of the insertions
 would see all the inserted rows. So at least for now, the patch simply falls
 back to inserting one row at a time if there are any triggers on the table.

But generic RI triggers would be ok, right?

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] USECS_* constants undefined with float8 timestamps?

2011-08-12 Thread Bruce Momjian
Robert Haas wrote:
 On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson
 joh...@2ndquadrant.com wrote:
  Hi all,
 
  I just noticed that the USECS_* constants are not defined when the server
  is compiled without integer dates and timestamps.
 
  Explicitly, timestamp.h is
 
  #ifdef HAVE_INT64_TIMESTAMP
  #define USECS_PER_DAY   INT64CONST(86400000000)
  #define USECS_PER_HOUR  INT64CONST(3600000000)
  #define USECS_PER_MINUTE INT64CONST(60000000)
  #define USECS_PER_SEC   INT64CONST(1000000)
  #endif
 
  Is there a particular reason for this?  Even with float8 timestamps
  there are uses for these constants in extensions.
 
 I don't see any particular reason not to define them unconditionally.

Well, they are only used by integer dates, so why expand their
visibility?  The define does make it clear how they are used.  I suppose
if someone wanted to use them outside that case, we could open them up. 
It is true that with integer dates now the default, we might find that
someone introduces compile problems by using them outside the integer
dates scope.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans

2011-08-12 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 That's one of the points I asked for feedback on in my original
 email.  How should the costing be done?
 
It seems pretty clear that there should be some cost adjustment.  If
you can't get good numbers somehow on what fraction of the heap
accesses will be needed, I would suggest using a magic number based
on the assumption that half the heap access otherwise necessary will
be done.  It wouldn't be the worst magic number in the planner.  Of
course, real numbers are always better if you can get them.
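
A minimal sketch of that idea (illustration only, not actual planner code; the function name and parameters here are just assumptions):

/*
 * Scale the heap-fetch part of an index scan's estimated cost by the
 * fraction of heap visits we expect to skip.  With no real statistics,
 * fall back to the magic assumption that half of them are avoided.
 */
static double
index_only_heap_fetch_cost(double plain_heap_fetch_cost,
                           double all_visible_frac)    /* < 0 means unknown */
{
    double  skipped = (all_visible_frac >= 0.0) ? all_visible_frac : 0.5;

    return plain_heap_fetch_cost * (1.0 - skipped);
}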
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] USECS_* constants undefined with float8 timestamps?

2011-08-12 Thread Johann 'Myrkraverk' Oskarsson
Bruce Momjian br...@momjian.us writes:

 Robert Haas wrote:
 On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson
 joh...@2ndquadrant.com wrote:
 
  I just noticed that the USECS_* constants are not defined when
  the server is compiled without integer dates and timestamps.
[snip]
 I don't see any particular reason not to define them unconditionally.

 Well, they are only used by integer dates, so why expand their
 visibility?  The define does make it clear how they are used.  I
 suppose if someone wanted to use them outside that case, we could
 open them up. It is true that with integer dates now the default, we
 might find that someone introduces compile problems by using them
 outside the integer dates scope.

I found a use for them in PL/Java which detects at run-time whether
the server is using floating point or integer dates.  The simplest way
was just to use magic numbers instead on the off chance it's compiled
with a server using float dates.
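
For what it's worth, here is a rough sketch of detecting this at run time from a libpq client and picking the matching unit constant (illustration only, not PL/Java's actual code; the helper names are made up):

#include <stdint.h>
#include <string.h>
#include <libpq-fe.h>

#define USECS_PER_SEC  INT64_C(1000000)

/* true if the server reports integer_datetimes = on */
static int
server_has_integer_datetimes(const PGconn *conn)
{
    const char *v = PQparameterStatus(conn, "integer_datetimes");

    return v != NULL && strcmp(v, "on") == 0;
}

/* express seconds in the unit the server's timestamps use */
static double
seconds_in_server_units(const PGconn *conn, double seconds)
{
    return server_has_integer_datetimes(conn)
        ? seconds * (double) USECS_PER_SEC      /* integer: microseconds */
        : seconds;                              /* float8: plain seconds */
}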


-- 
   Johann Oskarssonhttp://www.2ndquadrant.com/|[]
   PostgreSQL Development, 24x7 Support, Training and Services  --+--
  |
   Blog: http://my.opera.com/myrkraverk/blog/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Heikki Linnakangas

On 13.08.2011 00:17, Simon Riggs wrote:

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.


Sure, if you have indexes on the table already, then the overhead of 
updating them is more significant. I am actually working on that too, I 
will make a separate post about that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-08-12 Thread Heikki Linnakangas

On 13.08.2011 00:26, Merlin Moncure wrote:

On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.


But generic RI triggers would be ok, right?


RI triggers are AFTER ROW triggers, which we concluded to be OK after 
all, so they would be ok.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation

2011-08-12 Thread daveg
On Fri, Aug 12, 2011 at 09:26:02PM +0100, Simon Riggs wrote:
 With HOT, there is very little need to perform a VACUUM FULL on any
 shared catalog table. Look at the indexes...
 
  I would suggest that VACUUM FULL perform only a normal VACUUM on
 shared catalog tables, then perform an actual VACUUM FULL only in dire
 need (some simple heuristic in size and density). This avoids doing a
 VACUUM FULL unless it is actually necessary to do so. That has the
 added advantage of not locking out essential tables, which is always a
 concern.
 
 In the unlikely event we do actually have to VACUUM FULL a shared
 catalog table, nuke any cache entry for the whole shared catalog. That
 way we absolutely and positively will never get any more bugs in this
 area, ever again. Sounds harsh, but these events are only actually
 needed very, very rarely and hygiene is more important than a few
 minor points of performance.

This is a very optimistic view. My client makes heavy use of temp tables.
HOT and autovacuum are not sufficient to keep catalog bloat under control.
We run a daily script that calculates the density of the catalog and only
vaccum fulls those that are severely bloated. Here is a result from a
recent bloat check on one db. 'packed' is the number of pages needed for
the rows if they were packed, 'bloat' is the multiple of pages in use over
the number really needed.

 relation     | tuples | pages | packed | bloat
--------------+--------+-------+--------+-------
 pg_class     |   4292 | 10619 |    114 |  93.2
 pg_depend    |  25666 |  7665 |    217 |  35.4
 pg_attrdef   |   6585 |  7595 |    236 |  32.2
 pg_type      |   4570 |  8177 |    416 |  19.6
 pg_shdepend  |  52040 |  7968 |    438 |  18.2
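
For reference, the packed and bloat figures can be approximated like this (a sketch of the calculation only, not the actual check script; the average tuple width is an assumed input):

#include <math.h>
#include <stdio.h>

#define BLCKSZ 8192                     /* default PostgreSQL block size */

/* pages needed if the live tuples were tightly packed */
static double
packed_pages(double tuples, double avg_tuple_width)
{
    double  pages = ceil(tuples * avg_tuple_width / BLCKSZ);

    return pages < 1.0 ? 1.0 : pages;
}

/* multiple of pages in use over the number really needed */
static double
bloat_factor(double pages_in_use, double tuples, double avg_tuple_width)
{
    return pages_in_use / packed_pages(tuples, avg_tuple_width);
}

int
main(void)
{
    /* assuming ~217-byte tuples, this roughly reproduces the pg_class row */
    printf("packed = %.0f, bloat = %.1f\n",
           packed_pages(4292, 217), bloat_factor(10619, 4292, 217));
    return 0;
}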

-dg
-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread daveg
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote:
 I think there are reasonable arguments to make
 
 * prefer_cache = off (default) | on a table level storage parameter,
 =on will disable the use of BufferAccessStrategy
 
 * make cache_spoil_threshold a parameter, with default 0.25
 
 Considering the world of very large RAMs in which we now live, some
 control of the above makes sense.

As long as we are discussion cache settings for tables, I have a client
who would like to be able to lock specific tables and indexes into cache
as they have strict response time requirements for particular queries.
At the moment they are running postgres with a tablespace on ramfs and
taking frequent backups, but this is not optimal.

-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already

2011-08-12 Thread daveg

This seems to be bug month for my client. Now they are seeing periods
where all new connections fail immediately with the error:

   FATAL:  lock AccessShareLock on object 0/1260/0 is already held 

This happens on postgresql 8.4.7 on a large (512GB, 32 core) system that has
been up for months. It started happening sporadically a few days ago. It will
do this for a period of several minutes to an hour and then go back to
normal for hours or days.

One complete failing session out of several hundred around that time:
-
2011-08-09 00:01:04.446  8823  [unknown]  [unknown]  LOG:  connection received: host=op05.xxx port=34067
2011-08-09 00:01:04.446  8823  c77  apps  LOG:  connection authorized: user=apps database=c77
2011-08-09 00:01:04.449  8823  c77  apps  FATAL:  lock AccessShareLock on object 0/1260/0 is already held
--
 
What can I do to help track this down?

-dg


--
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PL/Perl Returned Array

2011-08-12 Thread David E. Wheeler
Hackers,

Given this script on 9.1beta3:

BEGIN;

CREATE EXTENSION plperl;

CREATE OR REPLACE FUNCTION wtf(
) RETURNS TEXT[] LANGUAGE plperl AS $$ return []; $$;

SELECT wtf() = '{}'::TEXT[];

ROLLBACK;

The output is:

BEGIN
CREATE EXTENSION
CREATE FUNCTION
 ?column? 
----------
 f
(1 row)

ROLLBACK

Why is that? If I save the values to TEXT[] columns, they are still not the 
same. But if I pg_dump them and then load them up again, they *are* the same. 
The dump looks like this:

CREATE TABLE gtf (
have text[],
want text[]
);

COPY gtf (have, want) FROM stdin;
{}  {}
\.


Is PL/Perl doing something funky under the hood to array values it returns on 
9.1?

Thanks,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/Perl Returned Array

2011-08-12 Thread Alex Hunsaker
On Fri, Aug 12, 2011 at 18:00, David E. Wheeler da...@kineticode.com wrote:
 Hackers,

 Given this script on 9.1beta3:

    BEGIN;

    CREATE EXTENSION plperl;

    CREATE OR REPLACE FUNCTION wtf(
    ) RETURNS TEXT[] LANGUAGE plperl AS $$ return []; $$;

    SELECT wtf() = '{}'::TEXT[];

 Why is that? If I save the values to TEXT[] columns, they are still not the 
 same. But if I pg_dump them and then load them up again, they *are* the same. 
 The dump looks like this:

Eek.

 Is PL/Perl doing something funky under the hood to array values it returns on 
 9.1?

Yeah, it's not handling empty arrays very well (it's calling
accumArrayResult which increments the number of elements even when we
are a zero length array).

This was rewritten in 9.1 as part of the array overhaul. Previously
returned arrays were converted into a string with the perl sub
encode_array_literal(); postgres would then convert that string into
the appropriate data type. Not only was that a tad inefficient, but it
failed to handle composite types nicely. In 9.1 you can pass in arrays
with composite types and the like, we figured it would be awfully nice
if you could return them as well :-)

Anyway, the attached patch fixes it for me. That is, when we don't have
an array state, just return an empty array.  (Also adds some
additional comments)

I thought about doing this right after we set dims[0] in
plperl_array_to_datum(), but that would fail to handle nested empty
multidim arrays like [[]]. As long as you can do that via
ARRAY[ARRAY[]]... I figure plperl should support it too.

Thanks for the report!
*** a/src/pl/plperl/plperl.c
--- b/src/pl/plperl/plperl.c
***
*** 1078,1091  _array_to_datum(AV *av, int *ndims, int *dims, int cur_depth,
  	int			i = 0;
  	int			len = av_len(av) + 1;
  
- 	if (len == 0)
- 		astate = accumArrayResult(astate, (Datum) 0, true, atypid, NULL);
- 
  	for (i = 0; i < len; i++)
  	{
  		SV		  **svp = av_fetch(av, i, FALSE);
  		SV		   *sav = svp ? get_perl_array_ref(*svp) : NULL;
  
  		if (sav)
  		{
  			AV		   *nav = (AV *) SvRV(sav);
--- 1078,1092 
  	int			i = 0;
  	int			len = av_len(av) + 1;
  
  	for (i = 0; i < len; i++)
  	{
+ 		/* fetch the array element */
  		SV		  **svp = av_fetch(av, i, FALSE);
+ 
+ 		/* see if this element is an array, if so get that */
  		SV		   *sav = svp ? get_perl_array_ref(*svp) : NULL;
  
+ 		/* multi-dimensional array? */
  		if (sav)
  		{
  			AV		   *nav = (AV *) SvRV(sav);
***
*** 1149,1154  plperl_array_to_datum(SV *src, Oid typid)
--- 1150,1158 
  	astate = _array_to_datum((AV *) SvRV(src), ndims, dims, 1, astate, typid,
  			 atypid);
  
+ 	if (!astate)
+ 		return PointerGetDatum(construct_empty_array(atypid));
+ 
  	for (i = 0; i < ndims; i++)
  		lbs[i] = 1;
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] USECS_* constants undefined with float8 timestamps?

2011-08-12 Thread Bruce Momjian
Johann 'Myrkraverk' Oskarsson wrote:
 Bruce Momjian br...@momjian.us writes:
 
  Robert Haas wrote:
  On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson
  joh...@2ndquadrant.com wrote:
  
   I just noticed that the USECS_* constants are not defined when
   the server is compiled without integer dates and timestamps.
 [snip]
  I don't see any particular reason not to define them unconditionally.
 
  Well, they are only used by integer dates, so why expand their
  visibility?  The define does make it clear how they are used.  I
  suppose if someone wanted to use them outside that case, we could
  open them up. It is true that with integer dates now the default, we
  might find that someone introduces compile problems by using them
  outside the integer dates scope.
 
 I found a use for them in PL/Java which detects at run-time whether
 the server is using floating point or integer dates.  The simplest way
 was just to use magic numbers instead on the off chance it's compiled
 with a server using float dates.

OK, that is a good reason.  Done for PG 9.2.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench internal contention

2011-08-12 Thread Bruce Momjian
Robert Haas wrote:
 Works for me.  Just to confirm, that means we'd also change GEQO to
 use pg_erand48().
 
  BTW, as far as the original plan of using random_r is concerned, how
  did you manage to not run into this?
  http://sourceware.org/bugzilla/show_bug.cgi?id=3662
  I just wasted half an hour on that stupidity in an unrelated context...
 
 Good grief.  It's hard to imagine a more user-hostile attitude than
 the one taken there.  I did run into that precise issue, but managed

Yes, it is scary to think people are relying on software written by
people with such a poor attitude toward quality.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/Perl Returned Array

2011-08-12 Thread David E. Wheeler
On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote:

 Anyway, the attached patch fixes it for me. That is when we don't have
 an array state, just return an empty array.  (Also adds some
 additional comments)

Fix confirmed, thank you!

+1 to getting this committed before the next beta/RC.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Greg Stark
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:
 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

I don't disagree with the general thrust of your point, but I just
wanted to point out one case where not using free buffers even though
they're available might make sense:

If you execute a large batch delete or update or even just set lots of
hint bits you'll dirty a lot of buffers. The ring buffer forces the
query that is actually dirtying all these buffers to also do the i/o
to write them out. Otherwise you leave them behind to slow down other
queries. This was one of the problems with the old vacuum code which
the ring buffer replaced. It left behind lots of dirtied buffers for
other queries to do i/o on.
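
To make that concrete, here is a toy model (illustration only, not the real buffer manager; the ring size and counters are assumptions) of how a small ring makes the query that dirties buffers pay for its own writes:

#include <stdio.h>
#include <string.h>

#define RING_SIZE 4                     /* tiny ring, as with a bulk-write strategy */

/* reuse slot i % RING_SIZE; if it is already dirty, this caller pays the write */
static void
use_ring_buffer(int dirty[], int i, int *writes_by_me)
{
    int     slot = i % RING_SIZE;

    if (dirty[slot])
        (*writes_by_me)++;              /* the dirtying query does the i/o itself */
    dirty[slot] = 1;                    /* we dirty it: delete/update/hint bits */
}

int
main(void)
{
    int     dirty[RING_SIZE];
    int     writes_by_me = 0;
    int     i;

    memset(dirty, 0, sizeof(dirty));
    for (i = 0; i < 1000; i++)          /* a large batch update */
        use_ring_buffer(dirty, i, &writes_by_me);

    /* all but the last RING_SIZE dirty buffers were written by this query itself */
    printf("buffers this query wrote out itself: %d\n", writes_by_me);
    return 0;
}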

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/Perl Returned Array

2011-08-12 Thread Darren Duncan

David E. Wheeler wrote:

On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote:

Anyway, the attached patch fixes it for me. That is when we don't have
an array state, just return an empty array.  (Also adds some
additional comments)


Fix confirmed, thank you!

+1 to getting this committed before the next beta/RC.


Policy question.  If this problem hadn't been detected until after 9.1 was 
declared production, would it have been considered a bug to fix in 9.1.x or 
would it have been delayed to 9.2 since that fix might be considered an 
incompatibility?  If the latter, then I'm really glad it was found now.

-- 
Darren Duncan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transient plans versus the SPI API

2011-08-12 Thread Bruce Momjian
Tom Lane wrote:
  Note that the SPI functions are more or less directly exposed in PL/Perl
  and PL/Python, and there are a number of existing idioms there that make
  use of prepared plans.  Changing the semantics of those functions might
  upset a lot of code.
 
 Right, but by the same token, if we don't change the default behavior,
 there is going to be a heck of a lot of code requiring manual adjustment
 before it can make use of the (hoped-to-be) improvements.  To me it
 makes more sense to change the default and then provide ways for people
 to lock down the behavior if the heuristic doesn't work for them.

Agreed.  I think the big sticking point is that without logic on how the
replanning will happen, users are having to guess how much impact this
new default behavior will have.  I also agree that this will harm some
uses but improve a larger pool of users.  Remember, the people on this
email list are probably using this feature in a much more sophisticated
way than the average user.

Also, there is a TODO idea that the results found by executing the query
(e.g. number of rows returned at each stage) could be fed back and
affect the replanning of queries.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers