Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread Jaime Casanova
On Wed, May 25, 2011 at 12:28 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, May 25, 2011 at 2:16 PM, Heikki Linnakangas

 To achieve that, I'm thinking to change walsender so that, when the standby
 has caught up with the master, it sends back the message indicating that to
 the standby. And I'm thinking to add new function (or view like
 pg_stat_replication) available on the standby, which shows that info.

 By the time the standby has received that message, it might not be caught-up
 anymore because new WAL might've been generated in the master already.

 Right. But, thanks to sync rep, until such new WAL has been replicated to
 the standby, the commit of the transaction is not visible to the client. So,
 even if there is some WAL not yet replicated to the standby, the clusterware
 can promote the standby safely without any data loss (from the client's point
 of view), I think.


Then you also need to transmit to the standby whether it is the current
sync standby.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: PostgreSQL support and training



[HACKERS] pg_basebackup

2011-05-25 Thread Emanuel Calvo
Hi all,

I was looking at the output of pg_basebackup and I saw postmaster.opts.
It's not necessary, although it's harmless. But it has the name of the
original cluster inside. If it's only kept for informational purposes,
maybe backup_label could be the right place.

It's just an idea.

Thoughts?

-- 
--
              Emanuel Calvo
              Helpame.com



[HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
Hi,

Query:
=
I am aware of the fact that the expression evaluator is used to filter tuples
based on the qual. However, I think that is not the only purpose it serves. As
I manipulated ExecEvalVar in the executor, the queries started behaving
abnormally (the comparison operators <, >, = lost effect and at times
started working). Is the creation of the qual also done (in part?) by the
expression evaluator? If yes, what part of the query creation process is
handled by the evaluator? I know ExecInitExpr does the expr -> exprstate tree
creation, but it is not the evaluator.

If the above is confusing, I just want to ask: is the expression evaluator,
even in part, responsible for {PLANNEDSTMT creation?



My work / trial so far:
===
I see that the target list to be scanned is handled by ExecTargetList
function.

I am not so sure about this because this function is not listed in the list
of functions which GDB shows me (./configure --enable-debug && make clean &&
make && make install). Everything else (almost, perhaps) is shown!
(Can anyone tell why? :( )

Whatever the case, I think the target list creation will somehow call
ExecEvalVar because:

1. It is this function whose modification gets me the erroneous behavior.
2. The server, when run at debug level 4 / 5, shows the target entries as VARs.
3. VARs are a representation of range table columns in some form, so it makes
sense for ExecEvalVar to create the target entry.
---


What changes caused this:
===

I inserted this line into ExecEvalVar as well as ExecEvalScalarVar:

return cute_datum_array[attnum];

before:

return slot_getattr(slot, attnum, isNull);

It was just an experiment to see how things are getting done. I know most
devs here would suggest GDB, but I am already using that and I did this to
check the possible effects rather than just trace.

-

I seriously think the target list would involve VARs which are evaluated
once. Since cute_datum_array was a garbage array (intentionally), it would
crash! Am I right?
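
To make that concrete, here's a tiny standalone illustration (not backend
code; the Datum typedef and cute_datum_array below are just stand-ins for
the experiment described above):

/* Standalone illustration only -- not PostgreSQL backend code. */
#include <stdio.h>
#include <stdint.h>

typedef uintptr_t Datum;                /* same idea as the backend's Datum */

static Datum cute_datum_array[16];      /* never filled in */

static Datum
fake_eval_var(int attnum)
{
    /* analogous to "return cute_datum_array[attnum];" in the experiment */
    return cute_datum_array[attnum];
}

int
main(void)
{
    /* The value is meaningless (here: zero, since the array is static); in
     * the executor, treating it as a pass-by-reference Datum and
     * dereferencing it is what leads to crashes or bogus comparisons. */
    printf("datum for attnum 3 = %lu\n", (unsigned long) fake_eval_var(3));
    return 0;
}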

Regards,
Vaibhav


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 3:22 PM, Vaibhav Kaushal 
vaibhavkaushal...@gmail.com wrote:

 I see that the target list to be scanned is handled by ExecTargetList
 function.

 I am not so sure about this because this function is not listed in the list
 of functions which GDB shows me (./configure --enable-debug && make clean &&
 make && make install). Everything else (almost, perhaps) is shown!
 (Can anyone tell why? :( )


You would need to turn optimization off by passing the -O0 flag to the
compiler. Otherwise static functions may get optimized away or inlined, and
you may not see them in the gdb stack.

CFLAGS=-O0 ./configure --enable-debug

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Simon Riggs
On Wed, May 18, 2011 at 2:11 PM, Leonardo Francalanci m_li...@yahoo.it wrote:
 this is a second version: now using

  int            counts[1];      /* variable-length array of counts */

 in xl_xact_commit to keep track of number of
 different arrays at the end of the struct.

 Waiting for feedbacks...

I can't find a clear discussion of what you are trying to do, and how,
just a URL back to a complex discussion on another topic.

I can't see any analysis that shows whether this would be effective in
reducing space of WAL records and what the impacts might be.

The patch contains very close to zero comments.

Please explain what you are trying to do, and what the likely benefits
of doing it will be. Add comments to the patch to make that very
clear, e.g. "As of 9.2, the record format changed to reduce space...".

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Pre-alloc ListCell's optimization

2011-05-25 Thread Stephen Frost
* Alvaro Herrera (alvhe...@commandprompt.com) wrote:
 Excerpts from Stephen Frost's message of mar may 24 22:56:21 -0400 2011:
 
A couple of notes regarding the patch:
  
First, it uses ffs(), which might not be fully portable...  We could
certainly implement the same thing in userspace and use ffs() when
it's available.
 
 Err, see RIGHTMOST_ONE in bitmapset.c.

hah, and I even looked for something first, apparently not very well.
Thanks, simple change at least.  :)
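
(For the record, a fallback along these lines is easy enough if ffs() is
missing on some platform -- this is just a hypothetical sketch, not the
actual RIGHTMOST_ONE code from bitmapset.c:)

/* Hypothetical portable "find first set bit" fallback; illustrative only. */
#include <stdio.h>
#include <stdint.h>

static int
my_ffs32(uint32_t w)
{
    int pos = 1;

    if (w == 0)
        return 0;               /* ffs() convention: no bits set */
    while ((w & 1) == 0)
    {
        w >>= 1;
        pos++;
    }
    return pos;
}

int
main(void)
{
    printf("%d %d %d\n", my_ffs32(0), my_ffs32(1), my_ffs32(12));   /* 0 1 3 */
    return 0;
}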

Stephen




Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Tue, May 24, 2011 at 10:59 PM, Robert Haas robertmh...@gmail.com wrote:

 So, first of all, thanks for putting some effort and thought into
 this.  Despite the large number of improvements in this area in 8.3
 and 8.4, this is still a pain point, and it would be really nice to
 find a way to make some further improvements.


Thanks for bringing up the idea during PGCon. That helped me to get
interested in this again. I hope we would be able to take this to a logical
conclusion this time and do something to alleviate the pain.


  On Tue, May 24, 2011 at 2:58 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
  So the idea is to separate the index vacuum (removing index pointers to
 dead
  tuples) from the heap vacuum. When we do heap vacuum (either by
 HOT-pruning
  or using regular vacuum), we can spool the dead line pointers somewhere.
 To
  avoid any hot-spots during normal processing, the spooling can be done
  periodically like the stats collection.

 What happens if the system crashes after a line pointer becomes dead
 but before the record of its death is safely on disk?  The fact that a
 previous index vacuum has committed is only sufficient justification
 for reclaiming the dead line pointers if you're positive that the
 index vacuum killed the index pointers for *every* dead line pointer.
 I'm not sure we want to go there; any operation that wants to make a
 line pointer dead will need to be XLOG'd.  Instead, I think we should
 stick with your original idea and just try to avoid the second heap
 pass.


I would not mind keeping the design simple for the first release. So even if
we can just avoid the second heap scan in vacuum, that would be a big win. In
the long term, though, I think it will pay off to keep track of dead line
pointers as they are generated. The only way they are generated is while
cleaning up a page while holding the clean-up lock, and that operation is
WAL-logged. So spooling dead line pointers during WAL replay should be
possible.

Anyway, I won't pursue that idea for now and am OK with a simplified version
to start with, where every heap vacuum is followed by an index vacuum,
collecting and holding the dead line pointers in maintenance memory.


 So to do that, as you say, we can have every operation that creates a
 dead line pointer note the LSN of the operation in the page.


Yes.


 But instead of allocating permanent space in the page header, which would
 both reduce (admittedly only by 8 bytes) the amount of space available
 for tuples, and more significantly have the effect of breaking on-disk
 compatibility, I'm wondering if we could get by with making space for
 that extra LSN only when it's actually present. In other words, when
 it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
 increment pd_upper, and use the extra space to store the LSN.  There
 is an alignment problem to worry about there but that shouldn't be a
 huge issue.


That might work but would require us to move tuples around when the first
dead line pointer gets generated in the page. You may argue that we should
be holding a cleanup-lock when that happens and the dead line pointer
creation is always followed by a call to PageRepairFragmentation(), so it
should be easier to make space for the LSN.

Instead of storing the LSN after the page header, would it be easier to set
pd_special and store the LSN at the end of the page ?


 When we vacuum, we remember the LSN before we start.  When we finish,
 if we scanned the indexes and everything completed without error, then
 we bump the heap's notion (wherever we store it) of the last
 successful index vacuum.  When we vacuum or do HOT cleanup on a page,
 if the page has a most-recent-dead-line pointer LSN and it precedes
 the start-of-last-successful-index-vacuum LSN, then we mark all the
 LP_DEAD tuples as LP_UNUSED and throw away the
 most-recent-dead-line-pointer LSN.


Right. And if the cleanup generates new dead line pointers, the LSN will be
replaced with the LSN of the current operation.
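
(To keep the mechanism concrete, a hypothetical sketch of the reclaim test
being discussed; the flag name PD_HAS_DEAD_LINE_PTR_LSN, the struct, and the
simplified LSN type are placeholders from this thread, not existing
PostgreSQL definitions:)

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FakeXLogRecPtr;        /* simplified LSN for illustration */

#define PD_HAS_DEAD_LINE_PTR_LSN 0x0008 /* placeholder flag bit */

typedef struct FakePageState
{
    uint16_t        pd_flags;
    FakeXLogRecPtr  dead_line_ptr_lsn;  /* LSN of the latest operation that
                                         * created a dead line pointer here */
} FakePageState;

/* May the LP_DEAD line pointers on this page be recycled to LP_UNUSED? */
static bool
can_reclaim_dead_line_pointers(const FakePageState *page,
                               FakeXLogRecPtr last_successful_index_vacuum)
{
    if ((page->pd_flags & PD_HAS_DEAD_LINE_PTR_LSN) == 0)
        return false;           /* no dead line pointers recorded */

    /*
     * Every dead line pointer on the page predates the start of the last
     * successful index vacuum, so its index pointers are already gone.
     */
    return page->dead_line_ptr_lsn < last_successful_index_vacuum;
}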


 One downside of this approach is that, if we do something like this,
 it'll become slightly more complicated to figure out where the item
 pointer array ends.  Another issue is that we might find ourselves
 wanting to extend the item pointer array to add a new item, and unable
 to do so easily because this most-recent-dead-line-pointer LSN is in
 the way.


I think that should be not so difficult to handle. I think handling this by
special space mechanism might be less complicated.


 If the LSN stored in the page precedes the
 start-of-last-successful-index-vacuum LSN, and if, further, we can get
 a buffer cleanup lock on the page, then we can do a HOT cleanup and
 life is good.  Otherwise, we can either (1) just forget about the
 most-recent-dead-line-pointer LSN - not ideal but not catastrophic
 either - or (2) if the start-of-last-successful-vacuum-LSN is old
 enough, we could overwrite an LP_DEAD line pointer in place.


I don't think we 

[HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Simon Riggs
I hear that CF manager is a difficult role for a single individual.

So it makes sense to split that role between multiple people.

I volunteer to be the CF manager for Replication, and also for
Performance. I have an interest in and long experience in each of
those areas, so that makes sense for me. The titles refer to the
categories on the CF application, self-selected by patch authors, not
an increased remit to wade in on anything that appears to fall into
those categories.

Patches need an eventual committer anyway, so this is a reasonable way
of having the process be managed by the eventual committer.

I don't see the role as an authoritarian one, so I will happily pass
over patches to other committers who may wish that and/or call for
help from particular people as required.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Domains versus polymorphic functions, redux

2011-05-25 Thread Noah Misch
On Tue, May 24, 2011 at 02:00:54PM -0400, Noah Misch wrote:
 On Tue, May 24, 2011 at 01:28:38PM -0400, Tom Lane wrote:
  Noah Misch n...@leadboat.com writes:
   On Tue, May 24, 2011 at 12:12:55PM -0400, Tom Lane wrote:
   This is a consequence of the changes I made to fix bug #5717,
   particularly the issues around ANYARRAY matching discussed here:
   http://archives.postgresql.org/pgsql-hackers/2010-10/msg01545.php
  
   We discussed this a few weeks ago:
   http://archives.postgresql.org/message-id/20110511093217.gb26...@tornado.gateway.2wire.net
  
   What's to recommend #1 over what I proposed then?  Seems like a discard of
   functionality for little benefit.
  
  I am unwilling to commit to making #2 work, especially not under time
  constraints; and you apparently aren't either, since you haven't
  produced the patch you alluded to at the end of that thread.
 
 I took your lack of any response as non-acceptance of the plan I outlined.
 Alas, the wrong conclusion.  I'll send a patch this week.

See attached arrdom1v1-polymorphism.patch.  This currently adds one syscache
lookup per array_append, array_prepend or array_cat call when the anyarray
type is not a domain.  When the type is a domain, it adds a few more.  We
could add caching without too much trouble.  I suppose someone out there uses
these functions in bulk operations, though I've yet to see it.  Is it worth
optimizing this straightway?

For a function like
  CREATE FUNCTION f(anyarray, VARIADIC anyarray) RETURNS int LANGUAGE sql
  AS 'SELECT array_length($1, 1) + array_length($2, 1)'
we must coerce the variadic argument array to a domain type when the other
anyarray argument(s) compel it.  Having implemented that, it was nearly free
to re-support a VARIADIC parameter specifically declared with a domain over an
array.  Consequently, I've done that as well.

See here for previously-disclosed rationale:
  
http://archives.postgresql.org/message-id/20110511191249.ga29...@tornado.gateway.2wire.net


I audited remaining get_element_type() callers.  CheckAttributeType() needs to
recurse into domains over array types just like any other array type.  Fixed
trivially in arrdom2v1-checkattr.patch; see its test case for an example hole.

nm
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 86e8c6b..8194519 100644
*** a/src/backend/catalog/pg_aggregate.c
--- b/src/backend/catalog/pg_aggregate.c
***************
*** 305,310 **** lookup_agg_function(List *fnName,
--- 305,311 ----
                             Oid *rettype)
  {
      Oid         fnOid;
+     Oid         vartype;
      bool        retset;
      int         nvargs;
      Oid        *true_oid_array;
***************
*** 321,327 **** lookup_agg_function(List *fnName,
       */
      fdresult = func_get_detail(fnName, NIL, NIL,
                                 nargs, input_types, false, false,
!                                &fnOid, rettype, &retset, &nvargs,
                                 &true_oid_array, NULL);
  
      /* only valid case is a normal function not returning a set */
--- 322,328 ----
       */
      fdresult = func_get_detail(fnName, NIL, NIL,
                                 nargs, input_types, false, false,
!                                &fnOid, rettype, &vartype, &retset, &nvargs,
                                 &true_oid_array, NULL);
  
      /* only valid case is a normal function not returning a set */
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 6250b07..8fdd88e 100644
*** a/src/backend/catalog/pg_proc.c
--- b/src/backend/catalog/pg_proc.c
***************
*** 272,278 **** ProcedureCreate(const char *procedureName,
                      variadicType = ANYELEMENTOID;
                      break;
                  default:
!                     variadicType = get_element_type(allParams[i]);
                      if (!OidIsValid(variadicType))
                          elog(ERROR, "variadic parameter is not an array");
                      break;
--- 272,278 ----
                      variadicType = ANYELEMENTOID;
                      break;
                  default:
!                     variadicType = get_base_element_type(allParams[i]);
                      if (!OidIsValid(variadicType))
                          elog(ERROR, "variadic parameter is not an array");

Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Simon Riggs
On Tue, May 24, 2011 at 7:58 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:

 The biggest gripe today is that vacuum needs two heap scans and each scan
 dirties the buffer.

That's not that clear to me. The debate usually stalls because we
don't have sufficient info from real world analysis of where the time
goes.

 So the idea is to separate the index vacuum (removing index pointers to dead
 tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
 or using regular vacuum), we can spool the dead line pointers somewhere.

ISTM it will be complex to attempt to store the exact list of TIDs
between VACUUMs.

At the moment we scan indexes if we have > 0 rows to remove, which is
probably wasteful. Perhaps it would be better to keep a running total
of rows to remove, by updating pg_stats, then when we hit a certain
threshold in total we can do the index scan. So we don't need to
remember the TIDs, just remember how many there were and use that to
avoid cleaning too vigorously.
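
(Roughly, the decision could be as simple as the following sketch; the names
and the scale-factor knob are hypothetical, not existing GUCs:)

#include <stdbool.h>
#include <stdint.h>

static bool
index_vacuum_wanted(uint64_t dead_line_ptrs_since_last_index_scan,
                    uint64_t rel_tuples,
                    double   index_cleanup_scale_factor)
{
    /* Only pay for a full index scan once enough dead TIDs have built up. */
    return dead_line_ptrs_since_last_index_scan >
           (uint64_t) (index_cleanup_scale_factor * (double) rel_tuples);
}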

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 6:16 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 To achieve that, I'm thinking to change walsender so that, when the standby
 has caught up with the master, it sends back the message indicating that to
 the standby. And I'm thinking to add new function (or view like
 pg_stat_replication) available on the standby, which shows that info.

 By the time the standby has received that message, it might not be caught-up
 anymore because new WAL might've been generated in the master already.

AFAICS, this is an exact case of the Byzantine General's problem.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas
While looking at fixing the multiple-evaluation issue in IN and BETWEEN 
discussed a while ago, I realized that the current assumption that only 
one CaseTestExpr placeholder needs to be valid at any given time is not 
true.


Here's a bit contrived example:

CREATE FUNCTION evileq (timestamptz, int4) returns boolean AS $$
 SELECT case $2 WHEN length($1::text) THEN true ELSE false END;
$$ language sql;
CREATE OPERATOR = (procedure = evileq, leftarg = timestamptz, rightarg = 
int4);


postgres=#  SELECT now() = 29, CASE now() WHEN 29 THEN 'foo' ELSE 'bar' END;
 ?column? | case
----------+------
 t        | bar
(1 row)

A direct call to the operator, now() = 29, returns true, but when used
in CASE-WHEN, which implicitly does the same comparison, the result is
false. Admittedly that's pretty far-fetched, but nevertheless it's a bug.


As part of the BETWEEN/IN fix, I was going to refactor CaseTestExpr and 
CoerceToDomainValue placeholder node types into one generic placeholder 
node. BETWEEN needs three placeholder slots in the worst case [*], and 
now it seems that we need to handle an arbitrary number of simultaneous 
placeholders even for CASE-WHEN.


So I'm going to put the BETWEEN/IN fix aside for now, and refactor the 
placeholder infrastructure to handle several simultaneous placeholders, 
and replace CaseTestExpr and CoerceToDomainValue with it. Actually 
Aggref and WindowFunc nodes look a lot like CaseTestExpr and 
CoerceToDomainValue too, but I'm a bit scared of touching those.


PS. This is all 9.2 material, in case you're wondering. We're talking 
about pretty big patches.


[*] a BETWEEN SYMMETRIC b AND c is handled as (a >= b AND a <= c) OR
(a <= b AND a >= c), leading to multiple evaluation of all three operands
if placeholders are not used


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas

On 25.05.2011 14:57, Heikki Linnakangas wrote:

Here's a bit contrived example:

CREATE FUNCTION evileq (timestamptz, int4) returns boolean AS $$
SELECT case $2 WHEN length($1::text) THEN true ELSE false END;
$$ language sql;
CREATE OPERATOR = (procedure = evileq, leftarg = timestamptz, rightarg =
int4);

postgres=# SELECT now() = 29, CASE now() WHEN 29 THEN 'foo' ELSE 'bar' END;
 ?column? | case
----------+------
 t        | bar
(1 row)

A direct call to the operator, now() = 29, returns true, but when used
in CASE-WHEN, which implicitly does the same comparison, the result is
false. Admittedly that's pretty far-fetched, but nevertheless it's a bug.


I should add that this works fine if the function is not an SQL function 
that gets inlined. But inlining is desirable, we don't want to give up 
on that, and inhibiting it in that case would need some extra 
bookkeeping anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, May 24, 2011 at 7:58 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:

 The biggest gripe today is that vacuum needs two heap scans and each scan
 dirties the buffer.

 That's not that clear to me. The debate usually stalls because we
 don't have sufficient info from real world analysis of where the time
 goes.

It probably wouldn't be too hard to write a simple patch to measure
time spent during the first heap pass, time spent scanning indexes,
and time spent on the second heap pass.  But it's not just about where
the time goes: as Pavan says, the second heap pass feels like a waste.
 Actually, the time spent scanning the indexes kinda feels like a
waste too, if the number of tuples being removed is small.  Which
brings me to your second point:

 So the idea is to separate the index vacuum (removing index pointers to dead
 tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
 or using regular vacuum), we can spool the dead line pointers somewhere.

 ISTM it will be complex to attempt to store the exact list of TIDs
 between VACUUMs.

 At the moment we scan indexes if we have > 0 rows to remove, which is
 probably wasteful. Perhaps it would be better to keep a running total
 of rows to remove, by updating pg_stats, then when we hit a certain
 threshold in total we can do the index scan. So we don't need to
 remember the TIDs, just remember how many there were and use that to
 avoid cleaning too vigorously.

That occurred to me, too.  If we're being launched by autovacuum then
we know that a number of updates and deletes equal ~20% (or whatever
autovacuum_vacuum_scale_factor is set to) of the table size have
occurred since the last autovacuum.  But it's possible that many of
those were HOT updates, in which case the number of index entries to
be cleaned up might be much less than 20% of the table size.
Alternatively, it's possible that we'd be better off vacuuming the
table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
something) but only doing the index scans every once in a while when
enough dead line pointers have accumulated.  After all, it's the first
heap pass that frees up most of the space; cleaning dead line pointers
seems a bit less urgent.  But not having done any real analysis of how
this would work out in practice, I'm not sure whether it's a good idea
or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Simon Riggs
On Tue, May 24, 2011 at 6:37 PM, Robert Haas robertmh...@gmail.com wrote:

 That being said, it's a slight extra cost for all fast-path lockers to 
 benefit
 the strong lockers, so I'm not prepared to guess whether it will pay off.

 Yeah.  Basically this entire idea is about trying to make life easier
 for weak lockers at the expense of making it more difficult for strong
 lockers.  I think that's a good trade-off in general, but we might
 need to wait until we have an actual implementation to judge whether
 we've turned the dial too far.

I like this overall concept, and I like the way this has been described
in terms of strong and weak locks. It seems very useful to me, since temp
tables can be skipped. That leaves shared DDL, and we have already done a lot
to reduce the lock levels held and are looking at further reductions
as well. I think even quite extensive delays are worth the trade-off.

I'd been looking at this also, though I hadn't mentioned it previously
because I found an Oracle patent that discusses dynamically turning
locking on and off. So that's something to be aware of. IMHO if we
discuss this in terms of sharing/not sharing locking information then
it is sufficient to avoid the patent. That patent also discusses making the
locking state change wait longer than required.

I got a bit lost with the description of a potential solution. It
seemed like you were unaware that there is a local lock and a shared
lock table, maybe just me?

Design seemed relatively easy from there: put local lock table in
shared memory for all procs. We then have a use_strong_lock at proc
and at transaction level. Anybody that wants a strong lock first sets
use_strong_lock at proc and transaction level, then copies all local
lock data into shared lock table, double checking
TransactionIdIsInProgress() each time. Then queues for lock using the
now fully set up shared lock table. When transaction with strong lock
completes we do not attempt to reset transaction level boolean, only
at proc level, since DDL often occurs in groups and we want to avoid
flip-flopping quickly between lock share states. Cleanup happens
regularly by bgwriter, perhaps every 10 seconds or so. All locks are
still visible for pg_locks.
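
In rough code terms, the sequence would be something like the sketch below
(hypothetical names, just to make the ordering concrete; not real PostgreSQL
code):

#include <stdbool.h>

struct fake_proc { bool use_strong_lock; };
struct fake_xact { bool use_strong_lock; };

static struct fake_proc my_proc;
static struct fake_xact my_xact;

static void publish_local_locks_to_shared_table(void) { /* copy + recheck */ }
static void queue_for_lock_in_shared_table(void)      { /* normal queueing */ }

static void
acquire_strong_lock_sketch(void)
{
    /* 1. Flag at proc and transaction level that strong locking is wanted. */
    my_proc.use_strong_lock = true;
    my_xact.use_strong_lock = true;

    /* 2. Copy all local lock data into the shared lock table, double-checking
     *    TransactionIdIsInProgress() for each entry (not shown). */
    publish_local_locks_to_shared_table();

    /* 3. Only then queue for the strong lock via the fully set up table.
     *    Flags are cleared lazily afterwards (e.g. by bgwriter), per the
     *    description above, to avoid flip-flopping between states. */
    queue_for_lock_in_shared_table();
}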

Hopefully that's of use.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 10:52 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2011-05-24 at 16:34 -0400, Robert Haas wrote:
 As I think about it a bit more, we'd
 need to XLOG not only the parts of the page we actually modifying, but
 any that the WAL record would need to be correct on replay.

 I don't understand that statement. Can you clarify?

I'll try.  Suppose we have two WAL records A and B, with no
intervening checkpoint, that both modify the same page.  A reads chunk
1 of that page and then modifies chunk 2.  B modifies chunk 1.  Now,
suppose we make A do a partial page write on chunk 2 only, and B do
the same for chunk 1.  At the point the system crashes, A and B are
both on disk, and the page has already been written to disk as well.
Replay begins from a checkpoint preceding A.

Now, when we get to the record for A, what are we to do?  If it were a
full page image, we could just restore it, and everything would be
fine after that.  But if we replay the partial page write, we've got
trouble.  A will now see the state of chunk 1 as it existed after
the action protected by B occurred, and will presumably do the wrong
thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 11:52 PM, Bruce Momjian br...@momjian.us wrote:
 Robert Haas wrote:
 2. The other fairly obvious alternative is to adjust our existing WAL
 record types to be idempotent - i.e. to not rely on the existing page
 contents.  For XLOG_HEAP_INSERT, we currently store the target tid and
 the tuple contents.  I'm not sure if there's anything else, but we
 would obviously need the offset where the new tuple should be written,
 which we currently infer from reading the existing page contents.  For
 XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would
 certainly need to store its offset within the block, and maybe the
 infomask.  For XLOG_HEAP_UPDATE, we'd need the old and new offsets and
 perhaps also the old and new infomasks.  Assuming that's all we need
 and I'm not missing anything (which I won't bet on), that means we'd
 be adding, say, 4 bytes per insert or delete and 8 bytes per update.
 So, if checkpoints are spread out widely enough that there will be
 more than ~2K operations per page between checkpoints, then it makes
 more sense to just do a full page write and call it good.  If not,
 this idea might have legs.

 I vote for wal_level = idempotent because so few people will know what
 idempotent means.  ;-)

That idea has the additional advantage of confusing the level of
detail of our WAL logging (minimal vs. archive vs. hot standby) with
the mechanism used to protect against torn pages (full page writes vs.
idempotent WAL records vs. prayer).  When they set it wrong and
destroy their system, we can tell them it's their own fault for not
configuring the system properly!  Bwahahahaha!

In all seriousness, I can't imagine that we'd make this
user-configurable in the first place, since that would amount to
having two sets of WAL records each of which would be even less well
tested than what we have now; and for a project this complex, we
probably shouldn't even consider changing things that seem to work now
unless the new system is clearly better than the old.

 Idempotent does seem like the most promising idea.

I tend to agree with you, but I'm worried it won't actually work out
to a win.  By the time we augment the records with enough additional
information we may have eaten up a lot of the benefit we were hoping
to get.
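
(To make "enough additional information" concrete, here is a purely
hypothetical sketch of the sort of extra fields an idempotent heap-insert
record might carry -- illustrative names only, not the real xl_heap_insert:)

#include <stdint.h>

/* Hypothetical sketch: what an "idempotent" heap-insert WAL record might
 * carry so that replay does not depend on existing page contents. */
typedef struct xl_heap_insert_idempotent_sketch
{
    uint32_t    blkno;          /* target heap block */
    uint16_t    offnum;         /* exact line pointer slot to (re)use */
    uint16_t    tuple_len;      /* length of the tuple data that follows */
    uint16_t    infomask;       /* tuple infomask to install verbatim */
    /* ... tuple data follows the fixed part of the record ... */
} xl_heap_insert_idempotent_sketch;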

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:27 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I got a bit lost with the description of a potential solution. It
 seemed like you were unaware that there is a local lock and a shared
 lock table, maybe just me?

No, I'm not unaware of the local lock table.  The point of this
proposal is to avoid fighting over the LWLocks that protect the shared
hash table by allowing some locks to be taken without touching it.

 Design seemed relatively easy from there: put local lock table in
 shared memory for all procs. We then have a use_strong_lock at proc
 and at transaction level. Anybody that wants a strong lock first sets
 use_strong_lock at proc and transaction level, then copies all local
 lock data into shared lock table, double checking
 TransactionIdIsInProgress() each time. Then queues for lock using the
 now fully set up shared lock table. When transaction with strong lock
 completes we do not attempt to reset transaction level boolean, only
 at proc level, since DDL often occurs in groups and we want to avoid
 flip-flopping quickly between lock share states. Cleanup happens by
 regularly by bgwriter, perhaps every 10 seconds or so. All locks are
 still visible for pg_locks.

I'm not following this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 25, 2011 at 8:27 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I got a bit lost with the description of a potential solution. It
 seemed like you were unaware that there is a local lock and a shared
 lock table, maybe just me?

 No, I'm not unaware of the local lock table.  The point of this
 proposal is to avoid fighting over the LWLocks that protect the shared
 hash table by allowing some locks to be taken without touching it.

Yes, I got that. I just couldn't work out where mmap came in.

 Design seemed relatively easy from there: put local lock table in
 shared memory for all procs. We then have a use_strong_lock at proc
 and at transaction level. Anybody that wants a strong lock first sets
 use_strong_lock at proc and transaction level, then copies all local
 lock data into shared lock table, double checking
 TransactionIdIsInProgress() each time. Then queues for lock using the
 now fully set up shared lock table. When transaction with strong lock
 completes we do not attempt to reset transaction level boolean, only
 at proc level, since DDL often occurs in groups and we want to avoid
 flip-flopping quickly between lock share states. Cleanup happens
 regularly by bgwriter, perhaps every 10 seconds or so. All locks are
 still visible for pg_locks.

 I'm not following this...

Which bit aren't you following? It's a design outline for how to
implement, deliberately brief to allow a discussion of design
alternatives.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] tackling full page writes

2011-05-25 Thread Simon Riggs
On Tue, May 24, 2011 at 9:34 PM, Robert Haas robertmh...@gmail.com wrote:

 I'm not sure that these ideas are much good, but
 for the sake of posterity:

Both (1) and (2) seem promising to me.

Heikki mentioned (2) would only be effective if we managed to change
*all* WAL records. ISTM likely that we would find that difficult
and/or time consuming and/or buggy.

I would qualify that by saying *all* WAL record types for a certain
page type, such as all btree index pages. Not that much better, since
heap and btree are the big ones, ISTM.

(1) seems like something we could do incrementally if we supported both
partial and full page writes at the same time. That way we could work on
the most frequent record types over time. Not sure if that is
possible, but it seems worth considering.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 1:27 PM, Robert Haas robertmh...@gmail.com wrote:

 At the moment we scan indexes if we have > 0 rows to remove, which is
 probably wasteful. Perhaps it would be better to keep a running total
 of rows to remove, by updating pg_stats, then when we hit a certain
 threshold in total we can do the index scan. So we don't need to
 remember the TIDs, just remember how many there were and use that to
 avoid cleaning too vigorously.

 That occurred to me, too.  If we're being launched by autovacuum then
 we know that a number of updates and deletes equal ~20% (or whatever
 autovacuum_vacuum_scale_factor is set to) of the table size have
 occurred since the last autovacuum.  But it's possible that many of
 those were HOT updates, in which case the number of index entries to
 be cleaned up might be much less than 20% of the table size.
 Alternatively, it's possible that we'd be better off vacuuming the
 table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
 something) but only doing the index scans every once in a while when
 enough dead line pointers have accumulated.  After all, it's the first
 heap pass that frees up most of the space; cleaning dead line pointers
 seems a bit less urgent.  But not having done any real analysis of how
 this would work out in practice, I'm not sure whether it's a good idea
 or not.

We know whether a TID was once in the index or not, so we can keep an
exact count. HOT doesn't come into it.

We can also remove TIDs from the index without VACUUM, during btree
split avoidance. We can optimise the second scan by skipping htids no
longer present in the index, though we'd need a spare bit to mark that
usage, which I'm not sure we have.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
 Da: Simon Riggs si...@2ndquadrant.com
 I can't find a clear  discussion of what you are trying to do, and how,
 just a URL back to a  complex discussion on another topic.


While trying to write a patch to allow changing an unlogged table into
a logged one, I had to add another int field to xl_xact_commit.
Robert Haas said:

 I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.

which is a correct remark.

xl_xact_commit can contain some arrays (relations to drop,
committed sub-transactions, shared invalidation messages). The lengths of
these arrays are specified using 3 ints in the struct.

So, to avoid adding more ints to the struct, it was suggested that I
remove all those ints and use xl_xact_commit.xinfo to flag which
arrays are, in fact, present.

So the whole idea is (see the sketch below):

- remove nrels, nsubxacts and nmsgs from xl_xact_commit
- use bits in xinfo to signal which arrays are present at the end
of xl_xact_commit
- for each present array, add the length of the array (as an int) at
the end of xl_xact_commit
- add each present array after all the lengths
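
A rough sketch of the resulting layout (hypothetical names and flag values,
just to illustrate the idea; not the actual patch):

#include <stdint.h>

#define XACT_COMMIT_HAS_RELFILENODES   0x01     /* placeholder flag values */
#define XACT_COMMIT_HAS_SUBXACTS       0x02
#define XACT_COMMIT_HAS_INVAL_MSGS     0x04

typedef struct xl_xact_commit_sketch
{
    int64_t     xact_time;      /* commit timestamp */
    uint32_t    xinfo;          /* which optional arrays follow */
    /*
     * Followed on disk by:
     *   int32 counts[n]    -- one length per bit set in xinfo
     *   ... then the n variable-length arrays, in the same order ...
     */
} xl_xact_commit_sketch;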


 I can't see any analysis that shows  whether this would be effective in
 reducing space of WAL records and what the  impacts might be.


The space used by WAL commit records should always be <= what it is without
the patch: in the worst-case scenario, all the ints are present (as they
would be without the patch). In the best case, which I guess is the more
common one, each xl_xact_commit will be 3 ints shorter.

I don't think the effect will be perceptible: the whole idea is to allow
more variable-length structures in xl_xact_commit in the future,
without increasing the space used on disk.

 The patch contains very close to zero  comments.


I tried to add some.
 
 

Leonardo

commitlog_lessbytes01.patch
Description: Binary data



Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 But instead of allocating permanent space in the page header, which would
 both reduce (admittedly only by 8 bytes) the amount of space available
 for tuples, and more significantly have the effect of breaking on-disk
 compatibility, I'm wondering if we could get by with making space for
 that extra LSN only when it's actually present. In other words, when
 it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
 increment pd_upper, and use the extra space to store the LSN.  There
 is an alignment problem to worry about there but that shouldn't be a
 huge issue.

 That might work but would require us to move tuples around when the first
 dead line pointer gets generated in the page.

I'm confused.  A major point of the approach I was proposing was to
avoid having to move tuples around.

 You may argue that we should
 be holding a cleanup-lock when that happens and the dead line pointer
 creation is always followed by a call to PageRepairFragmentation(), so it
 should be easier to make space for the LSN.

I'm not sure if this is the same thing you're saying, but certainly
the only time we need to make space for this value is when we've just
removed tuples from the page and defragmented, and at that point there
should certainly be 8 bytes free somewhere.

 Instead of storing the LSN after the page header, would it be easier to set
 pd_special and store the LSN at the end of the page ?

I was proposing storing it after the line pointer array, not after the
page header.  If we store it at the end of the page, I suspect we're
going to basically end up allocating permanent space for it, because
otherwise we'll have to shift all the tuple data forward and backward
by 8 bytes when we allocate or deallocate space for this.  Now, maybe
that's OK: I'm not sure.  But it's something to think about carefully.
 If we are going to allocate permanent space, the special space seems
better than the page header, because we should be able to make that
work without breaking on-disk compatibility, and because AFAIUI we only need
the space for heap pages, not index pages.

 I think that should be not so difficult to handle. I think handling this by
 special space mechanism might be less complicated.

A permanent space allocation will certainly be simpler.  I'm just not
sure how much we care about giving up 8 bytes that could potentially
be used to store tuple data.

 If the LSN stored in the page precedes the
 start-of-last-successful-index-vacuum LSN, and if, further, we can get
 a buffer cleanup lock on the page, then we can do a HOT cleanup and
 life is good.  Otherwise, we can either (1) just forget about the
 most-recent-dead-line-pointer LSN - not ideal but not catastrophic
 either - or (2) if the start-of-last-successful-vacuum-LSN is old
 enough, we could overwrite an LP_DEAD line pointer in place.

 I don't think we need the cleanup lock to turn the LP_DEAD line pointers to
 LP_UNUSED since that does not involve moving tuples around. So a simple
 EXCLUSIVE lock should be enough. But we would need to WAL log the operation
 of turning DEAD to UNUSED, so it would be simpler to consolidate this in HOT
 pruning. There could be exceptions such as, say large number of DEAD line
 pointers are accumulated towards the end and reclaiming those would free up
 substantial space in the page. But may be we can use those conditions to
 invoke HOT prune instead of handling them separately.

Makes sense.

 Another issue is that this causes problems for temporary and unlogged
 tables, because no WAL records are generated and, therefore, the LSN
 does not advance.  This is also a problem for GIST indexes; Heikki
 fixed temporary GIST indexes by generating fake LSNs off of a
 backend-local counter.  Unlogged GIST indexes are currently not
 supported.  I think what we need to do is create an API to which you
 can pass a relation and get an LSN.  If it's a permanent relation, you
 get a regular LSN.  If it's a temporary relation, you get a fake LSN
 based on a backend-local counter.  If it's an unlogged relation, you
 get a fake LSN based on a shared-memory counter that is reset on
 restart.  If we can encapsulate that properly, it should provide both
 what we need to make this idea work and allow a somewhat graceful fix
 for GIST-vs-unlogged problem.

 Can you explain more how things would work for unlogged tables ? Do we use
 the same shared memory counter for tracking last successful index vacuum ?

Yes.

 If so, how do we handle the case where after restart the page may get LSN
 less than the index vacuum LSN if the index vacuum happened before the
 crash/stop ?

Well, on a crash, the unlogged relations get truncated, and their
indexes also, so no problem.  On a clean shutdown, I guess we need to
arrange to save the counter across restarts.

Take a look at the existing logic around GetXLogRecPtrForTemp().
That's slightly different, because there we don't even need to be
consistent across backends.  We just need an increasing sequence of
values.  For unlogged relations things are a bit more complex - but it
seems manageable.
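
(A hypothetical sketch of that relation-to-LSN API, with stand-in names --
just to illustrate the three cases, not proposed code:)

#include <stdint.h>

typedef uint64_t FakeLSN;

static FakeLSN backend_local_counter;   /* per-backend, for temp relations */
static FakeLSN shared_counter;          /* would live in shared memory and be
                                         * reset on crash restart (unlogged) */

static FakeLSN current_wal_insert_lsn(void) { return 0; /* stub */ }

static FakeLSN
get_lsn_for_relation(char relpersistence)
{
    switch (relpersistence)
    {
        case 'p': return current_wal_insert_lsn();  /* permanent: real LSN */
        case 't': return ++backend_local_counter;   /* temp: fake, backend-local */
        case 'u': return ++shared_counter;          /* unlogged: fake, shared */
        default:  return 0;
    }
}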

Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci


commitlog_lessbytes02.patch
Description: Binary data



Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 25.05.2011 07:42, Fujii Masao wrote:
 To achieve that, I'm thinking to change walsender so that, when the standby
 has caught up with the master, it sends back the message indicating that to
 the standby. And I'm thinking to add new function (or view like
 pg_stat_replication) available on the standby, which shows that info.

 By the time the standby has received that message, it might not be 
 caught-up anymore because new WAL might've been generated in the master 
 already.

Even assuming that you believe this is a useful capability, there is no
need to change walsender.  It *already* sends the current-end-of-WAL in
every message, which indicates precisely whether the message contains
all of available WAL data.

regards, tom lane



Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 7:20 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
 But instead of allocating permanent space in the page header, which would
 both reduce (admittedly only by 8 bytes) the amount of space available
 for tuples, and more significantly have the effect of breaking on-disk
 compatibility, I'm wondering if we could get by with making space for
 that extra LSN only when it's actually present. In other words, when
 it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
 increment pd_upper, and use the extra space to store the LSN.  There
 is an alignment problem to worry about there but that shouldn't be a
 huge issue.

 That might work but would require us to move tuples around when the first
 dead line pointer gets generated in the page.

 I'm confused.  A major point of the approach I was proposing was to
 avoid having to move tuples around.


Well, I am not sure how you can always guarantee to make space
available for the LSN without moving tuples, irrespective of where
you store it.  But probably it's not important, as we discussed below.

 You may argue that we should
 be holding a cleanup-lock when that happens and the dead line pointer
 creation is always followed by a call to PageRepairFragmentation(), so it
 should be easier to make space for the LSN.

 I'm not sure if this is the same thing you're saying, but certainly
 the only time we need to make space for this value is when we've just
 removed tuples from the page and defragmented, and at that point there
 should certainly be 8 bytes free somewhere.


Agree.

 Instead of storing the LSN after the page header, would it be easier to set
 pd_special and store the LSN at the end of the page ?

 I was proposing storing it after the line pointer array, not after the
 page header.  If we store it at the end of the page, I suspect we're
 going to basically end up allocating permanent space for it, because
 otherwise we'll have to shift all the tuple data forward and backward
 by 8 bytes when we allocate or deallocate space for this.  Now, maybe
 that's OK: I'm not sure.  But it's something to think about carefully.
  If we are going to allocate permanent space, the special space seems
 better than the page header, because we should be able to make that
 work without breaking on-disk compatibility, and because AFAIUI we only need
 the space for heap pages, not index pages.


I think if we are reclaiming LP_DEAD line pointers only while
defragmenting the page, we can always reclaim the space for the LSN,
irrespective of where we store it. So maybe we should decide
depending on what matters for on-disk compatibility and what
requires the least invasive changes. I don't know what that is yet.



 If so, how do we handle the case where after restart the page may get LSN
 less than the index vacuum LSN if the index vacuum happened before the
 crash/stop ?

 Well, on a crash, the unlogged relations get truncated, and their
 indexes also, so no problem.  On a clean shutdown, I guess we need to
 arrange to save the counter across restarts.

Oh, ok. I was not aware that unlogged tables get truncated. I think we
can just restore from the value stored for the last successful index
vacuum (after incrementing it, maybe). That should be possible.


 Take a look at the existing logic around GetXLogRecPtrForTemp().
 That's slightly different, because there we don't even need to be
 consistent across backends.  We just need an increasing sequence of
 values.  For unlogged relations things are a bit more complex - but it
 seems manageable.

Ok. Will look at it.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
 So it makes sense to split that role between multiple people.

I agree that it'd be good to have multiple people supporting the CF.
I've been considering volunteering to be part of such a group for a
given CF.

 I volunteer to be the CF manager for Replication, and also for
 Performance.

I dislike the idea of splitting up the duties along these lines.
However, if others are willing to handle it that way and we can get
coverage for all the CF categories, I wouldn't object.  The reason
that I dislike this is simply that the actual work of the CF manager
isn't really distinguished in any way based on whether it's a Replication
patch, a Performance patch, or some other patch.  Most of the CF work
is about following up with reviewers and authors and committers.

I feel this kind of "I'd prefer to be working on what interests me / what
I'm knowledgeable in" concept is typically addressed through the
self-volunteering reviewers, where someone will mark themselves down as a
reviewer for a specific patch because it's what they're interested in.
Additionally, reviewers, when volunteering, can, and often do, ask for
specific kinds of patches (though it's usually driven more by 'size' or
'complexity' than category).  That feels like a better place to put this
than at the CF manager level.

 Patches need an eventual committer anyway, so this is a reasonable way
 of having the process be managed by the eventual committer.

I also don't like the idea that committers, when supporting a
commitfest, would only be involved/working on specific categories of
patches.  I feel that part of the idea of a commitfest is to have
individuals, particularly committers, looking at patches which fall
outside of what they're currently working on (since they can/could
commit those things more-or-less anytime).

My thinking (and I have no idea how others feel or if anyone's even
considered it this way) is simply that part of the responsibility of a
committer is to help get committed those non-committer patches which are
good for PG and submitted through the right process, etc., but which may
not be of specific interest to any given committer.  If a
patch is already of interest to a committer because it hits on a
hot-spot with them, then that patch doesn't really *need* a CF to get
done.

Thanks,

Stephen




Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 5:57 PM, Robert Haas robertmh...@gmail.com wrote:


 Alternatively, it's possible that we'd be better off vacuuming the
 table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
 something) but only doing the index scans every once in a while when
 enough dead line pointers have accumulated.

That's precisely the reason I suggested separating heap and index
vacuums, instead of the tight integration we have now. If we don't
spool the dead line pointers in a separate area, though, we would need
to make sure that the index vacuum runs through the heap first to collect
the dead line pointers and then removes the corresponding index
pointers. We would also need to take into consideration the
implications for the visibility map for any such scheme to work correctly
and efficiently.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] tackling full page writes

2011-05-25 Thread Bruce Momjian
Robert Haas wrote:
 That idea has the additional advantage of confusing the level of
 detail of our WAL logging (minimal vs. archive vs. hot standby) with
 the mechanism used to protect against torn pages (full page writes vs.
 idempotent WAL records vs. prayer).  When they set it wrong and
 destroy their system, we can tell them it's their own fault for not
 configuring the system properly!  Bwahahahaha!

I love it!  Create confusing configuration options and blame the user!

 In all seriousness, I can't imagine that we'd make this
 user-configurable in the first place, since that would amount to
 having two sets of WAL records each of which would be even less well
 tested than what we have now; and for a project this complex, we
 probably shouldn't even consider changing things that seem to work now
 unless the new system is clearly better than the old.
 
  Idempotent does seem like the most promising idea.
 
 I tend to agree with you, but I'm worried it won't actually work out
 to a win.  By the time we augment the records with enough additional
 information we may have eaten up a lot of the benefit we were hoping
 to get.

This is where I was confused.  Our bad case now is when someone modifies
one row on a page between checkpoints --- instead of writing 400 bytes,
we write 8400.  What portion of between-checkpoint activity writes more
than a few rows to a page?  I didn't think many, except for COPY. 
Ideally we could switch in and out of this mode per page, but that seems
super-complicated.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Greg Smith

Michael Nolan wrote:
Based on last year's discussion of this TODO item, it seems thoughts 
have been focused on estimating how much data is
being satisfied from PG's shared buffers.  However, I think that's 
only part of the problem.  


Sure, but neither it nor what you're talking about is the real gating 
factor on making an improvement here.  Figuring out how to expose all 
this information to the optimizer so it can use it when planning is the 
hard part.  Collecting a read-time profile is just one of the many ways 
you can estimate what's in cache, and each of the possible methods has 
good and bad trade-offs.  I've been suggesting that people assume that's 
a solved problem--I'm pretty sure what you're proposing was done by Greg 
Stark once, and even a prototype built--and instead ask what you would do 
next if you had this data.


This data would probably need to be kept separately for each table or 
index, as some tables or indexes
may be mostly or fully in cache or on faster physical media than 
others, although in the absence of other
data about a specific table or index, data about other relations in 
the same tablespace might be of some use. 


This is the important part.  Model how the data needs to get stored such 
that the optimizer can make decisions using it, and I consider it easy 
to figure out how it will get populated later.  There are actually 
multiple ways to do it, and it may end up being something where people 
plug in an implementation that fits their workload, rather than just 
having one available.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 I hear that CF manager is a difficult role for a single individual.
 So it makes sense to split that role between multiple people.

 I volunteer to be the CF manager for Replication, and also for
 Performance. ...
 Patches need an eventual committer anyway, so this is a reasonable way
 of having the process be managed by the eventual committer.

ISTM that we really want the CF manager to be somebody who is *not*
directly involved in reviewing or committing patches.  It's a distinct
skill set, so there is no reason why it's a good idea for a committer
to do it.  And we need to get the CF work load more spread out, not more
concentrated.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Leonardo Francalanci
Sorry, email sent without body.

Fixed some English mistakes.

commitlog_lessbytes02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 I hear that CF manager is a difficult role for a single individual.
 So it makes sense to split that role between multiple people.

 I volunteer to be the CF manager for Replication, and also for
 Performance. ...
 Patches need an eventual committer anyway, so this is a reasonable way
 of having the process be managed by the eventual committer.

 ISTM that we really want the CF manager to be somebody who is *not*
 directly involved in reviewing or committing patches.  It's a distinct
 skill set, so there is no reason why it's a good idea for a committer
 to do it.  And we need to get the CF work load more spread out, not more
 concentrated.

I agree it makes sense if a non-committer performs the role. If a
committer does take the role, I would volunteer to split the role and
for me to work on the suggested areas.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, May 25, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 25, 2011 at 8:27 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Design seemed relatively easy from there: put local lock table in
 shared memory for all procs. We then have a use_strong_lock at proc
 and at transaction level. Anybody that wants a strong lock first sets
 use_strong_lock at proc and transaction level, then copies all local
 lock data into shared lock table,

 I'm not following this...

 Which bit aren't you following? It's a design outline for how to
 implement, deliberately brief to allow a discussion of design
 alternatives.

What I'm not following is how moving the local lock table into shared
memory can possibly be a good idea.  The reason we invented the local
lock table in the first place (we didn't use to have one) is so that a
process could do some manipulations without touching shared memory.
(Notably, it is currently nearly free, and certainly lock-free, to
re-request a lock type you already hold.  This is not an infrequent
case.)  That advantage will go away if you do this.
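
To illustrate why a re-request of an already-held lock is nearly free today,
the backend-local fast path behaves roughly like the sketch below. The names
and structures here are invented stand-ins for illustration, not the real
lock.c code:

/*
 * Much-simplified illustration: re-requesting a lock this backend already
 * holds never touches shared memory, it just bumps a count in a
 * backend-local table.  All names and types are invented stand-ins.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct LocalLockStub
{
    uint32_t    lockid;         /* identifies (object, lockmode) */
    int         nLocks;         /* times this backend already holds it */
} LocalLockStub;

#define MAX_LOCAL_LOCKS 64
static LocalLockStub local_locks[MAX_LOCAL_LOCKS];
static int  n_local_locks = 0;

/* stand-in for the real path through the shared, LWLock-protected table */
static bool
acquire_via_shared_table(uint32_t lockid)
{
    (void) lockid;
    return true;
}

bool
lock_acquire(uint32_t lockid)
{
    int         i;

    /* fast path: already held locally, so just bump the count */
    for (i = 0; i < n_local_locks; i++)
    {
        if (local_locks[i].lockid == lockid)
        {
            local_locks[i].nLocks++;
            return true;        /* no shared memory touched, no LWLock taken */
        }
    }

    /* slow path: consult the shared lock table, then remember it locally */
    if (n_local_locks >= MAX_LOCAL_LOCKS || !acquire_via_shared_table(lockid))
        return false;

    local_locks[n_local_locks].lockid = lockid;
    local_locks[n_local_locks].nLocks = 1;
    n_local_locks++;
    return true;
}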

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/25 Hitoshi Harada umi.tan...@gmail.com:
 So I'm still
 thinking which of pulling up and parameterized scan is better.

After more investigation I came up with third idea, pushing down
RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
but it seems to me like it is slightly easier than pulling up
agg-subquery. The main point is that when you want to pull up, you
must make sure the final output is still correct in the presence of join
relations other than the aggregate-related one. In contrast, when pushing
down the target relation to the agg-subquery it is simple to ensure the
result is correct.

I'm looking around pull_up_subqueries() in subquery_planner() to add
the pushing down logic. It could be possible to do it around
make_one_rel() but I bet query structure changes are doable against
Query, not PlannerInfo.

I appreciate any comments.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Alvaro Herrera
Excerpts from Vaibhav Kaushal's message of Wed May 25 05:52:32 -0400 2011:

 If the above is confusing, I just want to ask: Is expression evaluator,
 even in part responsible for {PLANNEDSTMT creation?

Yeah, as far as I understood Tom's talk, the expr evaluator is used to
reduce some expressions to constants and such.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Vaibhav Kaushal's message of Wed May 25 05:52:32 -0400 2011:
 If the above is confusing, I just want to ask: Is expression evaluator,
 even in part responsible for {PLANNEDSTMT creation?

 Yeah, as far as I understood Tom's talk, the expr evaluator is used to
 reduce some expressions to constants and such.

The planner would never call it with an expression containing a Var,
though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Tom Lane
Hitoshi Harada umi.tan...@gmail.com writes:
 2011/5/25 Hitoshi Harada umi.tan...@gmail.com:
 So I'm still
 thinking which of pulling up and parameterized scan is better.

 After more investigation I came up with third idea, pushing down
 RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

Yes, it sure does.  Won't that typically change the number of rows
produced by the subquery's jointree?  And thus possibly change the
aggregate's result?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 8:56 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 25, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 25, 2011 at 8:27 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I got a bit lost with the description of a potential solution. It
 seemed like you were unaware that there is a local lock and a shared
 lock table, maybe just me?

 No, I'm not unaware of the local lock table.  The point of this
 proposal is to avoid fighting over the LWLocks that protect the shared
 hash table by allowing some locks to be taken without touching it.

 Yes, I got that. I just couldn't work out where mmap came in.

I don't think we were planning to use mmap anywhere.

 Design seemed relatively easy from there: put local lock table in
 shared memory for all procs. We then have a use_strong_lock at proc
 and at transaction level. Anybody that wants a strong lock first sets
 use_strong_lock at proc and transaction level, then copies all local
 lock data into shared lock table, double checking
 TransactionIdIsInProgress() each time. Then queues for lock using the
 now fully set up shared lock table. When transaction with strong lock
 completes we do not attempt to reset transaction level boolean, only
 at proc level, since DDL often occurs in groups and we want to avoid
 flip-flopping quickly between lock share states. Cleanup happens by
 regularly by bgwriter, perhaps every 10 seconds or so. All locks are
 still visible for pg_locks.

 I'm not following this...

 Which bit aren't you following? It's a design outline for how to
 implement, deliberately brief to allow a discussion of design
 alternatives.

Well, OK:

1. I don't think putting the local lock table in shared memory is a
good idea both for performance (keeping it uncontended has value) and
memory management (it would increase shared memory requirements quite
a bit).  The design Noah and I were discussing upthread uses only a
small and fixed amount of shared memory for each backend, and
preserves the local lock table as an unshared resource.

2. You haven't explained the procedure for acquiring a weak lock at all,
and in particular how a weak locker would be able to quickly determine
whether a conflicting strong lock was potentially present.

3. I don't understand the proposed procedure for acquiring a strong
lock either; in particular, I don't see why
TransactionIdIsInProgress() would be relevant.  The lock manager
doesn't really do anything with transaction IDs now, and you haven't
offered any explanation of why that would be necessary or advisable.

4. You haven't explained what the transaction-level boolean would
actually do.  It's not even clear whether you intend for that to be
kept in local or shared memory.  It's also unclear what you intend for
bgwriter to clean up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci m_li...@yahoo.it wrote:
 Da: Simon Riggs si...@2ndquadrant.com
 I can't find a clear discussion of what you are trying to do, and how,
 just a URL back to a complex discussion on another topic.


 While trying to write a patch to allow changing an unlogged table into
 a logged one, I had to add another int field to xl_xact_commit.
 Robert Haas said:

  I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.


 which is a correct remark.

 xl_xact_commit can contain some arrays (relations to drop,
 committed sub-trans, shared invalidation msgs). The length of
 these arrays is specified using 3 ints in the struct.

 So, to avoid adding more ints to the struct, it was suggested that I
 remove all the ints and use xl_xact_commit.xinfo to flag which
 arrays are, in fact, present.

 So the whole idea is:

 - remove nrels, nsubxacts and nmsgs from xl_xact_commit
 - use bits in xinfo to signal which arrays are present at the end
 of xl_xact_commit
 - for each present array, add the length of the array (as an int) at
 the end of xl_xact_commit
 - add each present array after all the lengths


OK, thats clear. Thanks.

That formatting sounds quite complex.

I would propose we split this into 2 WAL records: xl_xact_commit and
xl_xact_commit_with_info

xl_xact_commit doesn't have any flags, counts or arrays.

xl_xact_commit_with_info always has all 3 counts, even if zero.
Arrays follow the main record.

I think it might also be possible to remove dbId and tsId from
xl_xact_commit if we use that definition.
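
For concreteness, the two record layouts being proposed might end up looking
something like the sketch below; the field names and types are purely
illustrative, not actual PostgreSQL structs:

/*
 * Rough sketch of the two commit-record layouts discussed above.
 * Illustrative only.
 */
#include <stdint.h>

typedef int64_t TimestampTzStub;        /* stand-in for TimestampTz */

/* plain commit: no flags, counts or arrays at all */
typedef struct xl_xact_commit_stub
{
    TimestampTzStub xact_time;          /* time of commit */
} xl_xact_commit_stub;

/* commit with extra payload: all three counts always present, even if zero */
typedef struct xl_xact_commit_with_info_stub
{
    TimestampTzStub xact_time;          /* time of commit */
    uint32_t        xinfo;              /* info flag bits */
    int32_t         nrels;              /* number of relations to drop */
    int32_t         nsubxacts;          /* number of committed subxact XIDs */
    int32_t         nmsgs;              /* number of shared inval messages */
    /* the three variable-length arrays follow, in the order above */
} xl_xact_commit_with_info_stub;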

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Florian Helmberger f...@25th-floor.com writes:
 On 25.05.11 04:47, Tom Lane wrote:
 Florian Helmberger f...@25th-floor.com  writes:
 I'm running a production database with PostgreSQL 9.0.3 (64-bit) on
  Debian 5.0.4 and have an issue with a TOAST table and far too frequent
 autovacuum runs.
 
 I think I've pinned the problem down to the values pg_class holds for
 the affected TOAST table:
 
 relpages  | 433596
 reltuples | 1868538
 
 These values are significantly too low. Interestingly, the autovacuum
 logout reports the correct values:
 
 pages: 0 removed, 34788136 remain
 tuples: 932487 removed, 69599038 remain
 
 but these aren't stored in pg_class after each run.

 That's exceedingly weird.  Do the pg_stat_all_tables columns update
 after autovacuums on that table?

 Yes they do:

I think I see what must be going on here: that toast table must contain
a long run of all-visible-according-to-the-VM pages (which is hardly a
surprising situation).  This results in VACUUM choosing not to update
the pg_class entry:

/*
 * Update statistics in pg_class.  But only if we didn't skip any pages;
 * the tuple count only includes tuples from the pages we've visited, and
 * we haven't frozen tuples in unvisited pages either.  The page count is
 * accurate in any case, but because we use the reltuples / relpages ratio
 * in the planner, it's better to not update relpages either if we can't
 * update reltuples.
 */
if (vacrelstats->scanned_all)
    vac_update_relstats(onerel,
                        vacrelstats->rel_pages, vacrelstats->rel_tuples,
                        vacrelstats->hasindex,
                        FreezeLimit);

For an ordinary table this wouldn't be fatal because we'll still do an
ANALYZE from time to time, and that will update the entries with new
(approximate) values.  But we never run ANALYZE on toast tables.

And this would *still* be okay, because as noted in the comment, the
planner only depends on the ratio being roughly correct, not on either
individual value being current.  But autovacuum didn't get the memo;
it thinks it can use reltuples to make decisions.

I can see two basic approaches we might take here:

1. Modify autovacuum to use something from the stats collector, rather
than reltuples, to make its decisions.  I'm not too clear on why
reltuples is being used there anyway; is there some good algorithmic or
statistical reason why AV should be looking at a number from the last
vacuum?

2. Revise the vacuum code so that it doesn't skip updating the pg_class
entries.  We could have it count the number of pages it skipped, rather
than just keeping a bool, and then scale up the rel_tuples count to be
approximately right by assuming the skipped pages have tuple density
similar to the scanned ones.
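
A minimal sketch of that scaling, under the stated assumption that skipped
pages have roughly the same tuple density as the scanned ones (illustrative
only, not code from vacuumlazy.c):

static double
extrapolate_reltuples(double scanned_pages, double skipped_pages,
                      double counted_tuples)
{
    double      total_pages = scanned_pages + skipped_pages;

    if (scanned_pages <= 0)
        return counted_tuples;  /* nothing scanned, nothing to extrapolate */

    /* density observed on the scanned pages, applied to every page */
    return counted_tuples * (total_pages / scanned_pages);
}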

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/26 Tom Lane t...@sss.pgh.pa.us:
 Hitoshi Harada umi.tan...@gmail.com writes:
 2011/5/25 Hitoshi Harada umi.tan...@gmail.com:
 So I'm still
 thinking which of pulling up and parameterized scan is better.

 After more investigation I came up with third idea, pushing down
 RangeTblEntry to aggregate subquery. This sounds like a crazy idea,

 Yes, it sure does.  Won't that typically change the number of rows
 produced by the subquery's jointree?  And thus possibly change the
 aggregate's result?

No, because it will be restricted to assumption that join qual ==
grouping key and outer relation is unique on the var in that join
qual. Pushing down the baserel to sub-aggregate is equivalent to
pulling up sub-aggregate eventually if there are only two relations
(which is my first example.) The difference is if the optimizer needs
to care about other relations which may change the number of rows
produced.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 While looking at fixing the multiple-evaluation issue in IN and BETWEEN 
 discussed a while ago, I realized that the current assumption that only 
 one CaseTestExpr placeholder needs to be valid at any given time is not 
 true.

[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?

 So I'm going to put the BETWEEN/IN fix aside for now, and refactor the 
 placeholder infrastructure to handle several simultaneous placeholders, 

That sounds like a mess, and I'm not even convinced it'll solve the
problem ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Volunteering as Commitfest Manager

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:32 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 25, 2011 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 I hear that CF manager is a difficult role for a single individual.
 So it makes sense to split that role between multiple people.

 I volunteer to be the CF manager for Replication, and also for
 Performance. ...
 Patches need an eventual committer anyway, so this is a reasonable way
 of having the process be managed by the eventual committer.

 ISTM that we really want the CF manager to be somebody who is *not*
 directly involved in reviewing or committing patches.  It's a distinct
 skill set, so there is no reason why it's a good idea for a committer
 to do it.  And we need to get the CF work load more spread out, not more
 concentrated.

 I agree it makes sense if a non-committer performs the role. If a
 committer does take the role, I would volunteer to split the role and
 for me to work on the suggested areas.

I think it would be really valuable for you to get more involved in
reviewing some of the patches, whether you end up doing any of the
CommitFest management tasks or not.  In the past, your involvement has
been light; but we have a limited number of people who are qualified
to do detailed technical reviews of complicated patches, and more help
is very much needed.

I share the opinion expressed upthread that in an ideal world, we
would not have major reviewers or committers also doing CommitFest
management work, but in some sense that's an ideal world that we don't
live in.  Greg Smith and Kevin Grittner, both of whom have
successfully managed CommitFests in the past, also contribute in many
other ways, and I would not want to say either that those other
contributions disqualify them from managing CommitFests, or that
because they are managing the CommitFest they should suspend their
other involvement.  Either rule would be a random bureaucratic
obstacle that accomplishes nothing useful.  So I don't see the fact
that you are a committer as a reason why you can't or shouldn't help
with CommitFest management - if, for example, you don't want to review
patches yourself, but you do want to help organize other people to
volunteer to review patches, we should accept the contribution that
you're willing to make rather than insisting that you should make some
other contribution instead.

So, what I think is important is to understand exactly what you'd like
to volunteer to do, and see whether that fits in well with what other
people want to do.  Typically, at the beginning of the CommitFest,
there are a bunch of people who volunteer to be assigned a patch.
This task is likely best done by one person who has the entire
CommitFest in mind - in particular, I have found that it's a good idea
to tackle the most significant patches first; the little stuff can be
picked off at the end without much trouble, and is rarely what holds
the process up.  However, all of the other CommitFest management tasks
can be easily split across multiple people.  The main task, and where
the vast majority of the work goes, is in following up on patches
where discussion has died off or failed to reach a conclusion, and
encouraging the patch author to resubmit or the reviewer to re-review
or a committer to consider it for commit or just marking it Returned
with Feedback if there's insufficient consensus and/or time and/or
motivation to get it finished up in a month's time.  We've allowed
this to become the problem of a single and rather monolithic
individual, but that system sucks, and I have no desire to perpetuate
it.  It sucks for that individual because it's a lot of work, and it's
difficult to keep many unrelated patches in your head at the same
time; and it sucks for everyone else, because about half the time that
one monolithic individual finds that the rest of their life interferes
with PostgreSQL, or the other things they are doing for PostgreSQL
interfere with their role as CommitFest manager, and they don't do it
as well as it needs to be done to really make things run smoothly.

Therefore, if you'd like to volunteer to help keep on top of the
replication and performance patches, and make sure that they are
getting followed up on regularly, I think that would be excellent.  We
tried before having a position of "patch-chaser" or "assistant
commitfest manager by topic" for the September 2009 CommitFest, but
it wasn't entirely successful.  It's time to revisit that concept,
because we're not always going to find one person who can devote 40+
hours to managing a CommitFest, and even when we can, it's not exactly
accomplishing our goal of decentralizing the work.  It's better than
the "Tom do everything" approach, and we do have a lot of people
helping review now, but there are still two or three people per
CommitFest burning a lot of midnight oil.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Wed May 25 11:47:52 -0400 2011:

 I think I see what must be going on here: that toast table must contain
 a long run of all-visible-according-to-the-VM pages (which is hardly a
 surprising situation).  This results in VACUUM choosing not to update
 the pg_class entry:
 
 /*
  * Update statistics in pg_class.  But only if we didn't skip any pages;
  * the tuple count only includes tuples from the pages we've visited, and
  * we haven't frozen tuples in unvisited pages either.  The page count is
  * accurate in any case, but because we use the reltuples / relpages ratio
  * in the planner, it's better to not update relpages either if we can't
  * update reltuples.
  */
 if (vacrelstats->scanned_all)
     vac_update_relstats(onerel,
                         vacrelstats->rel_pages, vacrelstats->rel_tuples,
                         vacrelstats->hasindex,
                         FreezeLimit);
 
 For an ordinary table this wouldn't be fatal because we'll still do an
 ANALYZE from time to time, and that will update the entries with new
 (approximate) values.  But we never run ANALYZE on toast tables.

Ouch.

 I can see two basic approaches we might take here:
 
 1. Modify autovacuum to use something from the stats collector, rather
 than reltuples, to make its decisions.  I'm not too clear on why
 reltuples is being used there anyway; is there some good algorithmic or
 statistical reason why AV should be looking at a number from the last
 vacuum?

It uses reltuples simply because it was what the original contrib code
was using.  Since pgstat was considerably weaker at the time, reltuples
might have been the only thing available.  It's certainly the case that
pgstat has improved a lot since autovacuum got in, and some things have
been revised but not this one.
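
For reference, the dependency looks roughly like this today; this is a
paraphrase along the lines of relation_needs_vacanalyze() in autovacuum.c,
not the literal code. The dead-tuple count comes from the stats collector,
but the threshold it is compared against is scaled by pg_class.reltuples,
so a stale reltuples directly skews the scheduling:

#include <stdbool.h>

static bool
needs_vacuum(double reltuples,      /* pg_class.reltuples */
             double dead_tuples,    /* pgstat dead-tuple count */
             double base_thresh,    /* autovacuum_vacuum_threshold */
             double scale_factor)   /* autovacuum_vacuum_scale_factor */
{
    double      vacthresh = base_thresh + scale_factor * reltuples;

    return dead_tuples > vacthresh;
}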

 2. Revise the vacuum code so that it doesn't skip updating the pg_class
 entries.  We could have it count the number of pages it skipped, rather
 than just keeping a bool, and then scale up the rel_tuples count to be
 approximately right by assuming the skipped pages have tuple density
 similar to the scanned ones.

Hmm, interesting idea.  This would be done only for toast tables, or all
tables?

At this point I only wonder why we store tuples & pages rather than just
live tuple density.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:13 AM, Bruce Momjian br...@momjian.us wrote:
  Idempotent does seem like the most promising idea.

 I tend to agree with you, but I'm worried it won't actually work out
 to a win.  By the time we augment the records with enough additional
 information we may have eaten up a lot of the benefit we were hoping
 to get.

 This is where I was confused.  Our bad case now is when someone modifies
 one row on a page between checkpoints --- instead of writing 400 bytes,
 we write 8400.  What portion of between-checkpoint activity writes more
 than a few rows to a page?  I didn't think many, except for COPY.
 Ideally we could switch in and out of this mode per page, but that seems
 super-complicated.

Well, an easy to understand example would be a page that gets repeated
HOT updates.  We'll do this: add a tuple, add a tuple, add a tuple,
add a tuple, HOT cleanup, add a tuple, add a tuple, add a tuple, add a
tuple, HOT cleanup... and so on.  In the worst case, that could be
done many, many times between checkpoints that might be up to an hour
apart.  The problem can also occur (with a little more difficulty)
even without HOT.  Imagine a small table with lots of inserts and
deletes.  Page fills up, some rows are deleted, vacuum frees up space,
page fills up again, some more rows are deleted, vacuum frees up space
again, and so on.

But you raise an interesting point, which is that it might also be
possible to reduce the impact of write-ahead logging in other ways.
For example, if we're doing a large COPY into a table, we could buffer
up a full block of tuples and then just emit an FPI for the page.
This would likely be cheaper than logging each tuple individually.
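(Very rough arithmetic: logging a page's worth of COPY tuples individually
costs roughly a page worth of tuple data in individual records, plus
per-record header overhead, plus, with full_page_writes on, a full-page image
for the first touch after a checkpoint anyway; a single consolidated FPI for
the filled page is therefore somewhere around half the WAL volume.)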

In fact, you could imagine keeping a queue of pending WAL for each
block in shared buffers.  You don't really need that WAL to be
consolidated into a single stream until either (a) you need to write
the block or (b) you need to commit the transaction.  When one of
those things happens, you can decide at that point whether it's
cheaper to emit the individual records or do some sort of
consolidation.  Doing it in exactly that way is probably impractical,
because every backend that wants to commit would have to make a sweep
of every buffer it's dirtied and see if any of them still contain WAL
that needs to be shoved into the main queue, and that would probably
suck, especially for small transactions.  But maybe there's some
variant that could be made to work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 2011/5/25 Hitoshi Harada umi.tan...@gmail.com:
 So I'm still
 thinking which of pulling up and parameterized scan is better.

 After more investigation I came up with third idea, pushing down
 RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
 but it seems to me like it is slightly easier than pulling up
 agg-subquery. The main point is that when you want to pull up, you
 must care if the final output would be correct with other join
 relations than the aggregate-related one. In contrast, when pushing
 down the target relation to agg-subquery it is simple to ensure the
 result.

 I'm looking around pull_up_subqueries() in subquery_planner() to add
 the pushing down logic. It could be possible to do it around
 make_one_rel() but I bet query structure changes are doable against
 Query, not PlannerInfo.

How do you decide whether or not to push down?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing overhead of frequent table locks

2011-05-25 Thread Bruce Momjian
Simon Riggs wrote:
 On Tue, May 24, 2011 at 6:37 PM, Robert Haas robertmh...@gmail.com wrote:
 
  That being said, it's a slight extra cost for all fast-path lockers to 
  benefit
  the strong lockers, so I'm not prepared to guess whether it will pay off.
 
  Yeah.  Basically this entire idea is about trying to make life easier
  for weak lockers at the expense of making it more difficult for strong
  lockers.  I think that's a good trade-off in general, but we might
  need to wait until we have an actual implementation to judge whether
  we've turned the dial too far.
 
 I like this overall concept and like the way this has been described
 with strong and weak locks. It seems very useful to me, since temp
 tables can be skipped. That leaves shared DDL and we have done lots to
 reduce the lock levels held and are looking at further reductions
 also. I think even quite extensive delays are worth the trade-off.
 
 I'd been looking at this also, though hadn't mentioned it previously
 because I found an Oracle patent that discusses dynamically turning on
 and off locking. So that's something to be aware of. IMHO if we
 discuss this in terms of sharing/not sharing locking information then
 it is sufficient to avoid the patent. That patent also discusses that
 the locking state change needs to wait longer than required.

FYI, I thought we had agreed not to look at any patents because looking
at them might cause us more problems than not looking at them.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 11:47 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 2. Revise the vacuum code so that it doesn't skip updating the pg_class
 entries.  We could have it count the number of pages it skipped, rather
 than just keeping a bool, and then scale up the rel_tuples count to be
 approximately right by assuming the skipped pages have tuple density
 similar to the scanned ones.

This approach doesn't seem like a good idea to me.  The skipped
portions of the table are the ones that haven't been modified in a
while, so this is, or embeds, an assumption that the tuples in the hot
and cold portions of the table are the same size.  That might be true,
but it isn't hard to think of cases where it might not be.  There
could also very easily be sampling error, because it's easy to imagine
a situation where 99% of the table is getting skipped.  Any error that
creeps into the estimate is going to get scaled up along with the
estimate itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of Wed May 25 11:47:52 -0400 2011:
 I can see two basic approaches we might take here:
 
 1. Modify autovacuum to use something from the stats collector, rather
 than reltuples, to make its decisions.  I'm not too clear on why
 reltuples is being used there anyway; is there some good algorithmic or
 statistical reason why AV should be looking at a number from the last
 vacuum?

 It uses reltuples simply because it was what the original contrib code
 was using.  Since pgstat was considerably weaker at the time, reltuples
 might have been the only thing available.  It's certainly the case that
 pgstat has improved a lot since autovacuum got in, and some things have
 been revised but not this one.

On reflection I'm hesitant to do this, especially for a backpatched bug
fix, because it would be changing the feedback loop behavior for
autovacuum scheduling.  That could have surprising consequences.

 2. Revise the vacuum code so that it doesn't skip updating the pg_class
 entries.  We could have it count the number of pages it skipped, rather
 than just keeping a bool, and then scale up the rel_tuples count to be
 approximately right by assuming the skipped pages have tuple density
 similar to the scanned ones.

 Hmm, interesting idea.  This would be done only for toast tables, or all
 tables?

I'm thinking just do it for all.  The fact that these numbers don't
necessarily update after a vacuum is certainly surprising in and of
itself, and it did not work that way before the VM patch went in.
I'm concerned about other stuff besides AV not dealing well with
obsolete values.

 At this point I only wonder why we store tuples & pages rather than just
 live tuple density.

It's just for backwards compatibility.  I've thought about doing that in
the past, but I don't know what client-side code might be looking at
relpages/reltuples.  It's not like collapsing them into one field would
save much, anyway.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 I'm confused.  A major point of the approach I was proposing was to
 avoid having to move tuples around.

 Well, I am not sure how you can always guarantee to make space
 available for the LSN without moving tuples, irrespective of where
 you store it.  But probably its not important as we discussed below.

Well, if we just defragged the page, then we should be guaranteed that
pd_lower + 8 < pd_upper.  No?

 I think if we are reclaiming LP_DEAD line pointers only while
 defragmenting the page, we can always reclaim the space for the LSN,
 irrespective of where we store it. So maybe we should decide
 depending on what would matter for on-disk compatibility and whatever
 requires the least invasive changes. I don't know what that is yet.

If we store the LSN at the beginning (page header) or end (special
space) of the page, then we'll have to adjust the location of the data
which follows it or precedes it if and when we want to reclaim the
space it occupies.  But if we store it in the hole in the middle
somewhere (probably by reducing the size of the hole, not by actually
referencing the area between pd_lower and pd_upper) then no tuples or
item pointers have to move around when we want to reclaim the space.
That way, we need only an exclusive lock (not a cleanup lock), and we
don't need to memmove() anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I don't know what client-side code might be looking at
 relpages/reltuples.
 
I know that I find reltuples useful for getting an accurate enough
sense of rows in a table (or set of tables) without resorting to
count(*).  I'd be OK with any two of pages, tuples, and density; but
dropping to one would make databases harder to manage, IMV.
 
Personally, I don't have much code that uses those columns;
eliminating an existing column wouldn't involve much pain for me as
long as it could still be derived.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Alvaro Herrera
Excerpts from Bruce Momjian's message of Tue May 24 15:59:59 -0400 2011:

  I think you misread what I wrote, or I misexplained it, but never
  mind.  Matching locale names case-insensitively sounds reasonable to
  me, unless someone has reason to believe it will blow up.
 
 OK, that's what I needed to hear.  I have applied the attached patch,
 but only to 9.1 because  of the risk of breakage. (This was only the
 first bug report of this, and we aren't 100% certain about the case
 issue.)

Hmm, I know the locale can be spelled en_US.UTF-8 (note dash) in some
systems.  So if you have a locale alias that makes en_US.utf8 the same
as en_US.UTF-8, the patched code would still not work.  I wonder if
there's a need for canonicalization somewhere.  Or maybe this is just
not worth the trouble.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, May 25, 2011 at 11:47 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 2. Revise the vacuum code so that it doesn't skip updating the pg_class
 entries.  We could have it count the number of pages it skipped, rather
 than just keeping a bool, and then scale up the rel_tuples count to be
 approximately right by assuming the skipped pages have tuple density
 similar to the scanned ones.

 This approach doesn't seem like a good idea to me.  The skipped
 portions of the table are the ones that haven't been modified in a
 while, so this is, or embeds, an assumption that the tuples in the hot
 and cold portions of the table are the same size.  That might be true,
 but it isn't hard to think of cases where it might not be.  There
 could also very easily be sampling error, because it's easy to imagine
 a situation where 99% of the table is getting skipped.

Yeah, I had been thinking about the latter point.  We could be
conservative and just keep the reported tuple density the same (ie,
update relpages to the new correct value, while setting reltuples so
that the density ratio doesn't change).  But that has its own problems
when the table contents *do* change.  What I'm currently imagining is
to do a smoothed moving average, where we factor in the new density
estimate with a weight dependent on the percentage of the table we did
scan.  That is, the calculation goes something like

old_density = old_reltuples / old_relpages
new_density = counted_tuples / scanned_pages
reliability = scanned_pages / new_relpages
updated_density = old_density + (new_density - old_density) * reliability
new_reltuples = updated_density * new_relpages

We could slow the moving-average convergence even further when
reliability is small; for instance if you were really paranoid you might
want to use the square of reliability in line 4.  That might be
overdoing it, though.
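
Spelled out as code, the smoothing above is just the following; a direct,
illustrative transcription of the five lines, with the divide-by-zero guards
left to the caller:

static double
smoothed_reltuples(double old_reltuples, double old_relpages,
                   double counted_tuples, double scanned_pages,
                   double new_relpages)
{
    double      old_density = old_reltuples / old_relpages;
    double      new_density = counted_tuples / scanned_pages;
    double      reliability = scanned_pages / new_relpages;    /* 0..1 */
    double      updated_density = old_density +
                                  (new_density - old_density) * reliability;

    return updated_density * new_relpages;
}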

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
@pavan - thanks a lot. Will try it when I go to the desktop next time.

--
Sent from my Android
On 25 May 2011 15:33, Pavan Deolasee pavan.deola...@gmail.com wrote:
 On Wed, May 25, 2011 at 3:22 PM, Vaibhav Kaushal 
 vaibhavkaushal...@gmail.com wrote:

 I see that the target list to be scanned is handled by ExecTargetList
 function.

 I am not so sure about this because this function is not listed in the
 list of functions which GDB shows me (./configure --enable-debug &&
 make clean && make && make install). Rest everything else (almost /
 perhaps) is shown! (Can anyone tell why? :( )


 You would need to turn optimization off by passing the -O0 flag to the
 compiler. Otherwise static functions may get optimized away and you may not
 see them in the gdb stack.

 CFLAGS=-O0 ./configure --enable-debug

 Thanks,
 Pavan

 --
 Pavan Deolasee
 EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] errno not set in case of libm functions (HPUX)

2011-05-25 Thread Ibrar Ahmed
Please find the updated patch. I have added the +Olibmerrno compile flag
check to the configure/configure.in files.


OS

HP-UX B.11.31 U ia64

without patch
---
postgres=# select acos(2);
 acos
--
  NaN
(1 row)


with patch
---
postgres=# select acos(2);
ERROR:  input is out of range
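
For anyone wanting to reproduce the libm behaviour outside the server, a
standalone test along these lines shows it; the backend's float code does
essentially this errno check before raising the error above, and on the
affected compiler errno stays 0 for acos(2.0) unless +Olibmerrno is used:

#include <stdio.h>
#include <errno.h>
#include <math.h>

int
main(void)
{
    double      result;

    errno = 0;
    result = acos(2.0);         /* domain error: argument outside [-1, 1] */

    if (errno != 0)
        printf("errno set (%d): the backend would report the error\n", errno);
    else
        printf("errno not set: result is %g, error goes unreported\n", result);
    return 0;
}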



On Tue, May 24, 2011 at 11:13 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  So the default is indeed non-standard. But I wonder if we should use -Aa
  instead?

 Probably not; at least on older HPUX versions, -Aa turns off access to
 assorted stuff that we do want, eg long long.  man cc on my box
 saith

 -Amode Specify the compilation standard to be used by the
compiler.  mode can be one of the following letters:

                           c    (Default) Compile in a mode compatible with
HP-UX releases prior to 7.0.  (See The C
Programming Language, First Edition by
Kernighan and Ritchie).  This option also
defines the symbol _HPUX_SOURCE and allows the
user to access macros and typedefs provided by
the HPUX Operating System. The default
compilation mode may change in future releases.

                           a    Compile under ANSI mode (ANSI programming
language C standard ISO 9899:1990).  When
compiling under ANSI mode, the header files
would define only those names (macros and
typedefs) specified by the Standard. To access
macros and typedefs that are not defined by the
ANSI Standard but are provided by the HPUX
Operating System, define the symbol
_HPUX_SOURCE; or use the extension option
described below.

                           e    Extended ANSI mode.  Same as -Aa -D_HPUX_SOURCE
+e.  This would define the names (macros and
typedefs) provided by the HPUX Operating System
and, in addition, allow the following
extensions: $ characters in identifier names,
sized enums, sized bit-fields, and 64-bit
integral type long long.  Additional extensions
may be added to this option in the future.

 The +e option is elsewhere stated to mean

  +e    Enables HP value-added features while compiling
in ANSI C mode, -Aa.  This option is ignored
with -Ac because these features are already
provided.  Features enabled:

 o  Long pointers
 o  Integral type specifiers can appear in
enum declarations.
 o  The $ character can appear in
identifier names.
 o  Missing parameters on intrinsic calls

 which isn't 100% consistent with what it says under -Ae, so maybe some
 additional experimentation is called for.  But anyway, autoconf appears
 to think that -Ae is preferable to the combination -Aa -D_HPUX_SOURCE
 (that choice is coming from autoconf not our own code); so I'm not
 optimistic that we can get more-standard behavior by overriding that.

regards, tom lane




-- 
   Ibrar Ahmed
diff --git a/configure b/configure
index 3b23c46..d448534 100755
--- a/configure
+++ b/configure
@@ -4468,6 +4468,66 @@ if test x"$pgac_cv_prog_cc_cflags__qnoansialias" = x"yes"; then
   CFLAGS="$CFLAGS -qnoansialias"
 fi
 
+elif test "$PORTNAME" = "hpux"; then
+  # HP-UX libm functions on 'Integrity server' do not set errno by default,
+  # for errno setting, compile with the +Olibmerrno option.
+  { $as_echo "$as_me:$LINENO: checking whether $CC supports +Olibmerrno" >&5
+$as_echo_n "checking whether $CC supports +Olibmerrno... " >&6; }
+if test "${pgac_cv_prog_cc_cflags_pOlibmerrno+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS +Olibmerrno"
+cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext
+if { (ac_try="$ac_compile"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_compile") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f 

Re: [HACKERS] New/Revised TODO? Gathering actual read performance data for use by planner

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:17 AM, Greg Smith g...@2ndquadrant.com wrote:
 This data would probably need to be kept separately for each table or
 index, as some tables or indexes
 may be mostly or fully in cache or on faster physical media than others,
 although in the absence of other
 data about a specific table or index, data about other relations in the
 same tablespace might be of some use.

 This is the important part.  Model how the data needs to get stored such
 that the optimizer can make decisions using it, and I consider it easy to
 figure out how it will get populated later.

I basically agree.  There have been several recent discussions of this
topic on both -hackers and -performance; it is likely that the TODO
needs to be updated with some more recent links.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed May 25 12:27:38 -0400 2011:
 On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:

  I think if we are reclaiming LP_DEAD line pointers only while
  defragmenting the page, we can always reclaim the space for the LSN,
  irrespective of where we store it. So maybe we should decide
  depending on what would matter for on-disk compatibility and whatever
  requires the least invasive changes. I don't know what that is yet.
 
 If we store the LSN at the beginning (page header) or end (special
 space) of the page, then we'll have to adjust the location of the data
 which follows it or precedes it if and when we want to reclaim the
 space it occupies.  But if we store it in the hole in the middle
 somewhere (probably by reducing the size of the hole, not by actually
 referencing the area between pd_lower and pd_upper) then no tuples or
 item pointers have to move around when we want to reclaim the space.
 That way, we need only an exclusive lock (not a cleanup lock), and we
 don't need to memmove() anything.

You can store anything in the hole in the data area (currently we only
store tuple data), but then you'd have to store the offset to where you
stored it in some fixed location, and make sure you adjust pd_upper/lower
so that the next tuple does not overwrite it.  You'd still need space
in either page header or special space.  I don't see how you'd store
anything in the hole without it being in a fixed place, where it would
eventually be hit by either the line pointer array or tuple data.


As far as the general idea of this thread goes, I remember thinking
about exactly this topic a couple of months ago -- I didn't get this far
though, so thanks for fleshing out some details and getting the ball
rolling.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, I had been thinking about the latter point.  We could be
 conservative and just keep the reported tuple density the same (ie,
 update relpages to the new correct value, while setting reltuples so
 that the density ratio doesn't change).  But that has its own problems
 when the table contents *do* change.  What I'm currently imagining is
 to do a smoothed moving average, where we factor in the new density
 estimate with a weight dependent on the percentage of the table we did
 scan.  That is, the calculation goes something like

 old_density = old_reltuples / old_relpages
 new_density = counted_tuples / scanned_pages
 reliability = scanned_pages / new_relpages
 updated_density = old_density + (new_density - old_density) * reliability
 new_reltuples = updated_density * new_relpages

 We could slow the moving-average convergence even further when
 reliability is small; for instance if you were really paranoid you might
 want to use the square of reliability in line 4.  That might be
 overdoing it, though.

I don't know.  That's maybe better, but I'd be willing to wager that
in some cases it will just slow down the rate at which we converge to
a completely incorrect value, while in other cases it'll fail to
update the data when it really has changed.

I am wondering, though, why we're not just inserting a special-purpose
hack for TOAST tables.  Your email seems to indicate that regular
tables are already handled well enough, and certainly if we only whack
around the TOAST behavior it's much less likely to fry anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Kevin Grittner's message of Wed May 25 12:37:24 -0400 2011:
 Tom Lane t...@sss.pgh.pa.us wrote:
  
  I don't know what client-side code might be looking at
  relpages/reltuples.
  
 I know that I find reltuples useful for getting an accurate enough
 sense of rows in a table (or set of tables) without resorting to
 count(*).  I'd be OK with any two of pages, tuples, and density; but
 dropping to one would make databases harder to manage, IMV.
  
 Personally, I don't have much code that uses those columns;
 eliminating an existing column wouldn't involve much pain for me as
 long as it could still be derived.

Well, we only actually need to store one number, because you can figure
out a much more precise number-of-pages figure with pg_relation_size()
divided by the configured page size.

(We feel free to hack around catalogs in other places, and we regularly
break pgAdmin and the like when we drop columns -- people just live with
it and update their tools.  I don't think it's such a big deal in this
particular case.)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, May 25, 2011 at 12:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, I had been thinking about the latter point.  We could be
 conservative and just keep the reported tuple density the same (ie,
 update relpages to the new correct value, while setting reltuples so
 that the density ratio doesn't change).  But that has its own problems
 when the table contents *do* change.  What I'm currently imagining is
 to do a smoothed moving average, where we factor in the new density
 estimate with a weight dependent on the percentage of the table we did
 scan.  That is, the calculation goes something like
 
 old_density = old_reltuples / old_relpages
 new_density = counted_tuples / scanned_pages
 reliability = scanned_pages / new_relpages
 updated_density = old_density + (new_density - old_density) * reliability
 new_reltuples = updated_density * new_relpages
 
 We could slow the moving-average convergence even further when
 reliability is small; for instance if you were really paranoid you might
 want to use the square of reliability in line 4.  That might be
 overdoing it, though.

 I don't know.  That's maybe better, but I'd be willing to wager that
 in some cases it will just slow down the rate at which we converge to
 a completely incorrect value, while in other cases it'll fail to
 update the data when it really has changed.

[ shrug... ]  When you don't have complete information, it's *always*
the case that you will sometimes make a mistake.  That's not
justification for paralysis, especially not when the existing code is
demonstrably broken.

What occurs to me after thinking a bit more is that the existing tuple
density is likely to be only an estimate, too (one coming from the last
ANALYZE, which could very well have scanned even less of the table than
VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
to use a calculation of the above form to arrive at a new value for
pg_class.reltuples.  In both cases it would be pretty easy to track the
number of pages we looked at while counting tuples, so the same raw
information is available.

 I am wondering, though, why we're not just inserting a special-purpose
 hack for TOAST tables.

Because the problem is not specific to TOAST tables.  As things
currently stand, we will accept the word of an ANALYZE as gospel even if
it scanned 1% of the table, and completely ignore the results from a
VACUUM even if it scanned 99% of the table.  This is not sane.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:48 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of mié may 25 12:27:38 -0400 2011:
 On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:

  I think if are reclaiming LP_DEAD line pointers only while
  defragmenting the page, we can always reclaim the space for the LSN,
  irrespective of where we store it. So may be we should decide
  depending on what would matter for on-disk compatibility and whatever
  requires least invasive changes. I don't know what is that yet.

 If we store the LSN at the beginning (page header) or end (special
 space) of the page, then we'll have to adjust the location of the data
 which follows it or precedes it if and when we want to reclaim the
 space it occupies.  But if we store it in the hole in the middle
 somewhere (probably by reducing the size of the hole, not by actually
 referencing the area between pd_lower and pd_upper) then no tuples or
 item pointers have to move around when we want to reclaim the space.
 That way, we need only an exclusive lock (not a cleanup lock), and we
 don't need to memmove() anything.

 You can store anything in the hole in the data area (currently we only
 store tuple data), but then you'd have to store the offset to where you
 stored it in some fixed location, and make sure you adjust pd_upper/lower
 so that the next tuple does not overwrite it.  You'd still need space
 in either page header or special space.

You only need one bit of space in the page header, to indicate whether
the LSN is present or not.  And we have unused bit space there.

 I don't see how you'd store
 anything in the hole without it being in a fixed place, where it would
 eventually be hit by either the line pointer array or tuple data.

The point is that it doesn't matter.  Suppose we put it just after the
line pointer array.  Any time we're thinking about extending the line
pointer array, we already have an exclusive lock on the buffer.  And
if we already have a exclusive lock on the buffer, then we can reclaim
the dead line pointers and now we no longer need the saved LSN, so
writing over it is perfectly fine.

OK, I lied: if we have an exclusive buffer lock, but the last vacuum
either failed, or is still in progress, then the LSN might not be old
enough for us to reclaim the dead line pointers yet.  So ideally we'd
like to hold onto it.  We can do that by either (a) moving the LSN out
another 6 bytes, if there's enough room; or (b) deciding not to put
the new tuple on this page, after all.  There's no situation in which
we absolutely HAVE to get another tuple onto this particular page.  We
can just decide that the effective size of a page that contains dead
line pointers is effectively 8 bytes less.  The alternative is to eat
up 8 bytes of space on ALL pages, whether they contain dead line
pointers or not.
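
For illustration only, a rough C sketch of the borrow-space-from-the-hole idea. The struct, the flag name, and the exact placement are made up for this sketch; they are not PostgreSQL's actual page-header definitions:

    #include <stdint.h>
    #include <string.h>

    /*
     * Simplified stand-in for PostgreSQL's page header; the real layout
     * lives in bufpage.h and differs from this.  Only what the sketch
     * needs is shown.
     */
    typedef struct
    {
        uint64_t    pd_lsn;     /* page LSN (already exists today) */
        uint16_t    pd_flags;
        uint16_t    pd_lower;   /* end of the line pointer array */
        uint16_t    pd_upper;   /* start of tuple data */
        /* line pointers, hole, tuples ... */
    } SketchPageHeader;

    /* Hypothetical flag: an extra "dead line pointer" LSN is stored. */
    #define SKETCH_HAS_PRUNE_LSN    0x0010

    /*
     * Stash the LSN of the pruning operation just past the line pointer
     * array, stealing 8 bytes from the hole.  Caller holds at least an
     * exclusive lock and has checked that pd_upper - pd_lower >= 8.
     */
    static void
    sketch_save_prune_lsn(char *page, uint64_t prune_lsn)
    {
        SketchPageHeader *hdr = (SketchPageHeader *) page;

        memcpy(page + hdr->pd_lower, &prune_lsn, sizeof(prune_lsn));
        hdr->pd_lower += sizeof(prune_lsn);
        hdr->pd_flags |= SKETCH_HAS_PRUNE_LSN;
    }

    /* Read it back; returns 0 if no prune LSN is stored on the page. */
    static uint64_t
    sketch_get_prune_lsn(const char *page)
    {
        const SketchPageHeader *hdr = (const SketchPageHeader *) page;
        uint64_t    lsn = 0;

        if (hdr->pd_flags & SKETCH_HAS_PRUNE_LSN)
            memcpy(&lsn, page + hdr->pd_lower - sizeof(lsn), sizeof(lsn));
        return lsn;
    }

    /* Give the space back once the dead line pointers are reclaimed. */
    static void
    sketch_clear_prune_lsn(char *page)
    {
        SketchPageHeader *hdr = (SketchPageHeader *) page;

        if (hdr->pd_flags & SKETCH_HAS_PRUNE_LSN)
        {
            hdr->pd_lower -= sizeof(uint64_t);
            hdr->pd_flags &= ~SKETCH_HAS_PRUNE_LSN;
        }
    }

The point of this layout is just that saving and reclaiming the LSN only moves pd_lower; no tuple data has to be shifted.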

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas

On 25.05.2011 17:47, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

While looking at fixing the multiple-evaluation issue in IN and BETWEEN
discussed a while ago, I realized that the current assumption that only
one CaseTestExpr placeholder needs to be valid at any given time is not
true.


[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?


The mistake happens during planning, when the SQL function is inlined 
and pre-evaluated. It's a bit hard to see what happened once the 
planning is finished because the whole expression is folded into a 
constant, but here goes:


The original expression is:

  CASE now() WHEN 29 THEN 'foo' ELSE 'bar' END;

In parse analysis, it is turned into this:

  CASE WHEN CaseTestExpr = 29 THEN 'foo' ELSE 'bar' END;

where CaseTestExpr stands for the now(). Next the planner tries to 
simplify the WHEN condition, CaseTestExpr = 29. The equality operator 
is implemented by the evileq(timestamptz, int4) SQL function, defined as:


  CASE $2 WHEN length($1::text) THEN true ELSE false END;

That SQL-function is transformed at parse analysis into:

  CASE WHEN CaseTestExpr = length($1::text) THEN true ELSE false END;

This CaseTestExpr stands for the Param to the function, $2. When that 
transformed SQL function body is inlined into the outer WHEN clause, 
CaseTestExpr = 29, and Params are substituted, it becomes:


 CASE WHEN CaseTestExpr = length(CaseTestExpr::text) THEN true ELSE false END.

(you can see the expression tree for that if you print out 'newexpr' in 
inline_function(), just after the call to substitute_actual_parameters())


At this point it's easy to see that we have screwed up. The first 
CaseTestExpr stands for the inner CASE-value, which is $2, which stands 
for 29, and the second CaseTestExpr stands for the *outer* CASE-value, 
which is supposed to be now(). The planner cannot distinguish between 
the two anymore.


Both CaseTestExprs are then incorrectly replaced with constant 29, and 
the whole expression is constant-folded into 'bar'.



So I'm going to put the BETWEEN/IN fix aside for now, and refactor the
placeholder infrastructure to handle several simultaneous placeholders,


That sounds like a mess, and I'm not even convinced it'll solve the
problem ...


Hmm, it would solve the above if we can keep the CaseTestExprs 
separate. It's not quite as straightforward as I originally thought, as 
the parse analysis of the inlined SQL function needs to use placeholder 
numbers that are different from those used in the outer context. But it 
seems doable.



BTW, i just stumbled into this:

postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) 
THEN 'foo' ELSE 'bar' END;

ERROR:  unexpected CASE WHEN clause: 326

Looks like ruleutils.c is also not prepared for the case that the 
implicit equality operation gets inlined into something else than an OpExpr.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 25.05.2011 17:47, Tom Lane wrote:
 [ scratches head ... ]  Why does the save/restore in ExecEvalCase not
 take care of this?

 The mistake happens during planning, when the SQL function is inlined 
 and pre-evaluated.

Hm.  I'm inclined to think this may be more of a bug in the inlining
process than anything else.  I have to run off for a doctor's
appointment, but will look at this closer later today.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mié may 25 12:54:28 -0400 2011:

 I don't know.  That's maybe better, but I'd be willing to wager that
 in some cases it will just slow down the rate at which we converge to
 a completely incorrect value, while in other cases it'll fail to
 update the data when it really has changed.

For regular tables I don't think there's a real problem because it'll
get fixed next time a full scan happens anyway.  For toast tables, I
think the set of operations is limited enough that this is easy to prove
correct (or fixed so that it is) -- no HOT updates (in fact no updates
at all), etc.

BTW one thing we haven't fixed at all is how HOT updates affect
vacuuming behavior ...

 I am wondering, though, why we're not just inserting a special-purpose
 hack for TOAST tables.  Your email seems to indicate that regular
 tables are already handled well enough, and certainly if we only whack
 around the TOAST behavior it's much less likely to fry anything.

Well, having two paths one of which is uncommonly used means that it
will get all the bugs.  As with the xl_commit WAL record comment from
Simon, I'd rather stick with having a single one.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ shrug... ]  When you don't have complete information, it's *always*
 the case that you will sometimes make a mistake.  That's not
 justification for paralysis, especially not when the existing code is
 demonstrably broken.

 What occurs to me after thinking a bit more is that the existing tuple
 density is likely to be only an estimate, too (one coming from the last
 ANALYZE, which could very well have scanned even less of the table than
 VACUUM did).  So what I now think is that both VACUUM and ANALYZE ought
 to use a calculation of the above form to arrive at a new value for
 pg_class.reltuples.  In both cases it would be pretty easy to track the
 number of pages we looked at while counting tuples, so the same raw
 information is available.

 I am wondering, though, why we're not just inserting a special-purpose
 hack for TOAST tables.

 Because the problem is not specific to TOAST tables.  As things
 currently stand, we will accept the word of an ANALYZE as gospel even if
 it scanned 1% of the table, and completely ignore the results from a
 VACUUM even if it scanned 99% of the table.  This is not sane.

I agree that if VACUUM scanned 99% of the table, it's probably fine to
use its numbers.  It's also fine to use the numbers from ANALYZE,
because those pages are chosen randomly.  What bothers me is the idea
of using a small *non-random* sample, and I'm not sure that
incorporating possibly-bogus results slowly is any better than
incorporating them quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Greg Smith

On 05/24/2011 04:34 PM, Robert Haas wrote:

we could name this feature partial full page writes and enable it
either with a setting of full_page_writes=partial


+1 to overloading the initial name, but only if the parameter is named 
'maybe', 'sometimes', or 'perhaps'.


I've been looking into a similar refactoring of the names here, where we 
bundle all of these speed over safety things (fsync, full_page_writes, 
etc.) into one control so they're easier to turn off at once.  Not sure 
if it should be named web_scale or do_you_feel_lucky_punk.



3. Going a bit further, Greg proposed the idea of ripping out our
current WAL infrastructure altogether and instead just having one WAL
record that says these byte ranges on this page changed to have these
new contents.


The main thing that makes this idea particularly interesting to me, over 
the other two, is that it might translate well into the idea of using 
sync_file_range to aim for a finer fsync call on Linux than is currently 
possible.
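
For reference, a minimal sketch of a byte-range flush with Linux's sync_file_range(); the wrapper name is made up, and this is not how the backend currently issues fsyncs:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    /*
     * Ask Linux to write back, and wait on, just [offset, offset + nbytes)
     * of an already-open file descriptor instead of fsync()ing everything.
     * Returns 0 on success, -1 on error, like the underlying call.
     */
    static int
    flush_byte_range(int fd, off_t offset, off_t nbytes)
    {
        unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE |
                             SYNC_FILE_RANGE_WAIT_AFTER;

        if (sync_file_range(fd, offset, nbytes, flags) != 0)
        {
            perror("sync_file_range");
            return -1;
        }
        return 0;
    }

Note that sync_file_range() only controls writeback of dirty pages; it does not flush the drive cache or file metadata, so it can at best approximate a real fsync.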


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Alvaro Herrera wrote:
 Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011:
 
   I think you misread what I wrote, or I misexplained it, but never
   mind.  Matching locale names case-insensitively sounds reasonable to
   me, unless someone has reason to believe it will blow up.
  
  OK, that's what I needed to hear.  I have applied the attached patch,
  but only to 9.1 because  of the risk of breakage. (This was only the
  first bug report of this, and we aren't 100% certain about the case
  issue.)
 
 Hmm, I know the locale can be spelled en_US.UTF-8 (note dash) in some
 systems.  So if you have a locale alias that makes en_US.utf8 the same
 as en_US.UTF-8, the patched code would still not work.  I wonder if
 there's a need for canonicalization somewhere.  Or maybe this is just
 not worth the trouble.

I can easily remove dashes before the compare if people like that idea
--- I think you could argue that a dash is not significant, unless ab-c
and a-bc are different locales.
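
A minimal sketch of that comparison, assuming we ignore both case and dashes (the function name is mine, not something pg_upgrade currently has):

    #include <ctype.h>
    #include <stdbool.h>

    /*
     * Return true if two locale names match when we ignore case and any
     * '-' characters, so "en_US.UTF-8" and "en_US.utf8" compare equal.
     */
    static bool
    locale_names_equivalent(const char *a, const char *b)
    {
        for (;;)
        {
            while (*a == '-')
                a++;
            while (*b == '-')
                b++;
            if (*a == '\0' || *b == '\0')
                return *a == *b;    /* both strings must end together */
            if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
                return false;
            a++;
            b++;
        }
    }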

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
But somehow the execevalvar is being called. When i changed the logic of its
working to use slot_getattr instead of cute_datum_array for the first run /
call, everything just worked!

This would surely indicate that the function does get called at least once
before being called by the executor for the qual check. This is what got me
confused - where does the function get called? The gdb trace says its first
call comes through ExecutePlan but the results say something else. Moreover,
gdb still confuses me. Is there some GUI for gdb? :(

The only thing I am trying to make sure of is that ExecEvalVar gets the values
from cute_datum_array only when called during a scan, never before it. Maybe
there could be another way. I think I have to rethink this.
--
Sent from my Android
On 25 May 2011 20:34, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Vaibhav Kaushal's message of mié may 25 05:52:32 -0400
2011:
 If the above is confusing, I just want to ask: Is expression evaluator,
 even in part responsible for {PLANNEDSTMT creation?

 Yeah, as far as I understood Tom's talk, the expr evaluator is used to
 reduce some expressions to constants and such.

 The planner would never call it with an expression containing a Var,
 though.

 regards, tom lane


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Heikki Linnakangas

On 25.05.2011 20:11, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 25.05.2011 17:47, Tom Lane wrote:

[ scratches head ... ]  Why does the save/restore in ExecEvalCase not
take care of this?



The mistake happens during planning, when the SQL function is inlined
and pre-evaluated.


Hm.  I'm inclined to think this may be more of a bug in the inlining
process than anything else.


Well, if you want to get away without the capability to reference 
multiple CaseTestExprs at a time, you'll have to detect the danger and 
abort the inlining process. That's a bit pessimal, although this is a 
pretty artificial case in the first place so maybe we don't care much.


(I'm still going to need more placeholder slots to handle IN and 
BETWEEN. Of course, I can just copy-paste CaseTestExpr into something 
like InTestExpr and BetweenTestExpr, but it seems like it would be good 
to unite all that infrastructure)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tackling full page writes

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:06 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 05/24/2011 04:34 PM, Robert Haas wrote:

 we could name this feature partial full page writes and enable it
 either with a setting of full_page_writes=partial

 +1 to overloading the initial name, but only if the parameter is named
 'maybe', 'sometimes', or 'perhaps'.

Perfect!

 I've been looking into a similar refactoring of the names here, where we
 bundle all of these speed over safety things (fsync, full_page_writes, etc.)
 into one control so they're easier to turn off at once.  Not sure if it
 should be named web_scale or do_you_feel_lucky_punk.

Actually, I suggested that same idea to you, or someone, a while back,
only I was serious.  crash_safety=off.  I never got around to fleshing
out the details, though.

 3. Going a bit further, Greg proposed the idea of ripping out our
 current WAL infrastructure altogether and instead just having one WAL
 record that says these byte ranges on this page changed to have these
 new contents.

 The main thing that makes this idea particularly interesting to me, over the
 other two, is that it might translate well into the idea of using
 sync_file_range to aim for a finer fsync call on Linux than is currently
 possible.

Hmm, maybe.  But it's possible that the dirty blocks are the first and
last ones in the file.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:15 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree that if VACUUM scanned 99% of the table, it's probably fine to
 use its numbers.  It's also fine to use the numbers from ANALYZE,
 because those pages are chosen randomly.  What bothers me is the idea
 of using a small *non-random* sample, and I'm not sure that
 incorporating possibly-bogus results slowly is any better than
 incorporating them quickly.

In particular, unless I'm misremembering, VACUUM *always* scans the
first few pages of the table, until it sees enough consecutive
all-visible bits that it decides to start skipping.  If I'm right
about that, then those pages could easily end up being overweighted
when VACUUM does the counting; especially if ANALYZE or an actual
full-table vacuum aren't allowed to snap the count back to reality.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Cédric Villemain
2011/5/25 Alvaro Herrera alvhe...@commandprompt.com:
 Excerpts from Kevin Grittner's message of mié may 25 12:37:24 -0400 2011:
 Tom Lane t...@sss.pgh.pa.us wrote:

  I don't know what client-side code might be looking at
  relpages/reltuples.

 I know that I find reltuples useful for getting an accurate enough
 sense of rows in a table (or set of tables) without resorting to
 count(*).  I'd be OK with any two of pages, tuples, and density; but
 dropping to one would make databases harder to manage, IMV.

 Personally, I don't have much code that uses those columns;
 eliminating an existing column wouldn't involve much pain for me as
 long as it could still be derived.

 Well, we only actually need to store one number, because you can figure
 out a much more precise number-of-pages figure with pg_relation_size()
 divided by configured page size.

 (We feel free to hack around catalogs in other places, and we regularly
 break pgAdmin and the like when we drop columns -- people just live with
 it and update their tools.  I don't think it's such a big deal in this
 particular case.)

I may miss something but we need relation size in costsize.c even if
we have a reldensity (or we need a reltuples). Else what values should
be used to estimate the relation size ? (pg_relation_size() goes down
to kernel/fs to ask the stat.st_size, is it really what we want?)



 --
 Álvaro Herrera alvhe...@commandprompt.com
 The PostgreSQL Company - Command Prompt, Inc.
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 12:34 PM, Vaibhav Kaushal
vaibhavkaushal...@gmail.com wrote:
 But somehow the execevalvar is being called. When i changed the logic of its
 working to use slot_getattr instead of cute_datum_array for the first run /
 call, everything just worked!

This is what gdb is for... set a breakpoint on that function using "b
ExecEvalVar", and when it hits the breakpoint, use "bt" to see where
the call came from.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian br...@momjian.us wrote:
 Alvaro Herrera wrote:
 Excerpts from Bruce Momjian's message of mar may 24 15:59:59 -0400 2011:

   I think you misread what I wrote, or I misexplained it, but never
   mind.  Matching locale names case-insensitively sounds reasonable to
   me, unless someone has reason to believe it will blow up.
 
  OK, that's what I needed to hear.  I have applied the attached patch,
  but only to 9.1 because  of the risk of breakage. (This was only the
  first bug report of this, and we aren't 100% certain about the case
  issue.)

 Hmm, I know the locale can be spelled en_US.UTF-8 (note dash) in some
 systems.  So if you have a locale alias that makes en_US.utf8 the same
 as en_US.UTF-8, the patched code would still not work.  I wonder if
 there's a need for canonicalization somewhere.  Or maybe this is just
 not worth the trouble.

 I can easily remove dashes before the compare if people like that idea
 --- I think you could argue that a dash is not significant, unless ab-c
 and a-bc are different locales.

I think the more we mush that string around, the more chance we have
of breaking something.  What's wrong with insisting that people set
the value to the same thing?  Like, really the same?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Hitoshi Harada
2011/5/26 Robert Haas robertmh...@gmail.com:
 On Wed, May 25, 2011 at 10:35 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 2011/5/25 Hitoshi Harada umi.tan...@gmail.com:
 So I'm still
 thinking which of pulling up and parameterized scan is better.

 After more investigation I came up with third idea, pushing down
 RangeTblEntry to aggregate subquery. This sounds like a crazy idea,
 but it seems to me like it is slightly easier than pulling up
 agg-subquery. The main point is that when you want to pull up, you
 must care if the final output would be correct with other join
 relations than the aggregate-related one. In contrast, when pushing
 down the target relation to agg-subquery it is simple to ensure the
 result.

 I'm looking around pull_up_subqueries() in subquery_planner() to add
 the pushing down logic. It could be possible to do it around
 make_one_rel() but I bet query structure changes are doable against
 Query, not PlannerInfo.

 How do you decide whether or not to push down?

Yeah, that's the problem. In addition to the conditions of join-qual
== grouping key && outer is unique on qual, we need some criteria if
it should be done. At first I started to think I can compare cost of
two different plan nodes, which are generated by calling
subquery_planner() twice. But now my plan is to apply some heuristics
like that join qual selectivity is less than 10% or so. I either don't
like magic numbers but given Query restructuring instead of
PlannerInfo (which means we cannot use Path) it is only left way. To
get it work is my first goal anyway.

Regards,

-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Alvaro Herrera
Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

  Well, we only actually need to store one number, because you can figure
  out a much more precise number-of-pages figure with pg_relation_size()
  divided by configured page size.

 I may miss something but we need relation size in costsize.c even if
 we have a reldensity (or we need a reltuples). Else what values should
 be used to estimate the relation size ? (pg_relation_size() goes down
 to kernel/fs to ask the stat.st_size, is it really what we want?)

Actually yes, the planner does go to kernel to determine the current
relation size, and then multiplies by density as computed from catalog
data to figure out current reasonably accurate number of tuples.
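
In other words, roughly (a simplified sketch, not the planner's actual code):

    /*
     * Current tuple-count estimate: stored density (reltuples / relpages
     * from pg_class) times the table's current size in pages as reported
     * by the smgr/kernel.  Simplified; the real logic also handles the
     * no-history case differently.
     */
    static double
    estimated_ntuples(double reltuples, double relpages, double curpages)
    {
        double  density = (relpages > 0) ? reltuples / relpages : 0.0;

        return density * curpages;
    }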

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Ross J. Reedstrom
On Mon, May 23, 2011 at 11:08:40PM -0400, Robert Haas wrote:
 
 I don't really like the idea of adding a GUC for this, unless we
 convince ourselves that nothing else is sensible.  I mean, that leads
 to conversations like this:
 
 Newbie: My query is slow.
 Hacker: Turn on enable_magic_pixie_dust and it'll get better.
 Newbie: Oh, yeah, that is better.  Why isn't this turned on by default, 
 anyway?
 Hacker: Well, on pathological queries, it makes planning take way too
 long, so we think it's not really safe to enable it by default.
 Newbie: Wait... I thought you just told me to enable it.  It's not safe?
 Hacker: Well, sort of.  I mean, uh... hey, look, an elephant!
 

ROTFL! This needs to go on the wiki somewhere discussing why HACKERs
rejects tunable knobs as often as happens.

Ross
-- 
Ross Reedstrom, Ph.D. reeds...@rice.edu
Systems Engineer  Admin, Research Scientistphone: 713-348-6166
Connexions  http://cnx.orgfax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E  F888 D3AE 810E 88F0 BEDE

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
 On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian br...@momjian.us wrote:

  I can easily remove dashes before the compare if people like that idea
  --- I think you could argue that a dash is not significant, unless ab-c
  and a-bc are different locales.
 
 I think the more we mush that string around, the more chance we have
 of breaking something.  What's wrong with insisting that people set
 the value to the same thing?  Like, really the same?

No objection here to that idea.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH

2011-05-25 Thread Alvaro Herrera
Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011:

 Right.  It would also increase the cognitive load on the user to have
 to remember the command-line go-to-line-number switch for his editor.
 So I don't particularly want to redesign this feature.  However, I can
 see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from
 the same place that you set EDITOR, which would suggest that we allow
 the value to come from an environment variable.  I'm not sure whether
 there is merit in allowing both that source and ~/.psqlrc, though
 possibly for Windows users it might be easier if ~/.psqlrc worked.

If we're going to increase the number of options in .psqlrc that do not
work with older psql versions, can I please have .psqlrc-MAJORVERSION or
some such?  Having 8.3's psql complain all the time because it doesn't
understand linestyle is annoying.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:37 PM, Hitoshi Harada umi.tan...@gmail.com wrote:
 How do you decide whether or not to push down?

 Yeah, that's the problem. In addition to the conditions of join-qual
 == grouping key && outer is unique on qual, we need some criteria if
 it should be done. At first I started to think I can compare cost of
 two different plan nodes, which are generated by calling
 subquery_planner() twice. But now my plan is to apply some heuristics
 like that join qual selectivity is less than 10% or so. I either don't
 like magic numbers but given Query restructuring instead of
 PlannerInfo (which means we cannot use Path) it is only left way. To
 get it work is my first goal anyway.

I think getting it working is probably a good first goal.  I am not
really sure that we want to commit it that way, and I think my vote
would be for you to work on the approach we discussed before rather
than this one, but it's your project, and I think you'll probably
learn enough in getting it working that it will be a step forward in
any case.  The planner is complex enough that it's worth trying to get
something that works, first, before trying to make it perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Pavan Deolasee
On Wed, May 25, 2011 at 10:36 PM, Robert Haas robertmh...@gmail.com wrote:


 I don't see how you'd store
 anything in the hole without it being in a fixed place, where it would
 eventually be hit by either the line pointer array or tuple data.

 The point is that it doesn't matter.  Suppose we put it just after the
  line pointer array.

I think the point is you can not *always* put it just after the line
pointer array without possibly shuffling the tuples. Remember we need
to put the LSN when the dead line pointer is generated because we
decided to prune away the dead tuple. Say, for example, the page is
completely full and there are no dead line pointers and hence no LSN
on the page. Also there is no free space after the line pointer array.
Now say we prune dead tuples and generate dead line pointers, but the
last line pointer in the array is still in-use and the first tuple
immediately after the line pointer array is live. Since you generated
dead line pointers you want to store the LSN on the page. Now, there
is no way you can store it after the line pointer array without moving
the live tuple somewhere else.

That's the point Alvaro and I are making. Do you agree with that logic?
Now, that does not matter, because we would always generate dead line
pointers while holding a buffer cleanup lock and hence we are free to
shuffle tuples around. Maybe we are digressing on a trivial detail
here, but I hope I got it correct.

 Any time we're thinking about extending the line
 pointer array, we already have an exclusive lock on the buffer.  And
 if we already have a exclusive lock on the buffer, then we can reclaim
 the dead line pointers and now we no longer need the saved LSN, so
 writing over it is perfectly fine.


The trouble is you may not be able to shrink the line pointer array.
But of course, you can reuse the reclaimed dead line pointers. I would
still advocate doing that during the pruning operation because we want
to emit WAL records for the operation.

 OK, I lied: if we have an exclusive buffer lock, but the last vacuum
 either failed, or is still in progress, then the LSN might not be old
 enough for us to reclaim the dead line pointers yet.  So ideally we'd
 like to hold onto it.  We can do that by either (a) moving the LSN out
 another 6 bytes, if there's enough room; or (b) deciding not to put
 the new tuple on this page, after all.  There's no situation in which
 we absolutely HAVE to get another tuple onto this particular page.  We
 can just decide that the effective size of a page that contains dead
 line pointers is effectively 8 bytes less.  The alternative is to eat
 up 8 bytes of space on ALL pages, whether they contain dead line
 pointers or not.


I think we are on the same page as far as storing LSN if and only if
its required. But what was not convincing is the argument that you can
*always* find free space for the LSN without moving things around.

Let me summarize the sequence of operations and let me know if you
still disagree with the general principle:

1. There are no dead line pointers in the page - we are good.
2. Few tuples become dead, HOT pruning is invoked either during normal
operation or heap vacuum. The dead tuples are pruned away and
truncated to dead line pointers. We already hold cleanup lock on the
buffer. We set the flag in the page header and store the LSN (either
at the end of line pointer array or at the end of the page)
3. Someday index vacuum is run and it removes the index pointers to
the dead line pointers. We remember the start LSN of the index vacuum
somewhere, may be as a pg_class attribute (how does index vacuum get
the list of dead line pointers is not material in the general scheme
of things)
4. When the page is again chosen for pruning, we check if the flag is
set in the header. If so, get the LSN stored in the page, check it
against the last successful index vacuum LSN and if it precedes the
index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The
special LSN can be removed unless new LP_DEAD line pointers get
generated during the pruning, otherwise its overwritten with the
current LSN. Since we hold the buffer cleanup lock, the special LSN
storage can be reclaimed by shuffling things around.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] adding a new column in IDENTIFY_SYSTEM

2011-05-25 Thread Magnus Hagander
On Tue, May 24, 2011 at 22:31, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Tue, May 24, 2011 at 8:52 PM, Fujii Masao masao.fu...@gmail.com wrote:

 +       primary_xlp_magic = atoi(PQgetvalue(res, 0, 2));

 You wrongly get the third field (i.e., current xlog location) as the
 WAL version.
 You should call PQgetvalue(res, 0, 3), instead.

 errdetail("Expected 1 tuple with 3 fields, got %d tuples with %d fields.",

 You need to change the above message.


 Fixed.

 About your comments on the check... if you read the thread, you will
 find that the whole reason for the field is future improvement, but
 everyone wanted some use of the field now... so I made a patch to use
 it in pg_basebackup before the transfer starts and avoid wasting time and
 bandwidth, but Magnus preferred this in walreceiver...

The idea *at this point* was, I believe, to be able to provide a more
useful error message in the case of walreceiver.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Another attempt at vacuum improvements

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 I think the point is you can not *always* put it just after the line
 pointer array without possibly shuffling the tuples. Remember we need
 to put the LSN when the dead line pointer is generated because we
 decided to prune away the dead tuple. Say, for example, the page is
 completely full and there are no dead line pointers and hence no LSN
 on the page. Also there is no free space after the line pointer array.
 Now say we prune dead tuples and generate dead line pointers, but the
 last line pointer in the array is still in-use and the first tuple
 immediately after the line pointer array is live. Since you generated
 dead line pointers you want to store the LSN on the page. Now, there
 is no way you can store is after the line pointer array without moving
 the live tuple somewhere else.

So far I agree.  But don't we always defragment immediately after
pruning dead tuples to line pointers?  The removal of even one tuple
will give us more than enough space to store the LSN.

 Let me summarize the sequence of operations and let me know if you
 still disagree with the general principle:

 1. There are no dead line pointers in the page - we are good.
 2. Few tuples become dead, HOT pruning is invoked either during normal
 operation or heap vacuum. The dead tuples are pruned away and
 truncated to dead line pointers. We already hold cleanup lock on the
 buffer. We set the flag in the page header and store the LSN (either
 at the end of line pointer array or at the end of the page)
 3. Someday index vacuum is run and it removes the index pointers to
 the dead line pointers. We remember the start LSN of the index vacuum
 somewhere, may be as a pg_class attribute (how does index vacuum get
 the list of dead line pointers is not material in the general scheme
 of things)
 4. When the page is again chosen for pruning, we check if the flag is
 set in the header. If so, get the LSN stored in the page, check it
 against the last successful index vacuum LSN and if it precedes the
 index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The
 special LSN can be removed unless new LP_DEAD line pointers get
 generated during the pruning, otherwise its overwritten with the
 current LSN. Since we hold the buffer cleanup lock, the special LSN
 storage can be reclaimed by shuffling things around.

Agreed.  The only thing I'm trying to do further is to avoid the need
for a reshuffle when the special LSN storage is reclaimed.  For
example, consider:

1. There are three tuples on the page.  We are good.
2. Tuple #2 becomes dead.  The tuple is pruned to a line pointer.  The
page is defragmented.  At this point, it doesn't matter WHERE we put
the LSN - we are rearranging the whole page anyway.
3. Index vacuum is run.
4. Now we want to make the dead line pointer unused, and reclaim the
LSN storage.  If the LSN is stored at the end of the page, then we now
have to move all of the tuple data forward by 8 bytes.  But if it's
stored adjacent to the hole in the middle of the page, we need only
clear the page-header bits saying it's there (and maybe adjust
pd_lower).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Alvaro Herrera wrote:
 Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
  On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian br...@momjian.us wrote:
 
   I can easily remove dashes before the compare if people like that idea
   --- I think you could argue that a dash is not significant, unless ab-c
   and a-bc are different locales.
  
  I think the more we mush that string around, the more chance we have
  of breaking something.  What's wrong with insisting that people set
  the value to the same thing?  Like, really the same?
 
 No objection here to that idea.

I thought the problem was that they upgraded the OS and now the encoding
names changed, though they behaved the same.  Is that not what is
happening?  Can they supply the values with different cases?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Minor issues with docs

2011-05-25 Thread Marco Nenciarini
While I was working on automatic translation of PostgreSQL's
documentation from SGML to XML, I found some minor issues.

In the file doc/src/sgml/ecpg.sgml there are many lines of C code
containing unescaped '' characters.

In the file doc/src/sgml/array.sgml there is a tag which has a case
mismatch error with its end tag.

Attached you can find the patch, if you want to apply it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From d22539bdb7cabcb6bfbf0ce1b80a59fbba283ca4 Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Wed, 25 May 2011 19:35:36 +0200
Subject: [PATCH] Fix minor issues with documentation markup


Signed-off-by: Marco Nenciarini marco.nenciar...@2ndquadrant.it
---
 doc/src/sgml/array.sgml |2 +-
 doc/src/sgml/ecpg.sgml  |   20 ++--
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index bb318c5..3508ba3 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -369,7 +369,7 @@ UPDATE sal_emp SET pay_by_quarter = ARRAY[25000,25000,27000,27000]
 <programlisting>
 UPDATE sal_emp SET pay_by_quarter[4] = 15000
 WHERE name = 'Bill';
-</programListing>
+</programlisting>
 
   or updated in a slice:
 
diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 9c6ca4c..a8ffde5 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -4154,7 +4154,7 @@ switch (v.sqltype)
 {
 case ECPGt_char:
 memset(&amp;var_buf, 0, sizeof(var_buf));
-memcpy(&amp;var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf) - 1 : sqllen));
+memcpy(&amp;var_buf, sqldata, (sizeof(var_buf) &lt;= sqllen ? sizeof(var_buf) - 1 : sqllen));
 break;
 
 case ECPGt_int: /* integer */
@@ -4390,7 +4390,7 @@ main(void)
 
 case ECPGt_char:
 memset(&amp;var_buf, 0, sizeof(var_buf));
-memcpy(&amp;var_buf, sqldata, (sizeof(var_buf) <= sqllen ? sizeof(var_buf)-1 : sqllen));
+memcpy(&amp;var_buf, sqldata, (sizeof(var_buf) &lt;= sqllen ? sizeof(var_buf)-1 : sqllen));
 break;
 
 case ECPGt_int: /* integer */
@@ -5871,39 +5871,39 @@ main(void)
 
 /* create */
 loid = lo_create(conn, 0);
-if (loid < 0)
+if (loid &lt; 0)
 printf("lo_create() failed: %s", PQerrorMessage(conn));

 printf("loid = %d\n", loid);

 /* write test */
 fd = lo_open(conn, loid, INV_READ|INV_WRITE);
-if (fd < 0)
+if (fd &lt; 0)
 printf("lo_open() failed: %s", PQerrorMessage(conn));

 printf("fd = %d\n", fd);

 rc = lo_write(conn, fd, buf, buflen);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_write() failed\n");

 rc = lo_close(conn, fd);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_close() failed: %s", PQerrorMessage(conn));

 /* read test */
 fd = lo_open(conn, loid, INV_READ);
-if (fd < 0)
+if (fd &lt; 0)
 printf("lo_open() failed: %s", PQerrorMessage(conn));

 printf("fd = %d\n", fd);

 rc = lo_read(conn, fd, buf2, buflen);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_read() failed\n");

 rc = lo_close(conn, fd);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_close() failed: %s", PQerrorMessage(conn));
 
 /* check */
@@ -5912,7 +5912,7 @@ main(void)
 
 /* cleanup */
 rc = lo_unlink(conn, loid);
-if (rc < 0)
+if (rc &lt; 0)
 printf("lo_unlink() failed: %s", PQerrorMessage(conn));
 
 EXEC SQL COMMIT;
-- 
1.7.5.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Expression Evaluator used for creating the plan tree / stmt ?

2011-05-25 Thread Vaibhav Kaushal
I think the command 'where' does the same. And the command showed something
which looked like it was part of evaluation... it got me confused. Anyway,
thanks Robert. I will check that too. I did not know the 'bt' command.

--
Sent from my Android
On 25 May 2011 23:02, Robert Haas robertmh...@gmail.com wrote:


Re: [HACKERS] The way to know whether the standby has caught up with the master

2011-05-25 Thread David Fetter
On Wed, May 25, 2011 at 12:34:59PM +0100, Simon Riggs wrote:
 On Wed, May 25, 2011 at 6:16 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 
  To achieve that, I'm thinking to change walsender so that, when the
  standby
  has caught up with the master, it sends back the message indicating that
  to
  the standby. And I'm thinking to add new function (or view like
  pg_stat_replication)
  available on the standby, which shows that info.
 
  By the time the standby has received that message, it might not be caught-up
  anymore because new WAL might've been generated in the master already.
 
 AFAICS, this is an exact case of the Byzantine General's problem.

Have they updated it to acknowledge that the city is now called
Istanbul?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian br...@momjian.us wrote:
 Alvaro Herrera wrote:
 Excerpts from Robert Haas's message of mié may 25 13:33:41 -0400 2011:
  On Wed, May 25, 2011 at 1:22 PM, Bruce Momjian br...@momjian.us wrote:

   I can easily remove dashes before the compare if people like that idea
   --- I think you could argue that a dash is not significant, unless ab-c
   and a-bc are different locales.
 
  I think the more we mush that string around, the more chance we have
  of breaking something.  What's wrong with insisting that people set
  the value to the same thing?  Like, really the same?

 No objection here to that idea.

 I thought the problem was that they upgraded the OS and now the encoding
 names changed, though they behaved the same.  Is that not what is
 happening?  Can they supply the values with different cases?

Oh, hmm.  I don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Simon Riggs
On Wed, May 25, 2011 at 3:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci m_li...@yahoo.it 
 wrote:
 Da: Simon Riggs si...@2ndquadrant.com
 I can't find a clear  discussion of what you are trying to do, and how,
 just a URL back to a  complex discussion on another topic.


 While trying to write a patch to allow changing an unlogged table into
 a logged one, I had to add another int field to xl_xact_commit.
 Robert Haas said:

  I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.


 which is a correct remark.

 xl_xact_commit can contain some arrays (relations to drop,
 committed sub-trans, shared invalidation msgs). The length of
 these arrays is specified using 3 ints in the struct.

 So, to avoid adding more ints to the struct, I've been suggested to
 remove all the ints, and use   xl_xact_commit.xinfo to flag which
 arrays are, in fact, present.

 So the whole idea is:

 - remove nrels, nsubxacts and nmsgs from xl_xact_commit
 - use bits in xinfo to signal which arrays are present at the end
 of   xl_xact_commit
 - for each present array, add the length of the array (as int) at
 the end of    xl_xact_commit
 - add each present array after all the lengths


 OK, thats clear. Thanks.

 That formatting sounds quite complex.

 I would propose we split this into 2 WAL records: xl_xact_commit and
 xl_xact_commit_with_info

 xl_xact_commit doesn't have any flags, counts or arrays.

 xl_xact_commit_with_info always has all 3 counts, even if zero.
 Arrays follow the main record

 I think it might also be possible to remove dbId and tsId from
 xl_xact_commit if we use that definition.

Yes, that's correct. We can remove them from the normal commit record
when nmsgs == 0.

I think we can get away with these 2 definitions, but pls check. Using
static definitions works better for me because we can see what they
contain, rather than having the info flags imply that the record can
contain any permutation of settings when that's not really possible.

  {
        TimestampTz xact_time;  /* time of commit */
        int         nsubxacts;  /* number of subtransaction XIDs */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

  {
        TimestampTz xact_time;  /* time of commit */
        uint32      xinfo;      /* info flags */
        int         nrels;      /* number of RelFileNodes */
        int         nsubxacts;  /* number of subtransaction XIDs */
        int         nmsgs;      /* number of shared inval msgs */
        Oid         dbId;       /* MyDatabaseId */
        Oid         tsId;       /* MyDatabaseTableSpace */
        /* Array of RelFileNode(s) to drop at commit */
        RelFileNode xnodes[1];  /* VARIABLE LENGTH ARRAY */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
        /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
  } xl_xact_commit_with_info;

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 10:41 AM, Simon Riggs si...@2ndquadrant.com wrote:
 OK, thats clear. Thanks.

 That formatting sounds quite complex.

 I would propose we split this into 2 WAL records: xl_xact_commit and
 xl_xact_commit_with_info

 xl_xact_commit doesn't have any flags, counts or arrays.

 xl_xact_commit_with_info always has all 3 counts, even if zero.
 Arrays follow the main record

 I think it might also be possible to remove dbId and tsId from
 xl_xact_commit if we use that definition.

That's an interesting idea.  I hadn't noticed that if nmsgs=0 then
those fields aren't needed either.  Also, both nrels>0 and nmsgs>0
will probably only happen if some DDL has been done, which most
transactions probably don't, so it would be logical to have one record
type that excludes nrels, nmsgs, dbId, tsId, and another record type
that includes all of those.  In fact, we could probably throw in xinfo
as well: if there's no DDL involved, we shouldn't need to set
XACT_COMPLETION_UPDATE_RELCACHE_FILE or
XACT_COMPLETION_FORCE_SYNC_COMMIT either.

But I'm not so sure about nsubxacts.  It seems to me that we might
lose a lot of the benefit of having two record types if we have to use
the larger one whenever nsubxacts>0.  I'm not exactly sure how common
subtransactions are, but if I had to guess I'd speculate that they are
far more frequent than DDL operations.  Maybe we should think about
making the xl_xact_commit record contain just xact_time, nsubxacts,
and a subxact array; and then have the xl_xact_commit_with_info (or
maybe xl_xact_commit_full, or whatever we decide to call it) contain
everything else.   That would shave off 20 bytes from the commit
records of any non-DDL operation, which would be pretty nice.
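
Written out in the same style as the structs quoted earlier in the thread, the split described here might look roughly like this (the names and field layout are illustrative only, not a worked-out proposal):

    /* Commit record for the common case: no DDL, maybe subtransactions. */
    typedef struct xl_xact_commit_compact
    {
        TimestampTz xact_time;      /* time of commit */
        int         nsubxacts;      /* number of subtransaction XIDs */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
    } xl_xact_commit_compact;

    /* Full record, used only when DDL makes the extra fields necessary. */
    typedef struct xl_xact_commit_full
    {
        TimestampTz xact_time;      /* time of commit */
        uint32      xinfo;          /* info flags */
        int         nrels;          /* number of RelFileNodes to drop */
        int         nsubxacts;      /* number of subtransaction XIDs */
        int         nmsgs;          /* number of shared inval msgs */
        Oid         dbId;           /* MyDatabaseId */
        Oid         tsId;           /* MyDatabaseTableSpace */
        /* RelFileNodes, subxact XIDs, inval messages follow */
    } xl_xact_commit_full;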

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Cédric Villemain
2011/5/25 Alvaro Herrera alvhe...@commandprompt.com:
 Excerpts from Cédric Villemain's message of mié may 25 13:24:01 -0400 2011:

  Well, we only actually need to store one number, because you can figure
  out a much more precise number-of-pages figure with pg_relation_size()
  divided by configured page size.

 I may miss something but we need relation size in costsize.c even if
 we have a reldensity (or we need a reltuples). Else what values should
 be used to estimate the relation size ? (pg_relation_size() goes down
 to kernel/fs to ask the stat.st_size, is it really what we want?)

 Actually yes, the planner does go to kernel to determine the current
 relation size, and then multiplies by density as computed from catalog
 data to figure out current reasonably accurate number of tuples.

Okay! I just read that part. Interesting.
(If I dive correctly, we search our last segment and then use a
fileseek to the end of this segment to get our information)

Makes more sense, suddenly :)


 --
 Álvaro Herrera alvhe...@commandprompt.com
 The PostgreSQL Company - Command Prompt, Inc.
 PostgreSQL Replication, Consulting, Custom Development, 24x7 support




-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade automatic testing

2011-05-25 Thread Peter Eisentraut
On ons, 2011-04-27 at 18:14 -0400, Noah Misch wrote:
 Enthusiastic +1 for this concept.  There's at least one rough edge: it fails 
 if
 you have another postmaster running on port 5432.

This has now been addressed: pg_upgrade accepts PGPORT settings.
Attached is a slightly updated patch that runs the test suite with a port of
65432, which you can override by setting PGPORT yourself.

Anyway, is this something that people want in the repository?  It's not
as polished as the pg_regress business, but it is definitely helpful.

diff --git i/contrib/pg_upgrade/Makefile w/contrib/pg_upgrade/Makefile
index 8f3fd7c..9ec7bc0 100644
--- i/contrib/pg_upgrade/Makefile
+++ w/contrib/pg_upgrade/Makefile
@@ -21,3 +21,6 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+check: test.sh
+	MAKE=$(MAKE) bindir=$(bindir) $(SHELL) $<
diff --git i/contrib/pg_upgrade/test.sh w/contrib/pg_upgrade/test.sh
index e69de29..c675f26 100644
--- i/contrib/pg_upgrade/test.sh
+++ w/contrib/pg_upgrade/test.sh
@@ -0,0 +1,41 @@
+set -eux
+
+: ${MAKE=make}
+: ${PGPORT=65432}
+export PGPORT
+
+temp_root=$PWD/tmp_check
+temp_install=$temp_root/install
+bindir=$temp_install/$bindir
+PATH=$bindir:$PATH
+export PATH
+
+PGDATA=$temp_root/data
+export PGDATA
+rm -rf $PGDATA $PGDATA.old
+
+logdir=$PWD/log
+rm -rf $logdir
+mkdir $logdir
+
+$MAKE -C ../.. install DESTDIR=$temp_install 2>&1 | tee $logdir/install.log
+$MAKE -C ../pg_upgrade_support install DESTDIR=$temp_install 2>&1 | tee -a $logdir/install.log
+$MAKE -C . install DESTDIR=$temp_install 2>&1 | tee -a $logdir/install.log
+
+initdb 2>&1 | tee $logdir/initdb1.log
+pg_ctl start -l $logdir/postmaster1.log -w
+$MAKE -C ../.. installcheck 2>&1 | tee $logdir/installcheck.log
+pg_dumpall >$temp_root/dump1.sql
+pg_ctl -m fast stop
+
+mv ${PGDATA} ${PGDATA}.old
+
+initdb 2>&1 | tee $logdir/initdb2.log
+
+pg_upgrade -d ${PGDATA}.old -D ${PGDATA} -b $bindir -B $bindir
+
+pg_ctl start -l $logdir/postmaster2.log -w
+pg_dumpall >$temp_root/dump2.sql
+pg_ctl -m fast stop
+
+diff -q $temp_root/dump1.sql $temp_root/dump2.sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use less space in xl_xact_commit patch

2011-05-25 Thread Robert Haas
On Wed, May 25, 2011 at 3:05 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 25, 2011 at 3:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, May 25, 2011 at 2:43 PM, Leonardo Francalanci m_li...@yahoo.it 
 wrote:
 Da: Simon Riggs si...@2ndquadrant.com
 I can't find a clear  discussion of what you are trying to do, and how,
 just a URL back to a  complex discussion on another topic.


 While trying to write a patch to allow changing an unlogged table into
 a logged one, I had to add another int field to xl_xact_commit.
 Robert Haas said:

  I have to admit I don't like this approach very much.  I can't see
 adding 4 bytes to every commit record for this feature.


 which is a correct remark.

 xl_xact_commit can contain some arrays (relations to drop,
 committed sub-trans, shared invalidation msgs). The length of
 these arrays is specified using 3 ints in the struct.

 So, to avoid adding more ints to the struct, I've been suggested to
 remove all the ints, and use   xl_xact_commit.xinfo to flag which
 arrays are, in fact, present.

 So the whole idea is:

 - remove nrels, nsubxacts and nmsgs from xl_xact_commit
 - use bits in xinfo to signal which arrays are present at the end
 of   xl_xact_commit
 - for each present array, add the length of the array (as int) at
 the end of    xl_xact_commit
 - add each present array after all the lengths


 OK, thats clear. Thanks.

 That formatting sounds quite complex.

 I would propose we split this into 2 WAL records: xl_xact_commit and
 xl_xact_commit_with_info

 xl_xact_commit doesn't have any flags, counts or arrays.

 xl_xact_commit_with_info always has all 3 counts, even if zero.
 Arrays follow the main record

 I think it might also be possible to remove dbId and tsId from
 xl_xact_commit if we use that definition.

 Yes, that's correct. We can remove them from the normal commit record
 when nmsgs == 0.

 I think we can get away with these 2 definitions, but pls check. Using
 static definitions works better for me because we can see what they
 contain, rather than having the info flags imply that the record can
 contain any permutation of settings when that's not really possible.

  {
        TimestampTz xact_time;          /* time of commit */
        int                     nsubxacts;              /* number of subtransaction XIDs */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;

  {
        TimestampTz xact_time;          /* time of commit */
        uint32          xinfo;                  /* info flags */
        int                     nrels;                  /* number of RelFileNodes */
        int                     nsubxacts;              /* number of subtransaction XIDs */
        int                     nmsgs;          /* number of shared inval msgs */
        Oid                     dbId;                   /* MyDatabaseId */
        Oid                     tsId;                   /* MyDatabaseTableSpace */
        /* Array of RelFileNode(s) to drop at commit */
        RelFileNode xnodes[1];          /* VARIABLE LENGTH ARRAY */
        /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
        /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
  } xl_xact_commit_with_info;

Wow, that is identical to the conclusion that I came to about 60 seconds ago.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] about EDITOR_LINENUMBER_SWITCH

2011-05-25 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of mar may 24 17:11:17 -0400 2011:
 Right.  It would also increase the cognitive load on the user to have
 to remember the command-line go-to-line-number switch for his editor.
 So I don't particularly want to redesign this feature.  However, I can
 see the possible value of letting EDITOR_LINENUMBER_SWITCH be set from
 the same place that you set EDITOR, which would suggest that we allow
 the value to come from an environment variable.  I'm not sure whether
 there is merit in allowing both that source and ~/.psqlrc, though
 possibly for Windows users it might be easier if ~/.psqlrc worked.

 If we're going to increase the number of options in .psqlrc that do not
 work with older psql versions, can I please have .psqlrc-MAJORVERSION or
 some such?  Having 8.3's psql complain all the time because it doesn't
 understand linestyle is annoying.

1. I thought we already did have that.

2. In any case, EDITOR_LINENUMBER_SWITCH isn't a hazard for this,
because older versions will just think it's a variable without any
special meaning.

But the real question here is whether we want to change it to be also
(or instead?) an environment variable.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pull up aggregate subquery

2011-05-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think getting it working is probably a good first goal.  I am not
 really sure that we want to commit it that way, and I think my vote
 would be for you to work on the approach we discussed before rather
 than this one, but it's your project, and I think you'll probably
 learn enough in getting it working that it will be a step forward in
 any case.  The planner is complex enough that it's worth trying to get
 something that works, first, before trying to make it perfect.

Remember Brooks' advice: Plan to throw one away.  You will anyhow.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian br...@momjian.us wrote:
 I thought the problem was that they upgraded the OS and now the encoding
 names changed, though they behaved the same.  Is that not what is
 happening?  Can they supply the values with different cases?

 Oh, hmm.  I don't know.

The original complaint wasn't too clear.  But I agree with Robert that
case-insensitive comparison is about as far as it seems safe to go in
trying to allow for variant spellings of locale names.  I would think
that a platform that randomly whacks such names around is going to find
a whole lot of software breaking, not only Postgres.
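
A minimal sketch of such a case-insensitive check (illustrative only; the
helper name is made up and this is not pg_upgrade's actual code):

#include <ctype.h>
#include <stdio.h>

/*
 * Treat two locale names as equal if they differ only in letter case,
 * e.g. "en_US.UTF-8" vs "en_us.utf-8".  Anything beyond that (dropped
 * hyphens, renamed codesets) is deliberately not matched here.
 */
static int
locale_names_match(const char *a, const char *b)
{
	while (*a && *b)
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

int
main(void)
{
	printf("%d\n", locale_names_match("en_US.UTF-8", "en_us.utf-8"));	/* 1 */
	printf("%d\n", locale_names_match("en_US.UTF-8", "en_US.utf8"));	/* 0 */
	return 0;
}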

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nested CASE-WHEN scoping

2011-05-25 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 25.05.2011 17:47, Tom Lane wrote:
 [ scratches head ... ]  Why does the save/restore in ExecEvalCase not
 take care of this?

 The mistake happens during planning, when the SQL function is inlined 
 and pre-evaluated.

OK, I see the problem now: we have a CaseTestExpr in the arguments of
the function, which we are unable to reduce to a constant, so it gets
substituted as-is into the body of the function during inlining.  And
then it's physically inside the CASE expression that's in the function
body, so it looks like it syntactically belongs to that expression,
which it doesn't.  You're probably right that this is impractical to fix
without redesigning the expression representation, and that
CoerceToDomainValue has got a similar issue.

My advice is to not change the parser output representation, though.
What I think you ought to do about this is to have the planner replace 
CaseTestExpr and CoerceToDomainValue with PARAM_EXEC Params during
expression preprocessing, and assign suitable Param numbers which it
sticks into the CaseExpr (resp CoerceToDomainExpr) so that those
expressions know which Param slot to fill at runtime.  The
const-simplification logic can avoid getting dumber by treating the
cases like known-Param-value cases.  I don't think you need to invent
something separate from the PARAM_EXEC infrastructure to handle these.

The main annoyance here is that standalone expressions may now need
Param slots to execute, which will complicate places that need to
evaluate them, but there's probably no way around that (a bespoke
mechanism would complicate those callers just as much, if the number
of value slots it needs is variable, which it will be).
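
As a rough sketch of that substitution, assuming an expression_tree_mutator
style walk (the mutator and context names below are invented for
illustration, and the real thing would also have to tie each assigned Param
number back to its enclosing CaseExpr):

#include "postgres.h"

#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

/* invented context: tracks the next free PARAM_EXEC slot */
typedef struct replace_casetest_context
{
	int			next_param_id;
} replace_casetest_context;

static Node *
replace_casetest_mutator(Node *node, replace_casetest_context *context)
{
	if (node == NULL)
		return NULL;
	if (IsA(node, CaseTestExpr))
	{
		CaseTestExpr *cte = (CaseTestExpr *) node;
		Param	   *param = makeNode(Param);

		/* turn the placeholder into an executor parameter slot */
		param->paramkind = PARAM_EXEC;
		param->paramid = context->next_param_id++;
		param->paramtype = cte->typeId;
		param->paramtypmod = cte->typeMod;
		param->paramcollid = cte->collation;
		param->location = -1;
		return (Node *) param;
	}
	return expression_tree_mutator(node, replace_casetest_mutator,
								   (void *) context);
}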

 BTW, i just stumbled into this:

 postgres=# explain verbose SELECT CASE now() WHEN (29+random()::int4) 
 THEN 'foo' ELSE 'bar' END;
 ERROR:  unexpected CASE WHEN clause: 326

 Looks like ruleutils.c is also not prepared for the case that the 
 implicit equality operation gets inlined into something other than an OpExpr.

Grumble ... I thought we'd fixed that ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #6034: pg_upgrade fails when it should not.

2011-05-25 Thread Bruce Momjian
Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Wed, May 25, 2011 at 2:13 PM, Bruce Momjian br...@momjian.us wrote:
  I thought the problem was that they upgraded the OS and now the encoding
  names changed, though they behaved the same.  Is that not what is
  happening?  Can they supply the values with different cases?
 
  Oh, hmm.  I don't know.
 
 The original complaint wasn't too clear.  But I agree with Robert that
 case-insensitive comparison is about as far as it seems safe to go in
 trying to allow for variant spellings of locale names.  I would think
 that a platform that randomly whacks such names around is going to find
 a whole lot of software breaking, not only Postgres.

Agreed.  Unless an expert shows up and can tell us exactly how the
naming works, we have gone as far as we can go.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] pg_class reltuples/relpages not updated by autovacuum/vacuum

2011-05-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, May 25, 2011 at 1:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Because the problem is not specific to TOAST tables.  As things
 currently stand, we will accept the word of an ANALYZE as gospel even if
 it scanned 1% of the table, and completely ignore the results from a
 VACUUM even if it scanned 99% of the table.  This is not sane.

 I agree that if VACUUM scanned 99% of the table, it's probably fine to
 use its numbers.  It's also fine to use the numbers from ANALYZE,
 because those pages are chosen randomly.  What bothers me is the idea
 of using a small *non-random* sample, and I'm not sure that
 incorporating possibly-bogus results slowly is any better than
 incorporating them quickly.

The above is simply fuzzy thinking.  The fact that ANALYZE looked at a
random subset of pages is *no guarantee whatsoever* that its results are
highly accurate.  They might be more trustworthy than VACUUM's nonrandom
sample of a similar number of pages, but it doesn't hold even a little
bit of water to claim that we should believe ANALYZE completely and
VACUUM not at all even when the latter has looked at a significantly
larger sample of pages.

In any case, your line of thought doesn't help us for fixing the problem
with toast tables, because we aren't going to start doing ANALYZEs on
toast tables.

The bottom line here is that making use of stats we have is a lot better
than not making use of them, even if they aren't entirely trustworthy.
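
One way to picture such weighting, as a purely illustrative sketch (made-up
names and numbers, not committed code): trust the scan's observed density in
proportion to the fraction of the table it actually visited.

#include <math.h>
#include <stdio.h>

/*
 * Illustrative sketch: blend a partial scan's tuples-per-page density into
 * the existing estimate in proportion to the fraction of pages it saw, and
 * believe it outright when it covered the whole table.
 */
static double
blended_reltuples(double old_reltuples, double total_pages,
				  double scanned_pages, double scanned_tuples)
{
	double		old_density;
	double		new_density;
	double		scanned_fraction;

	if (total_pages <= 0 || scanned_pages >= total_pages)
		return scanned_tuples;	/* empty table or full scan */

	old_density = old_reltuples / total_pages;
	new_density = scanned_tuples / scanned_pages;
	scanned_fraction = scanned_pages / total_pages;

	return floor((old_density * (1.0 - scanned_fraction) +
				  new_density * scanned_fraction) * total_pages + 0.5);
}

int
main(void)
{
	/* old estimate: 100000 tuples in 1000 pages; a 99-page scan saw 12000 tuples */
	printf("%.0f\n", blended_reltuples(100000.0, 1000.0, 99.0, 12000.0));
	return 0;
}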

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >