[HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-02-09 Thread Christoph Berg
Hi,

we have a database that is storing strings in various encodings (and
non-encodings, namely the arbitrary byte soup that you might see in
email headers from the internet). For this reason, the database uses
sql_ascii encoding. The columns are text, as most characters are
ascii, so bytea didn't seem the right way to go.

Currently we are on 8.3 and try to upgrade to 9.1, but the plperlu
functions we have are acting up.

Old behavior on 8.3 .. 9.0:

sql_ascii =# create or replace function whitespace(text) returns text
language plperlu as $$ $a = shift; $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace (E'\200'); -- 0x80 is not valid utf-8
 whitespace 

 

sql_ascii =# select whitespace (E'\200')::bytea;
 whitespace 

 \x80

New behavior on 9.1.2:

sql_ascii =# select whitespace (E'\200');
ERROR:  XX000: Malformed UTF-8 character (fatal) at line 1.
KONTEXT:  PL/Perl function whitespace
ORT:  plperl_call_perl_func, plperl.c:2037

A crude workaround is:

sql_ascii =# create or replace function whitespace_utf8_off(text)
returns text language plperlu as $$ use Encode; $a = shift;
Encode::_utf8_off($a); $a =~ s/[\t ]+/ /g; return $a; $$;
CREATE FUNCTION

sql_ascii =# select whitespace_utf8_off (E'\200');
 whitespace_utf8_off 
-
 \u0080

sql_ascii =# select whitespace_utf8_off (E'\200')::bytea;
 whitespace_utf8_off 
-
 \xc280

(Note that the workaround is not perfect as the resulting 0x80..0xff
bytes are still tagged to be utf8.)


I think the bug is in plperl_helpers.h:

/*
 * Create a new SV from a string assumed to be in the current database's
 * encoding.
 */

static inline SV *
cstr2sv(const char *str)
{
SV *sv;
char   *utf8_str = utf_e2u(str);

sv = newSVpv(utf8_str, 0);
SvUTF8_on(sv);

pfree(utf8_str);

return sv;
}

In sql_ascii databases, utf_e2u does not do any recoding, but then
SvUTF8_on still marks the string as utf-8, while it isn't.

(Returned values might also need fixing.)

In my view, this is clearly a bug in pl/perl on sql_ascii databases.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


signature.asc
Description: Digital signature


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-09 Thread Fujii Masao
On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This not a subtle thing, it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

Similar problem happened on my test. When I executed CREATE TABLE and
shut down the server with immediate mode, after recovery I could not see the
created table. Here are the server log of recovery with wal_debug = on:

LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
XLOG - nextOid: 24576
LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
Storage - file create: base/12277/16384
LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
LOG:  redo done at 0/179CCE8

According to the log there is no contrecord flag, ISTM the path treats the
contrecord of backup block incorrectly, and which causes the problem.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-09 Thread Fujii Masao
On Thu, Feb 9, 2012 at 7:25 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 3:32 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 1, 2012 at 11:46 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 On 31.01.2012 17:35, Fujii Masao wrote:

 On Fri, Jan 20, 2012 at 11:11 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com  wrote:

 On 20.01.2012 15:32, Robert Haas wrote:


 On Sat, Jan 14, 2012 at 9:32 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com    wrote:


 Here's another version of the patch to make XLogInsert less of a
 bottleneck
 on multi-CPU systems. The basic idea is the same as before, but several
 bugs
 have been fixed, and lots of misc. clean up has been done.



 This seems to need a rebase.



 Here you go.


 The patch seems to need a rebase again.


 Here you go again. It conflicted with the group commit patch, and the patch
 to WAL-log and track changes to full_page_writes setting.


 After applying this patch and then forcing crashes, upon recovery the
 database is not correct.

 If I make a table with 10,000 rows and then after that intensively
 update it using a unique key:

 update foo set count=count+1 where foobar=?

 Then after the crash there are less than 10,000 visible rows:

 select count(*) from foo

 This not a subtle thing, it happens every time.  I get counts of
 between 1973 and 8827.  Without this patch I always get exactly
 10,000.

 I don't really know where to start on tracking this down.

 Similar problem happened on my test. When I executed CREATE TABLE and
 shut down the server with immediate mode, after recovery I could not see the
 created table. Here are the server log of recovery with wal_debug = on:

 LOG:  database system was interrupted; last known up at 2012-02-09 19:18:50 
 JST
 LOG:  database system was not properly shut down; automatic recovery in 
 progress
 LOG:  redo starts at 0/179CC90
 LOG:  REDO @ 0/179CC90; LSN 0/179CCB8: prev 0/179CC30; xid 0; len 4:
 XLOG - nextOid: 24576
 LOG:  REDO @ 0/179CCB8; LSN 0/179CCE8: prev 0/179CC90; xid 0; len 16:
 Storage - file create: base/12277/16384
 LOG:  REDO @ 0/179CCE8; LSN 0/179DDE0: prev 0/179CCB8; xid 998; len
 21; bkpb1: Heap - insert: rel 1663/12277/12014; tid 7/22
 LOG:  there is no contrecord flag in log file 0, segment 1, offset 7987200
 LOG:  redo done at 0/179CCE8

 According to the log there is no contrecord flag, ISTM the path treats the
 contrecord of backup block incorrectly, and which causes the problem.

Yep, as far as I read the patch, it seems to have forgotten to set
XLP_FIRST_IS_CONTRECORD flag.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add protransform for numeric, varbit, and temporal types

2012-02-09 Thread Noah Misch
On Wed, Feb 08, 2012 at 09:37:01AM -0500, Robert Haas wrote:
 On Tue, Feb 7, 2012 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:
  I've committed the numeric and varbit patches and will look at the
  temporal one next.
 
 Committed, after changing the OIDs so they don't conflict.
 
 Here's one more case for you to ponder:
 
 rhaas=# create table noah (i interval day);
 CREATE TABLE
 rhaas=# alter table noah alter column i set data type interval second(3);
 DEBUG:  rewriting table noah
 ALTER TABLE
 
 Do we really need a rewrite in that case?  The code acts like the
 interval range and precision are separate beasts, but is that really
 true?

The code has a thinko; a given interval typmod ultimately implies a single
point from which we truncate rightward.  The precision only matters if the
range covers SECOND.  Thanks; the attached patch improves this.
*** a/src/backend/utils/adt/timestamp.c
--- b/src/backend/utils/adt/timestamp.c
***
*** 958,963  interval_transform(PG_FUNCTION_ARGS)
--- 958,964 
int new_range = INTERVAL_RANGE(new_typmod);
int new_precis = 
INTERVAL_PRECISION(new_typmod);
int new_range_fls;
+   int old_range_fls;
  
if (old_typmod == -1)
{
***
*** 974,985  interval_transform(PG_FUNCTION_ARGS)
 * Temporally-smaller fields occupy higher positions in the 
range
 * bitmap.  Since only the temporally-smallest bit matters for 
length
 * coercion purposes, we compare the last-set bits in the 
ranges.
 */
new_range_fls = fls(new_range);
if (new_typmod == -1 ||
((new_range_fls = SECOND ||
! new_range_fls = fls(old_range)) 
!(new_precis = MAX_INTERVAL_PRECISION ||
  new_precis = old_precis)))
ret = relabel_to_typmod(source, new_typmod);
}
--- 975,990 
 * Temporally-smaller fields occupy higher positions in the 
range
 * bitmap.  Since only the temporally-smallest bit matters for 
length
 * coercion purposes, we compare the last-set bits in the 
ranges.
+* Precision, which is to say, sub-second precision, only 
affects
+* ranges that include SECOND.
 */
new_range_fls = fls(new_range);
+   old_range_fls = fls(old_range);
if (new_typmod == -1 ||
((new_range_fls = SECOND ||
! new_range_fls = old_range_fls) 
!(old_range_fls  SECOND ||
! new_precis = MAX_INTERVAL_PRECISION ||
  new_precis = old_precis)))
ret = relabel_to_typmod(source, new_typmod);
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Noah Misch
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
 Second, there's a concern about binary bloat: duplicating lots of code
 with different comparators inlined generates, well, a lot of machine
 code.  Of course, an 0.8% increase in the size of the resulting binary
 is very unlikely to produce any measurable performance regression on
 its own, but that's not really the issue.  The issue is rather that we
 could easily go nuts applying this technique in lots of different
 places - or even just in one place to lots of different types.  Let's
 say that we find that even for types with very complex comparators, we
 can get a 5% speedup on quick-sorting speed using this technique.
 Should we, then, apply the technique to every type we have?  Perhaps
 the binary would grow by, I don't know, 10%.  Maybe that's still not
 enough to start hurting performance (or making packagers complain),
 but surely it's getting there, and what happens when we discover that
 similar speedups are possible using the same trick in five other
 scenarios?  I think it's a bit like going out to dinner and spending
 $50.  It's nicer than eating at home, and many people can afford to do
 it occasionally, but if you do it every night, it gets expensive (and
 possibly fattening).
 
 So we need some principled way of deciding how much inlining is
 reasonable, because I am 100% certain this is not going to be the last
 time someone discovers that a massive exercise in inlining can yield a
 nifty performance benefit in some case or another: index builds and
 COPY have already been mentioned on this thread, and I expect that
 people will find other cases as well.  I'm not really sure what the
 budget is - i.e. how much binary bloat we can afford to add - or how
 many cases there are that can benefit, but the first isn't infinite
 and the second is more than the first.

Having such a metric would resolve this discussion, but formulating it will be
all but impossible.  We don't know how much time PostgreSQL installations now
spend sorting or how much additional time they would spend on cache misses,
TLB misses, page faults, etc.  That data won't show up on this thread.

You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
lookup table.  I would not feel bad about rejecting that, because it can live
quite comfortably as an external module.  Indeed, I think the best thing we
can do to constrain long-term bloat in the postgres executable is to improve
pluggability.  Let a wider range of features live outside the postmaster
binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
today), I would support doing that.  Extensibility is a hallmark of
PostgreSQL.  It's a bad day when we reject a minority-interest feature or
optimization that has no other way to exist.

Better pluggability can't ease the decision process for Peter's patch, because
its specializations essentially must live in the postgres executable to live
at all.  Nominally, we could let external modules inject sort specializations
for core types, and Peter could publish an all_fast_sorts extension.  That
would be useless punting: if we can't make a principled decision on whether
these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
the module make that decision?

This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
up in discussions thereof.  I appreciate your concern over a slippery slope of
inlining proposals, but I also don't wish to see the day when every patch
needs to declare and justify its binary byte count.  Unless, of course, we
discover that elusive metric for evaluating it fairly.

If we'd like to start taking interest in binary bloat, how about having the
buildfarm log capture an ls -l on the bin directory?  We'll then have data
to mine if we ever wish to really take action.


All that being said, I'd want to see a 15-30% (depending on how contrived)
improvement to a microbenchmark or a 5% improvement to a generic benchmark
(pgbench, DBT-N) before adopting an optimization of this complexity.  Based
on your measurements[2], per-type inlining gives us a 7% microbenchmark
improvement.  I'd therefore excise the per-type inlining.  (For the record,
given a 20% improvement to the same benchmark, I'd vote yea on the submission
and perhaps even suggest int2 and uuid support.)

nm

[1] 
http://archives.postgresql.org/message-id/ca+tgmozo1xsz+yiqz2mrokmcmqtb+jir0lz43cne6de7--q...@mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/ca+tgmobfz8sqtgtaaryerywuzfodgetprbuygbhsplr4+li...@mail.gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To 

Re: pg_receivexlog and sync rep Re: [HACKERS] Updated version of pg_receivexlog

2012-02-09 Thread Magnus Hagander
On Wed, Feb 8, 2012 at 19:39, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 7:37 PM, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Oct 24, 2011 at 14:40, Magnus Hagander mag...@hagander.net wrote:
 On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 How does this interact with synchronous replication? If a base backup that
 streams WAL is in progress, and you have synchronous_standby_names set to
 '*', I believe the in-progress backup will count as a standby for that
 purpose. That might give a false sense of security.

 Ah yes. Did not think of that. Yes, it will have this problem.

 Actually, thinking more, per other mail, it won't. Because it will
 never report that the data is synced to disk, so it will not be
 considered for sync standby.

 Now, new replication mode (synchronous_commit = write) is supported.
 In this mode, the in-progress backup will be considered as sync
 standby because its periodic status report includes the valid write position.
 We should change the report so that it includes only invalid positions.
 Patch attached.

Agreed, applied.


 While I agree that the backup should not behave as sync standby, ISTM
 that pg_receivexlog should, which is very useful. If pg_receivexlog does
 so, we can write WAL synchronously in both local and remote, which
 would increase the durability of the system. Thus, to allow pg_receivexlog
 to behave as sync standby, we should change it so that its status report
 includes the write and flush positions?

Yes, that could be useful. I don't think it should do so by default,
though, but it could be useful to provide a commandline switc hto
pg_receivexlog allowing it to do this.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-09 Thread Alvaro Herrera

Excerpts from Heikki Linnakangas's message of jue feb 09 06:42:12 -0300 2012:
 Add new keywords SNAPSHOT and TYPES to the keyword list in gram.y
 
 These were added to kwlist.h as unreserved keywords in separate patches,
 but authors forgot to add them to the corresponding list in gram.y.
 Because of that, even though they were supposed to be unreserved keywords,
 they could not be used as identifiers. src/tools/check_keywords.pl is your
 friend.

I wondered a couple of weeks ago if we could, instead, generate the
lists in gram.y from kwlist.h.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 7:24 AM, Noah Misch n...@leadboat.com wrote:
 On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
 Second, there's a concern about binary bloat: duplicating lots of code
 with different comparators inlined generates, well, a lot of machine
 code.  Of course, an 0.8% increase in the size of the resulting binary
 is very unlikely to produce any measurable performance regression on
 its own, but that's not really the issue.  The issue is rather that we
 could easily go nuts applying this technique in lots of different
 places - or even just in one place to lots of different types.  Let's
 say that we find that even for types with very complex comparators, we
 can get a 5% speedup on quick-sorting speed using this technique.
 Should we, then, apply the technique to every type we have?  Perhaps
 the binary would grow by, I don't know, 10%.  Maybe that's still not
 enough to start hurting performance (or making packagers complain),
 but surely it's getting there, and what happens when we discover that
 similar speedups are possible using the same trick in five other
 scenarios?  I think it's a bit like going out to dinner and spending
 $50.  It's nicer than eating at home, and many people can afford to do
 it occasionally, but if you do it every night, it gets expensive (and
 possibly fattening).

 So we need some principled way of deciding how much inlining is
 reasonable, because I am 100% certain this is not going to be the last
 time someone discovers that a massive exercise in inlining can yield a
 nifty performance benefit in some case or another: index builds and
 COPY have already been mentioned on this thread, and I expect that
 people will find other cases as well.  I'm not really sure what the
 budget is - i.e. how much binary bloat we can afford to add - or how
 many cases there are that can benefit, but the first isn't infinite
 and the second is more than the first.

 Having such a metric would resolve this discussion, but formulating it will be
 all but impossible.  We don't know how much time PostgreSQL installations now
 spend sorting or how much additional time they would spend on cache misses,
 TLB misses, page faults, etc.  That data won't show up on this thread.

 You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
 lookup table.  I would not feel bad about rejecting that, because it can live
 quite comfortably as an external module.  Indeed, I think the best thing we
 can do to constrain long-term bloat in the postgres executable is to improve
 pluggability.  Let a wider range of features live outside the postmaster
 binary.  For example, if we had the infrastructure to let hash, GIN, GiST and
 SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
 today), I would support doing that.  Extensibility is a hallmark of
 PostgreSQL.  It's a bad day when we reject a minority-interest feature or
 optimization that has no other way to exist.

 Better pluggability can't ease the decision process for Peter's patch, because
 its specializations essentially must live in the postgres executable to live
 at all.  Nominally, we could let external modules inject sort specializations
 for core types, and Peter could publish an all_fast_sorts extension.  That
 would be useless punting: if we can't make a principled decision on whether
 these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
 the module make that decision?

 This patch has gotten more than its fair share of attention for bloat, and I
 think that's mostly because there's a dial-a-bloat-level knob sticking out and
 begging to be frobbed.  On my system, fastpath_sort_2012_01_19.patch adds 85
 KiB to the postgres binary.  A recent commit added 57 KiB and bloat never came
 up in discussions thereof.  I appreciate your concern over a slippery slope of
 inlining proposals, but I also don't wish to see the day when every patch
 needs to declare and justify its binary byte count.  Unless, of course, we
 discover that elusive metric for evaluating it fairly.

 If we'd like to start taking interest in binary bloat, how about having the
 buildfarm log capture an ls -l on the bin directory?  We'll then have data
 to mine if we ever wish to really take action.


 All that being said, I'd want to see a 15-30% (depending on how contrived)
 improvement to a microbenchmark or a 5% improvement to a generic benchmark
 (pgbench, DBT-N) before adopting an optimization of this complexity.  Based
 on your measurements[2], per-type inlining gives us a 7% microbenchmark
 improvement.  I'd therefore excise the per-type inlining.  (For the record,
 given a 20% improvement to the same benchmark, I'd vote yea on the submission
 and perhaps even suggest int2 and uuid support.)

That's all well-said and I mostly agree with all of it, including your
suggested percentages in the last paragraph (but does anyone really
sort int2s or uuids?).  What I think is different about inlining 

Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 13:50, Robert Haas robertmh...@gmail.com wrote:
 I'm also more than slightly concerned that we are losing sight of the
 forest for the trees.  I have heard reports that sorting large amounts
 of data is many TIMES slower for us than for a certain other database
 product.  I frankly wonder if this entire patch is a distraction.
 When inlining is the tool of choice for further performance
 improvement, it suggests that we're pretty much out of ideas, and that
 whatever we get out of inlining - be it 6% or 30% - will be the end.
 I don't like that idea very much.

If you're talking about the product I think you're talking about,
well, their contracts forbid any actual benchmarks, so we won't have
any hard figures, but I find it intuitively difficult to believe that
the difference is that large, mostly because I've demonstrated that
the performance of qsort is a pretty good proxy for the performance of
a query like select * from foo order by bar, and qsort can't be too
far from optimal. Besides, didn't you yourself say I've been
skeptical of this patch for a number of reasons, the major one of
which is that most workloads spend only a very small amount of time in
doing qucksorts?

I have a new plan. I think you'll like it:

* drop the type specific specialisations entirely.

* drop qsort_arg (the original, unspecialised version). We now have
exactly the same number of source code copies of qsort as before: 2.

* gut the comparison of the leading sort key only from
comparetup_heap, comparetup_index_btree, etc (I collectively refer to
these functions as tuple-class encapsulating comparators, distinct
from their comparator proper).

* Instantiate two specialisations of qsort_arg: single key and
multiple key (exactly the same number as has already been agreed to,
since we lost the generic version). However, there is one key
difference now: we call one of the tuple class encapsulating functions
through a function pointer if the first comparison gives equality - .
Early indications are that this is almost as much of a win as the
single-key case. So the code effectively does this (but through a
function pointer):

#define MULTI_ADDITIONAL_CODE(COMPAR)   
\
{   
\
return comparetup_heap(aT, bT, state);  
\
}

This works because most comparisons won't actually need to look at the
second key, and because the cut-down tuple-class encapsulating
comparator that is used in the last revision of the patch doesn't
actually make any assumption about the particular tuple-class
encapsulating comparator in use.

It may not even be worth-while having a separate copy of the qsort
function for the multiple key case, so the question of binary bloat
becomes entirely irrelevant, as there would be a net gain of zero
copies of qsort_arg.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-02-09 Thread Bruce Momjian
On Wed, Feb 08, 2012 at 10:41:43PM -0500, Greg Smith wrote:
 Just trying to set the expectations bar realistically here.  I don't
 consider the easier problem of checkpoint smoothing a solved one,
 either.  Saying autovacuum needs to reach even that level of
 automation to be a useful improvement over now is a slippery goal.
 Regardless, the simple idea I submitted to this CF is clearly dead
 for now.  I'll take the feedback of this level of change can live
 in a user-side tuning tool and do that instead.  Since I was
 already thinking in the direction of background activity monitoring,
 I have a good idea how I'd need to approach this next, to be more
 likely to gain community buy-in as an automated improvement.  That's
 a longer term project though, which I'll hopefully be able to
 revisit for 9.3.

Totally agree.  If it is hard for us, it is super-hard to admins to set
this, so we had better give it serious thought.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 9:37 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 9 February 2012 13:50, Robert Haas robertmh...@gmail.com wrote:
 I'm also more than slightly concerned that we are losing sight of the
 forest for the trees.  I have heard reports that sorting large amounts
 of data is many TIMES slower for us than for a certain other database
 product.  I frankly wonder if this entire patch is a distraction.
 When inlining is the tool of choice for further performance
 improvement, it suggests that we're pretty much out of ideas, and that
 whatever we get out of inlining - be it 6% or 30% - will be the end.
 I don't like that idea very much.

 If you're talking about the product I think you're talking about,
 well, their contracts forbid any actual benchmarks, so we won't have
 any hard figures, but I find it intuitively difficult to believe that
 the difference is that large, mostly because I've demonstrated that
 the performance of qsort is a pretty good proxy for the performance of
 a query like select * from foo order by bar, and qsort can't be too
 far from optimal. Besides, didn't you yourself say I've been
 skeptical of this patch for a number of reasons, the major one of
 which is that most workloads spend only a very small amount of time in
 doing qucksorts?

Emphasis on large amounts of data.

I've said from the beginning that the small-sort case is less
interesting to me: it's nice to make it faster, but nobody's going to
quit using PostgreSQL because a quicksort takes 8ms instead of 6ms.
It's when we start talking about operations that take minutes or hours
that the differences become material.

 I have a new plan. I think you'll like it:

 * drop the type specific specialisations entirely.

Check.

 * drop qsort_arg (the original, unspecialised version). We now have
 exactly the same number of source code copies of qsort as before: 2.

This may be useful elsewhere some day, but I suppose we can always
pull it back out of git if we need it.

 * gut the comparison of the leading sort key only from
 comparetup_heap, comparetup_index_btree, etc (I collectively refer to
 these functions as tuple-class encapsulating comparators, distinct
 from their comparator proper).

 * Instantiate two specialisations of qsort_arg: single key and
 multiple key (exactly the same number as has already been agreed to,
 since we lost the generic version). However, there is one key
 difference now: we call one of the tuple class encapsulating functions
 through a function pointer if the first comparison gives equality - .
 Early indications are that this is almost as much of a win as the
 single-key case. So the code effectively does this (but through a
 function pointer):

 #define MULTI_ADDITIONAL_CODE(COMPAR)                                         
                                   \
 {                                                                             
                                                                           \
        return comparetup_heap(aT, bT, state);                                 
                                  \
 }

 This works because most comparisons won't actually need to look at the
 second key, and because the cut-down tuple-class encapsulating
 comparator that is used in the last revision of the patch doesn't
 actually make any assumption about the particular tuple-class
 encapsulating comparator in use.

 It may not even be worth-while having a separate copy of the qsort
 function for the multiple key case, so the question of binary bloat
 becomes entirely irrelevant, as there would be a net gain of zero
 copies of qsort_arg.

I'm not sure I entirely follow all this, but I'll look at the code
once you have it.  Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Bruce Momjian
On Thu, Feb 09, 2012 at 07:24:49AM -0500, Noah Misch wrote:
 This patch has gotten more than its fair share of attention for bloat, and I
 think that's mostly because there's a dial-a-bloat-level knob sticking out and
 begging to be frobbed.

I already emailed Peter privately stating that he shouldn't feel bad
about the many critiques of his patch --- this dial-a-bloat-level knob
is new for us, and we have to get used to the idea.  We are more
discussing the idea of a dial-a-bloat-level knob rather than his patch.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-02-09 Thread Simon Riggs
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch n...@leadboat.com wrote:
 On Wed, Feb 08, 2012 at 11:40:34AM +, Simon Riggs wrote:
 On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch n...@leadboat.com wrote:
  On Tue, Feb 07, 2012 at 08:58:59PM +, Simon Riggs wrote:
  On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch n...@leadboat.com wrote:
   This patch uses FPIs to guard against torn hint writes, even when the
   checksums are disabled. ?One could not simply condition them on the
   page_checksums setting, though. ?Suppose page_checksums=off and we're 
   hinting
   a page previously written with page_checksums=on. ?If that write tears,
   leaving the checksum intact, that block will now fail verification. ?A 
   couple
   of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only 
   if the
   old page had a checksum. ?(b) Add a checksumEnableLSN field to 
   pg_control.
   Whenever the cluster starts with checksums disabled, set the field to
   InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled 
   and the
   field is InvalidXLogRecPtr, set the field to the next LSN. ?When a 
   checksum
   failure occurs in a page with LSN older than the stored one, emit 
   either a
   softer warning or no message at all.
 
  We can only change page_checksums at restart (now) so the above seems 
  moot.
 
  If we crash with FPWs enabled we repair any torn pages.
 
  There's no live bug, but that comes at a high cost: the patch has us emit
  full-page images for hint bit writes regardless of the page_checksums 
  setting.

 Sorry, I don't understand what you mean. I don't see any failure cases
 that require that.

 page_checksums can only change at a shutdown checkpoint,

 The statement If that write tears,
   leaving the checksum intact, that block will now fail verification.
 cannot happen, ISTM.

 If we write out a block we update the checksum if page_checksums is
 set, or we clear it. If we experience a torn page at crash, the FPI
 corrects that, so the checksum never does fail verification. We only
 need to write a FPI when we write with checksums.

 If that's wrong, please explain a failure case in detail.

 In the following description, I will treat a heap page as having two facts.
 The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
 A torn page write lacking the protection of a full-page image can update one
 or both facts despite the transaction having logically updated both.  (This is
 just the normal torn-page scenario.)  Given that, consider the following
 sequence of events:

 1. startup with page_checksums = on
 2. update page P with facts CHKSUM, !HINT
 3. clean write of P
 4. clean shutdown

 5. startup with page_checksums = off
 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
 7. begin write of P
 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

 9. startup with page_checksums = off
 10. crash recovery.  does not touch P, because step 6 generated no WAL
 11. clean shutdown

 12. startup with page_checksums = on
 13. read P.  checksum failure

 Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
 because step 6 _does_ write WAL.  That ought to change for the sake of
 page_checksums=off performance, at which point we must protect the above
 scenario by other means.

Thanks for explaining.

Logic fixed.

   PageSetLSN() is not atomic, so the shared buffer content lock we'll be 
   holding
   is insufficient.
 
  Am serialising this by only writing PageLSN while holding buf hdr lock.
 
  That means also taking the buffer header spinlock in every PageGetLSN() 
  caller
  holding only a shared buffer content lock. ?Do you think that will pay off,
  versus the settled pattern of trading here your shared buffer content lock 
  for
  an exclusive one?

 Yes, I think it will pay off. This is the only code location where we
 do that, and we are already taking the buffer header lock, so there is
 effectively no overhead.

 The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
 (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
 spinlock at either of those locations.  If they remain safe under the new
 rules, we'll need comments explaining why.  I think they may indeed be safe,
 but it's far from obvious.

OK, some slight rework required to do that, no problems.

I checked all other call sites and have updated README to explain.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-09 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Heikki Linnakangas's message of jue feb 09 06:42:12 -0300 2012:
 src/tools/check_keywords.pl is your friend.

 I wondered a couple of weeks ago if we could, instead, generate the
 lists in gram.y from kwlist.h.

We've looked into that in the past.  bison does not seem to have any
sort of include directive, which means the only way to do it would be
to have the gram.y file be built from pieces at compile time.  Although
Michael is successfully doing something of the sort for ecpg, I find it
way too ugly in return for the amount of benefit we'd get.

If people are sufficiently worried about this, a better answer would be
to teach the makefiles to run check_keywords.pl during every build.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 14:51, Robert Haas robertmh...@gmail.com wrote:
 I'm not sure I entirely follow all this, but I'll look at the code
 once you have it.  Are you saying that all the comparetup_yadda
 functions are redundant to each other in the single-key case?

Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Kevin Grittner
On a development system which wasn't being monitored very closely
I think we've uncovered a bug in the listen/notify behavior.  The
system was struggling with lack of disk space for about a week
before this surfaced, so it's probably not the most critical sort
of bug we could have, but it probably merits a fix.
 
We were getting all sorts of logging about the space problem, and
PostgreSQL soldiered bravely on, doing the best it could under such
adverse conditions.  (That's pretty impressive, really.)  A lot of
messages like these:
 
[2012-02-05 01:24:36.816 CST] 14950 cc cc 127.0.0.1(35321) ERROR: 
could not extend file base/16401/46507734: wrote only 4096 of 8192
bytes at block 0
[2012-02-05 01:24:36.816 CST] 14950 cc cc 127.0.0.1(35321) HINT: 
Check free disk space.

[2012-02-05 01:25:40.643 CST] 5323 LOG:  could not write temporary
statistics file pg_stat_tmp/pgstat.tmp: No space left on device

[2012-02-05 01:26:38.971 CST] 15065 cc cc 127.0.0.1(53258) ERROR: 
could not write to hash-join temporary file: No space left on device

I don't know how much of a clue this is, but less than a minute
before our first message about pg_notify problems, the statistics
messages went from complaining about not being able to *write* the
file to also complaining about not being able to *close* the file --
like this:
 
[2012-02-05 01:26:41.159 CST] 5323 LOG:  could not close temporary
statistics file pg_stat_tmp/pgstat.tmp: No space left on device
[2012-02-05 01:26:44.663 CST] 5323 LOG:  could not write temporary
statistics file pg_stat_tmp/pgstat.tmp: No space left on device
[2012-02-05 01:26:45.705 CST] 16252 WARNING:  pgstat wait timeout
 
We got three errors in the log with exactly this DETAIL (file,
offset and message):
 
[2012-02-05 01:27:36.878 CST] 14892 cc cc 127.0.0.1(35320) ERROR: 
could not access status of transaction 0
[2012-02-05 01:27:36.878 CST] 14892 cc cc 127.0.0.1(35320) DETAIL:
 Could not write to file pg_notify/03A5 at offset 188416: No space
left on device.
[2012-02-05 01:27:36.878 CST] 14892 cc cc 127.0.0.1(35320)
STATEMENT:  COMMIT
[2012-02-05 01:27:36.882 CST] 14892 cc cc 127.0.0.1(35320) LOG: 
disconnection: session time: 0:07:00.600 user=cc database=cc
host=127.0.0.1 port=35320
 
After that the message changed to a new offset and a message of
Success. for the failure:
 
[2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931) ERROR: 
could not access status of transaction 0
[2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931) DETAIL:
 Could not read from file pg_notify/03A5 at offset 253952:
Success.
[2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931)
STATEMENT:  COMMIT
[2012-02-05 01:30:36.956 CST] 16383 cc cc 127.0.0.1(38931) LOG: 
disconnection: session time: 0:01:59.537 user=cc database=cc
host=127.0.0.1 port=38931
 
There were 4591 messages over the course of five days with exactly
that filename and offset and a message of Success., all for
transaction 0.  About midway through that time disk space was freed
and all other symptoms cleared.  It really is pretty impressive that
PostgreSQL survived that and was able to resume normal operations in
all other respects once disk space was freed for it.  Only this one
feature stayed in a bad state.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Robert Haas
When Heikki worked up his original index-only scan patches (which
didn't end up looking much like what eventually got committed), he had
the notion of an index-only qual.  That is, given a query like this:

select sum(1) from foo where substring(a,1,3) = 'abc';

We could evaluate the substring qual before performing the heap fetch,
and fetch the tuple from the heap only if the qual passes.  The
current code is capable of generating an index-only scan plan for this
query, but the MVCC visibility check always happens first: if the page
is all-visible, we go ahead and evaluate the qual using only the index
data, but if the page is not all-visible, we do the heap fetch first
and then check the qual only if the tuple is visible to our snapshot.
It would be nice to have the ability to do those checks in the other
order, in the case where the projected cost of checking the qual is
less than the projected cost of the heap fetch.  This would allow
index-only scans to win in more situations than they can right now,
because we'd conceivably avoid quite a bit of random I/O if the qual
is fairly selective and there are a decent number of visibility map
bits that are unset.

In fact, this technique might pay off even if we don't have a covering index:

select * from foo where substring(a,1,3) = 'abc';

If the expected selectivity of the qual is low and the index is a lot
smaller than the table, we might want to iterate through all the index
tuples in the entire index and fetch the heap tuples for only those
where the qual passes.  This would allow index-only scans - or
whatever term you want to use, since there's not much that's index
only about this case - to potentially win even when no
visibility-map bits are set at all.

Now, there's a fly in the ointment here, which is that applying
arbitrary user-defined functions to tuples that might not be visible
doesn't sound very safe.  The user-defined function in question might
perform some action based on those invisible tuples that has side
effects, which would be bad, because now we're violating MVCC
semantics.  Or it might throw an error, killing the scan dead on the
basis of the contents of some tuple that the scan shouldn't even see.
 However, there is certainly a class of functions for which this type
of optimization would be safe, and it's an awful lot like the set of
functions that can be safely pushed down through a security_barrier
view - namely, things that don't have side effects or throw errors.
So it's possible that the proleakproof flag KaiGai is proposing to add
to pg_proc could do double duty, serving also to identify when it's
safe to apply a qual to an index tuple when the corresponding heap
tuple might be invisible.  However, I have some reservations about
assuming that the two concepts are exactly the same.  For one thing,
people are inevitably going to want to cheat a little bit more here
than is appropriate for security views, and encouraging people to mark
things LEAKPROOF when they're really not is a security disaster
waiting to happen.  For another thing, there are some important cases
that this doesn't cover, like:

select * from foo where substring(a,1,3) like '%def%';

The like operator doesn't seem to be leakproof in the security sense,
because it can throw an error if the pattern is something like a
single backslash (ERROR:  LIKE pattern must not end with escape
character) and indeed it doesn't seem like it would be safe here
either if the pattern were stored in the table.  But if the pattern
were constant, it'd be OK, or almost OK: there's still the edge case
where the table contains invisible rows but no visible ones - whether
or not we complain about the pattern there ought to be the same as
whether or not we complain about it on a completely empty table.  If
we got to that point, then we might as well consider the qual
leakproof for security purposes under the same set of circumstances
we'd consider it OK to apply to possibly-invisible tuples.

I don't have any appetite for trying to do anything more with
index-only scans for 9.2, though maybe someone else will think
otherwise.  But I would like very much to get KaiGai's leakproof stuff
committed, and so it seems like a good idea to reconcile the needs of
that machinery with what might eventually be needed here.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Bruce Momjian
On Thu, Feb 09, 2012 at 03:36:23PM +, Peter Geoghegan wrote:
 On 9 February 2012 14:51, Robert Haas robertmh...@gmail.com wrote:
  I'm not sure I entirely follow all this, but I'll look at the code
  once you have it.  Are you saying that all the comparetup_yadda
  functions are redundant to each other in the single-key case?
 
 Yes, I am. The main reason that the loops exist in those functions
 (which is the only way that they substantially differ) is because they
 each have to get the other keys through various ways that characterise
 the tuple class that they encapsulate (index_getattr(),
 heap_getattr(), etc).

Does this help all types for sorting, including strings?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add protransform for numeric, varbit, and temporal types

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 7:18 AM, Noah Misch n...@leadboat.com wrote:
 The code has a thinko; a given interval typmod ultimately implies a single
 point from which we truncate rightward.  The precision only matters if the
 range covers SECOND.  Thanks; the attached patch improves this.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-09 Thread Peter Geoghegan
On 9 February 2012 17:16, Bruce Momjian br...@momjian.us wrote:
 Yes, I am. The main reason that the loops exist in those functions
 (which is the only way that they substantially differ) is because they
 each have to get the other keys through various ways that characterise
 the tuple class that they encapsulate (index_getattr(),
 heap_getattr(), etc).

 Does this help all types for sorting, including strings?

Yes, it does, though of course that's not expected to make too much
difference with text when the C locale isn't used, because the
comparisons are inherently much more expensive, and there isn't a
whole lot we can do about that, at least here.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 We got three errors in the log with exactly this DETAIL (file,
 offset and message):
 
 [2012-02-05 01:27:36.878 CST] 14892 cc cc 127.0.0.1(35320) ERROR: 
 could not access status of transaction 0
 [2012-02-05 01:27:36.878 CST] 14892 cc cc 127.0.0.1(35320) DETAIL:
  Could not write to file pg_notify/03A5 at offset 188416: No space
 left on device.
 
 After that the message changed to a new offset and a message of
 Success. for the failure:
 
 [2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931) ERROR: 
 could not access status of transaction 0
 [2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931) DETAIL:
  Could not read from file pg_notify/03A5 at offset 253952:
 Success.

IIRC, the could not access status of transaction bits are artifacts
stemming from the use of the SLRU infrastructure for pg_notify access.
Sometime we ought to think a bit harder about how to specialize SLRU's
error messages for the cases where what it's manipulating aren't XIDs.

Also, the Could not read ...: Success bit presumably ought to be
spelled Could not read ...: unexpected EOF.  I thought we'd fixed
that most places, but we must've missed a spot in SLRU.

However, both of those points are just cosmetic.  The substantive issue
is that you say it didn't resume normal behavior once space became
available again.  Can you provide more info about that?  In particular,
what behavior was being seen by the application?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 
 [2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931)
 ERROR:  could not access status of transaction 0
 [2012-02-05 01:30:36.952 CST] 16383 cc cc 127.0.0.1(38931)
 DETAIL:  Could not read from file pg_notify/03A5 at offset
 253952: Success.
 
 The substantive issue is that you say it didn't resume normal
 behavior once space became available again.  Can you provide more
 info about that?  In particular, what behavior was being seen by
 the application?
 
The application LISTENs on channel tcn and a trigger function is
attached to most permanent tables to NOTIFY for DML on that channel.
(See the recently committed triggered_change_notification patch if
you want to see the trigger function.)  At this point it might be
hard to determine (at least without re-creating the problem) whether
it was only transactions which issued a NOTIFY, but the application
never really got very far because of COMMIT statements failing with
the above message.
 
The report to us was that testers were unable to start the
application.  I believe that the above error on COMMIT kept the
application from getting past initial tests that the connection was
good.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 The application LISTENs on channel tcn and a trigger function is
 attached to most permanent tables to NOTIFY for DML on that channel.
 ...
 The report to us was that testers were unable to start the
 application.  I believe that the above error on COMMIT kept the
 application from getting past initial tests that the connection was
 good.

OK, so it was issuing a LISTEN and the LISTEN was failing at commit?
(It's a bit hard to believe that you'd be triggering one of the NOTIFYs
during initial connection tests.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-09 Thread Alvaro Herrera
Excerpts from Tom Lane's message of jue feb 09 12:17:59 -0300 2012:
 
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Heikki Linnakangas's message of jue feb 09 06:42:12 -0300 
  2012:
  src/tools/check_keywords.pl is your friend.
 
  I wondered a couple of weeks ago if we could, instead, generate the
  lists in gram.y from kwlist.h.
 
 We've looked into that in the past.  bison does not seem to have any
 sort of include directive, which means the only way to do it would be
 to have the gram.y file be built from pieces at compile time.

Doh, bummer.

 Although
 Michael is successfully doing something of the sort for ecpg, I find it
 way too ugly in return for the amount of benefit we'd get.

Agreed.

 If people are sufficiently worried about this, a better answer would be
 to teach the makefiles to run check_keywords.pl during every build.

FWIW that script is throwing a warning here:
Use of assignment to $[ is deprecated at 
/pgsql/source/HEAD/src/tools/check_keywords.pl line 19.

Also, we should at least do the attached to simplify the process.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


check-keywords.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-09 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue feb 09 12:17:59 -0300 2012:
 If people are sufficiently worried about this, a better answer would be
 to teach the makefiles to run check_keywords.pl during every build.

 FWIW that script is throwing a warning here:
 Use of assignment to $[ is deprecated at 
 /pgsql/source/HEAD/src/tools/check_keywords.pl line 19.

Any Perl hackers want to improve that script?

 Also, we should at least do the attached to simplify the process.
 +check:
 + $(top_srcdir)/src/tools/check_keywords.pl $(top_srcdir)

Actually, what would make sense in that line is to attach it to the
existing maintainer-check target, no?  I don't think top-level
make check descends into this directory.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 18:02, Robert Haas wrote:

I don't have any appetite for trying to do anything more with
index-only scans for 9.2, though maybe someone else will think
otherwise.  But I would like very much to get KaiGai's leakproof stuff
committed, and so it seems like a good idea to reconcile the needs of
that machinery with what might eventually be needed here.

Those were a couple of nice cases where index-only-scans
could win more than they does today. I have another one here:

2012-02-09 19:17:28.788 jk=# \d testtable
  Table public.testtable
 Column |   Type   |   Modifiers
+--+
 id | integer  | not null default nextval('testtable_id_seq'::regclass)
 fts| tsvector |
Indexes:
prk_idx UNIQUE, btree (id)
fts_id gin (fts)

2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts 
@@ to_tsquery('english','test1000');

  QUERY PLAN
---
 Bitmap Heap Scan on testtable  (cost=20.29..161.28 rows=37 width=4)
   Recheck Cond: (fts @@ '''test1000'''::tsquery)
   -  Bitmap Index Scan on fts_id  (cost=0.00..20.28 rows=37 width=0)
 Index Cond: (fts @@ '''test1000'''::tsquery)
(4 rows)

Time: 0.494 ms
2012-02-09 19:19:52.748 jk=#

In this situation the tuple can be regenerated from the index, but
not from the index-satisfying the where clause, this allows significantly
more complex where-clauses and may also benefit situations where
we only going for one or more of the primary-key/foreing-key columns
for join-conditions.

Above situation does not need to involve a gin-index, but a btree index
where the where clause can be matched up using one index, and the tuple
constructed using another falls into the same category.


--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 The application LISTENs on channel tcn and a trigger function is
 attached to most permanent tables to NOTIFY for DML on that
 channel.
 ...
 The report to us was that testers were unable to start the
 application.  I believe that the above error on COMMIT kept the
 application from getting past initial tests that the connection
 was good.
 
 OK, so it was issuing a LISTEN and the LISTEN was failing at
 commit?  (It's a bit hard to believe that you'd be triggering one
 of the NOTIFYs during initial connection tests.)
 
In continued digging through logs I found something to indicate the
transaction on which COMMIT failed for 2528 of the failures.  In all
cases the transaction made a change which would have fired a NOTIFY
at commit (in a deferred trigger).
 
So that initial assumption about the cause doesn't hold up.  I'm not
sure at this point why the tester perception was that they couldn't
get in.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Notify system doesn't recover from No space error

2012-02-09 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 In continued digging through logs I found something to indicate
 the transaction on which COMMIT failed for 2528 of the failures. 
 In all cases the transaction made a change which would have fired
 a NOTIFY at commit (in a deferred trigger).
 
Arg.  Sorry, I got confused there.  No deferred trigger involved in
the triggered change notification; that is only in the replication
code.  I haven't seen anything to suggest that the replication
trigger is contributing to this, but I guess I couldn't rule it out,
either.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-09 Thread Andrew Dunstan



On 02/07/2012 03:03 PM, Tom Lane wrote:

Andrew Dunstanand...@dunslane.net  writes:

On 02/07/2012 12:47 PM, Tom Lane wrote:

In general I think we'd have to require that colnames be supplied in all
RowExprs if we go this way.  Anyplace that's trying to slide by without
will have to be fixed.  I don't recall how many places that is.

I just had a thought that maybe we could make this simpler by dummying
up a list of colnames if we don't have one, instead of that assertion.
Or am I on the wrong track.

Well, if there are more than one or two RowExpr creators for which a
dummy set of colnames is the best we can do anyway, that might be a
reasonable answer.  But I think it would encourage people to be lazy
and let the dummy colnames be used even when they can do better, so
I'd rather not take this as a first-choice solution.




OK, the one place that needs to be fixed to avoid the crash caused by 
the json regression tests with the original patch is in


   src/backend/parser/parse_expr.c:transformRowExpr().

Other candidates I have found that don't set colnames and should 
probably use dummy names are:


 * src/backend/parser/gram.y (row: production)
 * src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()
 * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()


Given a function:

   extern List *makeDummyColnames(List * args);


what's the best place to put it? I couldn't see a terribly obvious place 
in the source.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-09 Thread Alvaro Herrera

Excerpts from Andrew Dunstan's message of jue feb 09 16:38:27 -0300 2012:

 OK, the one place that needs to be fixed to avoid the crash caused by 
 the json regression tests with the original patch is in
 
 src/backend/parser/parse_expr.c:transformRowExpr().
 
 Other candidates I have found that don't set colnames and should 
 probably use dummy names are:
 
   * src/backend/parser/gram.y (row: production)
   * src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()
   * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()
 
 
 Given a function:
 
 extern List *makeDummyColnames(List * args);
 
 
 what's the best place to put it? I couldn't see a terribly obvious place 
 in the source.

parse_relation.c?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-09 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 OK, the one place that needs to be fixed to avoid the crash caused by 
 the json regression tests with the original patch is in

 src/backend/parser/parse_expr.c:transformRowExpr().

 Other candidates I have found that don't set colnames and should 
 probably use dummy names are:

   * src/backend/parser/gram.y (row: production)
   * src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()
   * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()

Hm, can't the last two get real column names from somewhere?  Also, why
would it be the responsibility of the grammar production to fill in the
list, rather than transformRowExpr?

 Given a function:
 extern List *makeDummyColnames(List * args);
 what's the best place to put it? I couldn't see a terribly obvious place 
 in the source.

If there are enough potential users to justify having such a global
function at all, then we might as well not bother but just let execQual
fill in dummy names on the fly.  Providing such a function means that
there is nothing whatever pushing writers of new RowExpr constructors to
not be lazy --- the path of least resistance will be to call
makeDummyColnames.  I was hoping that there would be few enough places
where no information is available that we'd not need to have a global
function like that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-09 Thread Andrew Dunstan



On 02/09/2012 02:54 PM, Tom Lane wrote:

Andrew Dunstanand...@dunslane.net  writes:

OK, the one place that needs to be fixed to avoid the crash caused by
the json regression tests with the original patch is in
 src/backend/parser/parse_expr.c:transformRowExpr().
Other candidates I have found that don't set colnames and should
probably use dummy names are:
   * src/backend/parser/gram.y (row: production)
   * src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()
   * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()

Hm, can't the last two get real column names from somewhere?


Possibly. I'll dig a bit deeper.


Also, why
would it be the responsibility of the grammar production to fill in the
list, rather than transformRowExpr?





Sure. I'll just comment the source accordingly.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 1:33 PM, Jesper Krogh jes...@krogh.cc wrote:
 On 2012-02-09 18:02, Robert Haas wrote:

 I don't have any appetite for trying to do anything more with
 index-only scans for 9.2, though maybe someone else will think
 otherwise.  But I would like very much to get KaiGai's leakproof stuff
 committed, and so it seems like a good idea to reconcile the needs of
 that machinery with what might eventually be needed here.

 Those were a couple of nice cases where index-only-scans
 could win more than they does today. I have another one here:

 2012-02-09 19:17:28.788 jk=# \d testtable
                          Table public.testtable
  Column |   Type   |                       Modifiers
 +--+
  id     | integer  | not null default nextval('testtable_id_seq'::regclass)
  fts    | tsvector |
 Indexes:
    prk_idx UNIQUE, btree (id)
    fts_id gin (fts)

 2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts @@
 to_tsquery('english','test1000');
                              QUERY PLAN
 ---
  Bitmap Heap Scan on testtable  (cost=20.29..161.28 rows=37 width=4)
   Recheck Cond: (fts @@ '''test1000'''::tsquery)
   -  Bitmap Index Scan on fts_id  (cost=0.00..20.28 rows=37 width=0)
         Index Cond: (fts @@ '''test1000'''::tsquery)
 (4 rows)

 Time: 0.494 ms
 2012-02-09 19:19:52.748 jk=#

 In this situation the tuple can be regenerated from the index, but
 not from the index-satisfying the where clause, this allows significantly
 more complex where-clauses and may also benefit situations where
 we only going for one or more of the primary-key/foreing-key columns
 for join-conditions.

I don't understand what you're saying here.

 Above situation does not need to involve a gin-index, but a btree index
 where the where clause can be matched up using one index, and the tuple
 constructed using another falls into the same category.

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-09 Thread Jay Levitt

Tom Lane wrote:

Jay Levittjay.lev...@gmail.com  writes:

[Posted at Andres's request]
TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in
various builds.



1. In 9.1.2, inserting 10x rows takes 19x the time.
 - 9.1-HEAD and 9.2 fix this; it now slows down linearly
 - but: 10s  8s  5s!
 - but: comparing Ubuntu binary w/vanilla source build on virtual disks,
might not be significant


FWIW, I find it really hard to believe that there is any real difference
between 9.1.2 and 9.1 branch tip on this.  There have been no
significant changes in either the gist or contrib/cube code in that
branch.  I suspect you have a measurement issue there.


I suspect you're right, given that five runs in a row produced times from 7s 
to 10s.  I just wanted to include it for completeness and in case it 
triggered any a-ha moments.



4. 9.1-HEAD never successfully indexes 10 million rows (never = at least
20 minutes on two runs; I will follow up in a few hours)


Works for me (see above), though it's slower than you might've expected.


So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and 
(probably both of our) 9.1-HEAD takes 1918s... is that something to worry 
about, and if so, are there any tests I can run to assist? That bug doesn't 
affect me personally, but y'know, community and all that.  Also, I wonder if 
it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way 
slower doing Y., and this is a canary in the coal mine.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] psql tab completion for SELECT

2012-02-09 Thread Peter Eisentraut
In his blog entry http://www.depesz.com/2011/07/08/wish-list-for-psql/
depesz described a simple way to do tab completion for SELECT in psql:


Make tab-completion complete also function names – like: SELECT
pg_gettabtab to see all functions that start with pg_get.

Make tab-completion work for columns in SELECT. I know that when writing
SELECT clause, psql doesn’t know which table it will deal with, but it
could search through all the columns in database.


That seems pretty useful, and it's more or less a one-line change, as in
the attached patch.

diff --git i/src/bin/psql/tab-complete.c w/src/bin/psql/tab-complete.c
index 3854f7f..416aa2f 100644
--- i/src/bin/psql/tab-complete.c
+++ w/src/bin/psql/tab-complete.c
@@ -2555,7 +2555,9 @@ psql_completion(char *text, int start, int end)
 		COMPLETE_WITH_CONST(IS);
 
 /* SELECT */
-	/* naah . . . */
+	/* complete with columns and functions */
+	else if (pg_strcasecmp(prev_wd, SELECT) == 0)
+		COMPLETE_WITH_QUERY(SELECT name FROM (SELECT attname FROM pg_attribute UNION SELECT proname || '(' FROM pg_proc) t (name) WHERE substring(name,1,%d)='%s');
 
 /* SET, RESET, SHOW */
 	/* Complete with a variable name */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-09 Thread Marti Raudsepp
Hi!

I've always been a little wary of using the TRUNCATE command due to
the warning in the documentation about it not being MVCC-safe:
queries may silently give wrong results and it's hard to tell when
they are affected.

That got me thinking, why can't we handle this like a standby server
does -- if some query has data removed from underneath it, it aborts
with a serialization failure.

Does this solution sound like a good idea?

The attached patch is a lame attempt at implementing this. I added a
new pg_class.relvalidxmin attribute which tracks the Xid of the last
TRUNCATE (maybe it should be called reltruncatexid?). Whenever
starting a relation scan with a snapshot older than relvalidxmin, an
error is thrown. This seems to work out well since TRUNCATE updates
pg_class anyway, and anyone who sees the new relfilenode automatically
knows when it was truncated.

Am I on the right track? Are there any better ways to attach this
information to a relation?
Should I also add another counter to pg_stat_database_conflicts?
Currently this table is only used on standby servers.

Since I wrote it just this afternoon, there are a few things still
wrong with the patch (it doesn't handle xid wraparound for one), so
don't be too picky about the code yet. :)

Example:
  CREATE TABLE foo (i int);
Session A:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT txid_current(); -- Force snapshot open
Session B:
  TRUNCATE TABLE foo;
Session A:
  SELECT * FROM foo;
ERROR:  canceling statement due to conflict with TRUNCATE TABLE on foo
DETAIL:  Rows visible to this transaction have been removed.


Patch also available in my github 'truncate' branch:
https://github.com/intgr/postgres/commits/truncate

Regards,
Marti
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
new file mode 100644
index 99a431a..e909b17
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** heap_beginscan_internal(Relation relatio
*** 1175,1180 
--- 1175,1196 
  	HeapScanDesc scan;
  
  	/*
+ 	 * If the snapshot is older than relvalidxmin then that means TRUNCATE has
+ 	 * removed data that we should still see. Abort rather than giving
+ 	 * potentially bogus results.
+ 	 */
+ 	if (relation-rd_rel-relvalidxmin != InvalidTransactionId 
+ 		snapshot-xmax != InvalidTransactionId 
+ 		NormalTransactionIdPrecedes(snapshot-xmax, relation-rd_rel-relvalidxmin))
+ 	{
+ 		ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+  errmsg(canceling statement due to conflict with TRUNCATE TABLE on %s,
+ 		 NameStr(relation-rd_rel-relname)),
+  errdetail(Rows visible to this transaction have been removed.)));
+ 	}
+ 
+ 	/*
  	 * increment relation ref count while scanning relation
  	 *
  	 * This is just to make really sure the relcache entry won't go away while
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
new file mode 100644
index aef410a..3f9bd5d
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*** InsertPgClassTuple(Relation pg_class_des
*** 787,792 
--- 787,793 
  	values[Anum_pg_class_relhastriggers - 1] = BoolGetDatum(rd_rel-relhastriggers);
  	values[Anum_pg_class_relhassubclass - 1] = BoolGetDatum(rd_rel-relhassubclass);
  	values[Anum_pg_class_relfrozenxid - 1] = TransactionIdGetDatum(rd_rel-relfrozenxid);
+ 	values[Anum_pg_class_relvalidxmin - 1] = TransactionIdGetDatum(rd_rel-relvalidxmin);
  	if (relacl != (Datum) 0)
  		values[Anum_pg_class_relacl - 1] = relacl;
  	else
*** AddNewRelationTuple(Relation pg_class_de
*** 884,889 
--- 885,891 
  		new_rel_reltup-relfrozenxid = InvalidTransactionId;
  	}
  
+ 	new_rel_reltup-relvalidxmin = InvalidTransactionId;
  	new_rel_reltup-relowner = relowner;
  	new_rel_reltup-reltype = new_type_oid;
  	new_rel_reltup-reloftype = reloftype;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
new file mode 100644
index bfbe642..0d96a6a
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
*** reindex_index(Oid indexId, bool skip_con
*** 2854,2860 
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
--- 2854,2860 
  		}
  
  		/* We'll build a new physical relation for the index */
! 		RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidTransactionId);
  
  		/* Initialize the index and rebuild */
  		/* Note: we do not need to re-establish pkey setting */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
new file mode 100644
index 1f95abc..ab4aec2
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
*** ResetSequence(Oid seq_relid)
*** 287,293 
  	 * Create a new storage file for the sequence.	We want to keep the
  	 * sequence's 

Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right, in my head, everything it wired up against my primary
keys, of-course that isn't the case for the DB. Sorry for the noise.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql tab completion for SELECT

2012-02-09 Thread Dimitri Fontaine
Peter Eisentraut pete...@gmx.net writes:
 Make tab-completion complete also function names – like: SELECT
 pg_gettabtab to see all functions that start with pg_get.

 Make tab-completion work for columns in SELECT. I know that when writing
 SELECT clause, psql doesn’t know which table it will deal with, but it
 could search through all the columns in database.

 That seems pretty useful, and it's more or less a one-line change, as in
 the attached patch.

Does that includes support for completing SRF functions in the FROM clause?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] xlog location arithmetic

2012-02-09 Thread Euler Taveira de Oliveira
On 08-02-2012 09:35, Fujii Masao wrote:

Fujii, new patch attached. Thanks for your tests.

 But another problem happened. When I changed pg_proc.h so that the unused
 OID was assigned to pg_xlog_location_diff(), and executed the above again,
 I encountered the segmentation fault:
 
I reproduced the problems in my old 32-bit laptop. I fixed it casting to
int64. I also updated the duplicated OID.

 Why OID needs to be reassigned?
 
There isn't a compelling reason. It is just a way to say: hey, it is another
function with the same old name.

I'll not attach another version for pg_size_pretty because it is a matter of
updating a duplicated OID.


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 236a60a..826f002 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -14446,11 +14446,15 @@ SELECT set_config('log_statement_stats', 'off', false);
indexterm
 primarypg_xlogfile_name_offset/primary
/indexterm
+   indexterm
+primarypg_xlog_location_diff/primary
+   /indexterm
 
para
 The functions shown in xref
 linkend=functions-admin-backup-table assist in making on-line backups.
-These functions cannot be executed during recovery.
+	These functions cannot be executed during recovery (except
+	functionpg_xlog_location_diff/function).
/para
 
table id=functions-admin-backup-table
@@ -14518,6 +14522,13 @@ SELECT set_config('log_statement_stats', 'off', false);
entrytypetext/, typeinteger//entry
entryConvert transaction log location string to file name and decimal byte offset within file/entry
   /row
+  row
+   entry
+literalfunctionpg_xlog_location_diff(parameterlocation/ typetext/, parameterlocation/ typetext/)/function/literal
+/entry
+   entrytypenumeric//entry
+   entryCalculate the difference between two transaction log locations/entry
+  /row
  /tbody
 /tgroup
/table
@@ -14611,6 +14622,13 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
/para
 
para
+	functionpg_xlog_location_diff/ calculates the difference in bytes
+	between two transaction log locations. It can be used with
+	structnamepg_stat_replication/structname or some functions shown in
+	xref linkend=functions-admin-backup-table to get the replication lag.
+   /para
+
+   para
 For details about proper usage of these functions, see
 xref linkend=continuous-archiving.
/para
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 2e10d4d..be7d388 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -26,6 +26,7 @@
 #include replication/walreceiver.h
 #include storage/smgr.h
 #include utils/builtins.h
+#include utils/numeric.h
 #include utils/guc.h
 #include utils/timestamp.h
 
@@ -465,3 +466,83 @@ pg_is_in_recovery(PG_FUNCTION_ARGS)
 {
 	PG_RETURN_BOOL(RecoveryInProgress());
 }
+
+static void
+validate_xlog_location(char *str)
+{
+#define	MAXLSNCOMPONENT		8
+
+	int	len1, len2;
+
+	len1 = strspn(str, 0123456789abcdefABCDEF);
+	if (len1  1 || len1  MAXLSNCOMPONENT || str[len1] != '/')
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+			 errmsg(invalid input syntax for transaction log location: \%s\, str)));
+	len2 = strspn(str + len1 + 1, 0123456789abcdefABCDEF);
+	if (len2  1 || len2  MAXLSNCOMPONENT || str[len1 + 1 + len2] != '\0')
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+			 errmsg(invalid input syntax for transaction log location: \%s\, str)));
+}
+
+/*
+ * Compute the difference in bytes between two WAL locations.
+ */
+Datum
+pg_xlog_location_diff(PG_FUNCTION_ARGS)
+{
+	text		*location1 = PG_GETARG_TEXT_P(0);
+	text		*location2 = PG_GETARG_TEXT_P(1);
+	char		*str1, *str2;
+	XLogRecPtr	loc1, loc2;
+	Numeric		result;
+
+	/*
+	 * Read and parse input
+	 */
+	str1 = text_to_cstring(location1);
+	str2 = text_to_cstring(location2);
+
+	validate_xlog_location(str1);
+	validate_xlog_location(str2);
+
+	if (sscanf(str1, %X/%X, loc1.xlogid, loc1.xrecoff) != 2)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			errmsg(could not parse transaction log location \%s\, str1)));
+	if (sscanf(str2, %X/%X, loc2.xlogid, loc2.xrecoff) != 2)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			errmsg(could not parse transaction log location \%s\, str2)));
+
+	/*
+	 * Sanity check
+	 */
+	if (loc1.xrecoff  XLogFileSize)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			errmsg(xrecoff \%X\ is out of valid range, 0..%X, loc1.xrecoff, XLogFileSize)));
+	if (loc2.xrecoff  XLogFileSize)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			errmsg(xrecoff \%X\ is out of valid range, 0..%X, loc2.xrecoff, XLogFileSize)));
+
+	/*
+	 * result = 

[HACKERS] Patch: fix pg_dump for inherited defaults not-null flags

2012-02-09 Thread Tom Lane
Attached is a proposed patch to deal with the issue described here:
http://archives.postgresql.org/pgsql-bugs/2012-02/msg0.php

Even though we'd previously realized that comparing the text of
inherited CHECK expressions is an entirely unsafe way to detect
expression equivalence (cf comments for guessConstraintInheritance),
pg_dump is still doing that for inherited DEFAULT expressions, with
the predictable result that it does the wrong thing in this sort
of example.

Furthermore, as I looked more closely at the code, I realized that
there is another pretty fundamental issue: if an inherited column has a
default expression or NOT NULL bit that it did not inherit from its
parent, flagInhAttrs forces the column to be treated as non-inherited,
so that it will be emitted as part of the child table's CREATE TABLE
command.  This is *wrong* if the column is not attislocal; it will
result in the column incorrectly having the attislocal property after
restore.  (Note: such a situation could only arise if the user had
altered the column's default or NOT NULL property with ALTER TABLE after
creation.)

All of this logic predates the invention of attislocal, and really is
attempting to make up for the lack of that bit, so it's not all that
surprising that it falls down.

So the attached patch makes the emit-column-or-not behavior depend only
on attislocal (except for binary upgrade which has its own kluge
solution for the problem; though note that the whether-to-dump tests now
exactly match the special cases for binary upgrade, which they did not
before).  Also, I've dropped the former attempts to exploit inheritance
of defaults, and so the code will now emit a default explicitly for each
child in an inheritance hierarchy, even if it didn't really need to.
Since the backend doesn't track whether defaults are inherited, this
doesn't cause any failure to restore the catalog state properly.

Although this is a bug fix, it's a nontrivial change in the logic and
so I'm hesitant to back-patch into stable branches.  Given the lack of
prior complaints, maybe it would be best to leave it unfixed in existing
branches?  Not sure.  Thoughts?

regards, tom lane

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 5a72d0da917a368ebbdbf164406c183693eb939c..266441df61d2c1eedd7faa39b8987254fe076f1c 100644
*** a/src/bin/pg_dump/common.c
--- b/src/bin/pg_dump/common.c
*** flagInhTables(TableInfo *tblinfo, int nu
*** 281,287 
  
  /* flagInhAttrs -
   *	 for each dumpable table in tblinfo, flag its inherited attributes
!  * so when we dump the table out, we don't dump out the inherited attributes
   *
   * modifies tblinfo
   */
--- 281,293 
  
  /* flagInhAttrs -
   *	 for each dumpable table in tblinfo, flag its inherited attributes
!  *
!  * What we need to do here is detect child columns that inherit NOT NULL
!  * bits from their parents (so that we needn't specify that again for the
!  * child) and child columns that have DEFAULT NULL when their parents had
!  * some non-null default.  In the latter case, we make up a dummy AttrDefInfo
!  * object so that we'll correctly emit the necessary DEFAULT NULL clause;
!  * otherwise the backend will apply an inherited default to the column.
   *
   * modifies tblinfo
   */
*** flagInhAttrs(TableInfo *tblinfo, int num
*** 297,303 
  		TableInfo  *tbinfo = (tblinfo[i]);
  		int			numParents;
  		TableInfo **parents;
- 		TableInfo  *parent;
  
  		/* Sequences and views never have parents */
  		if (tbinfo-relkind == RELKIND_SEQUENCE ||
--- 303,308 
*** flagInhAttrs(TableInfo *tblinfo, int num
*** 314,445 
  		if (numParents == 0)
  			continue;			/* nothing to see here, move along */
  
! 		/*
! 		 * For each attr, check the parent info: if no parent has an attr
! 		 * with the same name, then it's not inherited. If there *is* an
! 		 * attr with the same name, then only dump it if:
! 		 *
! 		 * - it is NOT NULL and zero parents are NOT NULL
! 		 *	 OR
! 		 * - it has a default value AND the default value does not match
! 		 *	 all parent default values, or no parents specify a default.
! 		 *
! 		 * See discussion on -hackers around 2-Apr-2001.
! 		 *
! 		 */
  		for (j = 0; j  tbinfo-numatts; j++)
  		{
- 			bool		foundAttr;		/* Attr was found in a parent */
  			bool		foundNotNull;	/* Attr was NOT NULL in a parent */
! 			bool		defaultsMatch;	/* All non-empty defaults match */
! 			bool		defaultsFound;	/* Found a default in a parent */
! 			AttrDefInfo *attrDef;
! 
! 			foundAttr = false;
! 			foundNotNull = false;
! 			defaultsMatch = true;
! 			defaultsFound = false;
  
! 			attrDef = tbinfo-attrdefs[j];
  
  			for (k = 0; k  numParents; k++)
  			{
  int			inhAttrInd;
  
- parent = parents[k];
  inhAttrInd = strInArray(tbinfo-attnames[j],
  			

Re: [HACKERS] psql tab completion for SELECT

2012-02-09 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 That seems pretty useful, and it's more or less a one-line change, as in
 the attached patch.

That seems pretty nearly entirely bogus.  What is the argument for
supposing that the word right after SELECT is a function name?  I would
think it would be a column name (from who-knows-what table) much more
often.  Also, if it is useful, people will expect it to work in more
places than just the first word after SELECT --- for instance, somebody
who didn't realize what a kluge it was would expect it to also work
right after a top-level comma after SELECT.  Or probably after a left
parenthesis in the SELECT list.  Etc.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers