[HACKERS] New replication mode: write

2012-01-12 Thread Fujii Masao
Hi,

http://archives.postgresql.org/message-id/AANLkTilgyL3Y1jkDVHX02433COq7JLmqicsqmOsbuyA1%40mail.gmail.com

Previously I proposed the replication mode "recv" on the above thread, but it
has not been committed yet. Now I'd like to propose that mode again, because
it's useful for reducing the overhead of synchronous replication. The attached
patch implements that mode.

If you choose that mode, a transaction waits for its WAL to be write()'d on
the standby, i.e., until the standby has saved the WAL in memory. This
provides a lower level of durability than current synchronous replication
(where the transaction waits for its WAL to be flushed to disk on the
standby). However, it's a practically useful setting because it decreases the
response time for the transaction, and it causes no data loss unless both the
master and the standby crash and the master's database gets corrupted at the
same time.
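
For illustration, the durability difference involved can be sketched in a tiny
standalone C program (not part of the patch; the file path is only a
placeholder):

/*
 * Illustration only (not part of the patch): the durability step that the
 * proposed "write" mode skips on the standby.  write() leaves the data in
 * the standby's OS cache, which survives a PostgreSQL crash; fsync() forces
 * it to disk, which also survives an OS crash.  "/tmp/wal-demo" is a
 * placeholder path.
 */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	char	buf[8192] = {0};	/* stand-in for a received WAL block */
	int		fd = open("/tmp/wal-demo", O_WRONLY | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
		return 1;				/* the proposed "write" mode can reply here */
	if (fsync(fd) != 0)
		return 1;				/* the existing "on" mode replies only after the flush */
	close(fd);
	return 0;
}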

In the patch, you can choose that mode by setting synchronous_commit to write.
I renamed that mode from "recv" to "write" on the basis of its actual behavior.

I measured how much the "write" mode improves performance in synchronous
replication.
Here is the result:

synchronous_commit = on
tps = 424.510843 (including connections establishing)
tps = 420.767883 (including connections establishing)
tps = 419.715658 (including connections establishing)
tps = 428.810001 (including connections establishing)
tps = 337.341445 (including connections establishing)

synchronous_commit = write
tps = 550.752712 (including connections establishing)
tps = 407.104036 (including connections establishing)
tps = 455.576190 (including connections establishing)
tps = 453.548672 (including connections establishing)
tps = 555.171325 (including connections establishing)

I used pgbench (scale factor = 100) as a benchmark and ran the
following command.

pgbench -c 8 -j 8 -T 60 -M prepared

I always ran CHECKPOINT on both the master and the standby before starting each
pgbench run, to prevent checkpoints from affecting the results of the performance test.

Thoughts? Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1559,1565  SET ENABLE_SEQSCAN TO OFF;
 
  Specifies whether transaction commit will wait for WAL records
  to be written to disk before the command returns a success
! indication to the client.  Valid values are on,
  local, and off.  The default, and safe, value
  is on.  When off, there can be a delay between
  when success is reported to the client and when the transaction is
--- 1559,1565 
 
  Specifies whether transaction commit will wait for WAL records
  to be written to disk before the command returns a success
! indication to the client.  Valid values are on, write,
  local, and off.  The default, and safe, value
  is on.  When off, there can be a delay between
  when success is reported to the client and when the transaction is
***
*** 1579,1589  SET ENABLE_SEQSCAN TO OFF;
  If  is set, this
  parameter also controls whether or not transaction commit will wait
  for the transaction's WAL records to be flushed to disk and replicated
! to the standby server.  The commit wait will last until a reply from
! the current synchronous standby indicates it has written the commit
! record of the transaction to durable storage.  If synchronous
  replication is in use, it will normally be sensible either to wait
! both for WAL records to reach both the local and remote disks, or
  to allow the transaction to commit asynchronously.  However, the
  special value local is available for transactions that
  wish to wait for local flush to disk, but not synchronous replication.
--- 1579,1597 
  If  is set, this
  parameter also controls whether or not transaction commit will wait
  for the transaction's WAL records to be flushed to disk and replicated
! to the standby server.  When write, the commit wait will
! last until a reply from the current synchronous standby indicates
! it has written the commit record of the transaction to memory.
! Normally this causes no data loss at the time of failover.  However,
! if both the primary and the standby crash, and the database cluster of
! the primary gets corrupted, recently committed transactions might
! be lost.  When on, the commit wait will last until a reply
! from the current synchronous standby indicates it has flushed
! the commit record of the transaction to durable storage.  This
! avoids any data loss unless the database clusters of both the primary
! and the standby get corrupted simultaneously.  If synchronous
  replication is in 

[HACKERS] review of: collation for (expr)

2012-01-12 Thread probabble
Compiling on Ubuntu 10.04 LTS AMD64, on a GoGrid virtual machine, from a
2012-01-12 checkout.

Bison was upgraded to v2.5, then downgraded to v2.4.1.

The make process for both versions resulted in the following errors:

make[3]: Leaving directory `/usr/local/src/pgbuild/src/backend/catalog'
make -C parser gram.h
make[3]: Entering directory `/usr/local/src/pgbuild/src/backend/parser'
/usr/local/bin/bison -d  -o gram.c gram.y
gram.y: conflicts: 370 reduce/reduce
gram.y: expected 0 reduce/reduce conflicts
gram.y:10482.27-10494.33: warning: rule useless in parser due to conflicts:
func_expr: COLLATION FOR '(' a_expr ')'
make[3]: *** [gram.c] Error 1
make[3]: Leaving directory `/usr/local/src/pgbuild/src/backend/parser'
make[2]: *** [parser/gram.h] Error 2
make[2]: Leaving directory `/usr/local/src/pgbuild/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/usr/local/src/pgbuild/src'
make: *** [all-src-recurse] Error 2


[HACKERS] Patch: DROP INDEX CONCURRENTLY partially reviewed

2012-01-12 Thread Linh Nguyen
https://commitfest.postgresql.org/action/patch_view?id=742

Compiled on Ubuntu 10.04.2 LTS from 2012-01-12 git checkout

- patch successfully applied
- docs built and viewable and reviewed
- source/patch built
- make check ran successfully
- additional testing with pgbench passed

Completed steps 1, 2, and 3 of the Reviewers' Guide.




[HACKERS] gistVacuumUpdate

2012-01-12 Thread YAMAMOTO Takashi
hi,

gistVacuumUpdate was removed when old-style VACUUM FULL was removed.
I wonder why. Was it not considered practical, with REINDEX preferred instead?

Anyway, the removal seems incomplete and there are some leftovers:
F_TUPLES_DELETED
F_DELETED
XLOG_GIST_PAGE_DELETE

YAMAMOTO Takashi



[HACKERS] Postgres ReviewFest Patch: URI connection string support for libpq

2012-01-12 Thread Nick Roosevelt
Just reviewed the patch for adding URI connection string support to libpq.

There seem to be many tabs in the patch.  Perhaps the indentation is not
correct.

Also, the patch did not apply cleanly:

patching file src/interfaces/libpq/fe-connect.c
Hunk #1 succeeded at 282 with fuzz 1.
Hunk #2 FAILED at 300.
Hunk #3 FAILED at 583.
Hunk #4 FAILED at 3742.
Hunk #5 FAILED at 4132.
Hunk #6 FAILED at 4279.
Hunk #7 FAILED at 4455.
6 out of 7 hunks FAILED -- saving rejects to file
src/interfaces/libpq/fe-connect.c.rej

It seems the file has changed since this patch was created.  Please
recreate the patch.

Thanks,

Nick Roosevelt


[HACKERS] 9.2 Reviewfest - logging hooks patch

2012-01-12 Thread Christopher Maujean
Patch



In reviewing the referenced patch, we found that it has no developer
documentation and no regression tests. While it compiles and installs,
there is nothing to tell us how to use (or test) it.


--Christopher







Re: [HACKERS] Standalone synchronous master

2012-01-12 Thread Fujii Masao
On Wed, Jan 4, 2012 at 11:28 PM, Aidan Van Dyk  wrote:
> So, I'm a big fan of syncrep guaranteeing its guarantees.  To me,
> that's the whole point.  Having it "fall out of sync rep" at any point
> *automatically* seems to be exactly counter to the point of sync rep.

Yes, what Alexander proposed is not sync rep; it's a new replication mode.
If we adopt the proposal, we have three replication modes (async, sync, and
what Alexander proposed), like Oracle Data Guard provides. If you need the
guarantee which sync rep provides, you can choose sync as the replication
mode.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] Text comparison suddenly can't find collation?

2012-01-12 Thread Johann 'Myrkraverk' Oskarsson
Hi all,

Why would a string comparison work in one case and not another?  In
the following example, it works to compare a and b, but not a and d.

This is in a C module which calls

  DirectFunctionCall2( text_le, d1, d2 );

DEBUG:  Comparing a == b
DEBUG:  Comparing a == d
ERROR: could not determine which collation to use for string
comparison
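
For reference, a possible workaround (a sketch, not from the original report)
is to supply a collation explicitly through the Coll variant of the fmgr call;
whether DEFAULT_COLLATION_OID is the right collation here depends on where the
datums actually came from:

#include "postgres.h"

#include "catalog/pg_collation.h"
#include "fmgr.h"
#include "utils/builtins.h"

/*
 * Sketch: compare two text datums with an explicit collation.  Plain
 * DirectFunctionCall2() passes InvalidOid as the collation, which is what
 * produces "could not determine which collation to use"; the *Coll variant
 * lets the caller supply one.  DEFAULT_COLLATION_OID is only a guess at the
 * collation the surrounding code actually wants.
 */
static bool
text_le_with_default_collation(Datum d1, Datum d2)
{
	return DatumGetBool(DirectFunctionCall2Coll(text_le,
												DEFAULT_COLLATION_OID,
												d1, d2));
}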

I'll try and see if I can come up with a minimal example in a bit.


-- 
   Johann Oskarsson   http://www.2ndquadrant.com/
   PostgreSQL Development, 24x7 Support, Training and Services

   Blog: http://my.opera.com/myrkraverk/blog/



Re: [HACKERS] reprise: pretty print viewdefs

2012-01-12 Thread Hitoshi Harada
On Tue, Dec 27, 2011 at 8:02 AM, Andrew Dunstan  wrote:
>
>
> Updated, with docs and tests. Since the docs mark the versions of
> pg_get_viewdef() that take text as the first param as deprecated, I removed
> that variant of the new flavor. I left adding extra psql support to another
> day - I think this already does a good job of cleaning it up without any
> extra adjustments.
>
> I'll add this to the upcoming commitfest.

I've looked at the patch (actually tested with folks in the reviewfest, so not
in great detail :P). It applies with some offset hunks and compiles cleanly,
documentation and tests are included, and make install passed without
failures.

$ patch -p1 < ~/Downloads/viewdef.patch
patching file doc/src/sgml/func.sgml
Hunk #1 succeeded at 13736 (offset -22 lines).
Hunk #2 succeeded at 13747 (offset -22 lines).
patching file src/backend/utils/adt/ruleutils.c
patching file src/include/catalog/pg_proc.h
Hunk #1 succeeded at 3672 (offset -43 lines).
patching file src/include/utils/builtins.h
Hunk #1 succeeded at 604 (offset -20 lines).
patching file src/test/regress/expected/polymorphism.out
Hunk #1 succeeded at 1360 (offset -21 lines).
patching file src/test/regress/expected/rules.out
patching file src/test/regress/expected/with.out
patching file src/test/regress/sql/rules.sql

The code change is only in ruleutils.c, which looks sane to me, apart from
some trailing whitespace. I wonder if it's worth working on more than the
target list, namely the FROM clause and expressions.

For example, pg_get_viewdef('pg_stat_activity', true).

# select pg_get_viewdef('pg_stat_activity'::regclass, true);
                                 pg_get_viewdef
---------------------------------------------------------------------------------
  SELECT s.datid, d.datname, s.procpid, s.usesysid, u.rolname AS usename,       +
     s.application_name, s.client_addr, s.client_hostname, s.client_port,       +
     s.backend_start, s.xact_start, s.query_start, s.waiting, s.current_query   +
    FROM pg_database d, pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u+
   WHERE s.datid = d.oid AND s.usesysid = u.oid;
(1 row)

It doesn't help much, although the SELECT list is wrapped within 80
characters. For expressions, I mean:

=# select pg_get_viewdef('v', true);
                                  pg_get_viewdef
------------------------------------------------------------------------------------
  SELECT a.a, a.a - 1 AS b, a.a - 2 AS c, a.a - 3 AS d,                            +
     a.a / 1 AS long_long_name, a.a * 1000 AS bcd,                                  +
     CASE                                                                           +
         WHEN a.a = 1 THEN 1                                                        +
         WHEN (a.a + a.a + a.a + a.a + a.a + a.a + a.a + a.a + a.a + a.a + a.a + a.a) = 2 THEN 2+
         ELSE 10                                                                    +
     END AS what_about_case_expression                                              +
    FROM generate_series(1, 100) a(a);
(1 row)

So my conclusion is that it's better than nothing, but we could do a better
job here. From a timeline perspective, it would be OK to apply this patch and
improve it further in 9.3+.

Thanks,
-- 
Hitoshi Harada



[HACKERS] Review of patch renaming constraints

2012-01-12 Thread Joshua Berkus

Compiling on Ubuntu 10.04 LTS AMD64 on a GoGrid virtual machine from 2012-01-12 
git checkout.

Patch applied fine.

Docs are present, build, look good and are clear.

Changes to gram.y required Bison 2.5 to compile.  Are we requiring Bison 2.5 
now?  There's no configure check for it, so it took me quite a while to figure 
out what was wrong.

Make check passed.  Patch has tests for rename constraint.

Most normal uses of alter table ... rename constraint ... worked normally.  
However, the patch does not deal correctly with constraints which are not 
inherited, such as primary key constraints:

create table master ( category text not null, status int not null, value text );

alter table master add constraint master_key primary key ( category, status );

alter table master rename constraint master_key to master_primary_key;

create table partition_1 () inherits ( master );

create table partition_2 () inherits ( master );

alter table master rename constraint master_primary_key to master_key;

postgres=# alter table master rename constraint master_primary_key to 
master_key;
ERROR:  constraint "master_primary_key" for table "partition_1" does not exist
STATEMENT:  alter table master rename constraint master_primary_key to 
master_key;
ERROR:  constraint "master_primary_key" for table "partition_1" does not exist


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
San Francisco



Re: [HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-12 Thread Fujii Masao
On Fri, Jan 13, 2012 at 4:19 AM, Simon Riggs  wrote:
> On Thu, Jan 12, 2012 at 10:37 AM, Fujii Masao  wrote:
>> On Thu, Jan 12, 2012 at 5:53 PM, Simon Riggs  wrote:
>>> PrimaryKeepaliveMessage is a message type that uses WalSndrMessage.
>>> That message type is only sent when the WalSndr is quiet, so what is
>>> the difference, in that case?
>>
>> Oh, you are right. There is no difference.
>
> Here are the changes we discussed. Further comments before commit?

Can you add a check to avoid useless calls of GetReplicationApplyDelay()
and GetReplicationTransferLatency()?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] log messages for archive recovery progress

2012-01-12 Thread Fujii Masao
On Fri, Jan 13, 2012 at 12:58 AM, Satoshi Nagayasu  wrote:
> Anyway, how about this one?
>
> If we have 47 in archive, and 48 in pg_xlog,
>
> (1) LOG: restored log file "00080047" from archive
> (2) LOG: replaying log file "00080047"
> (3) LOG: could not restore file "00080048" from archive
> (4) LOG: replaying log file "00080048"
>
> In this case, "(4) replaying" after "(3) could not restore from archive"
> would mean that it has started replaying a WAL file from pg_xlog.

Looks confusing. That way, you always need to look at more than one message
to understand the meaning of a single message, which doesn't seem like good
design. What if lots of log files exist in pg_xlog and you get the following
message? To figure out where log file 98 comes from, you have to find (3)
among lots of log messages.

LOG: replaying log file "00080098"

What about the following simpler way?

(1) LOG: restored log file "00080047" from archive
(2) LOG: replaying log file "00080047" from archive
(4) LOG: replaying log file "00080048" from pg_xlog

In the file-based log-shipping case with standby_mode=on, restoring from the
archive is retried periodically, which would flood the log file with messages
like (3). So I don't like emitting messages like (3) at LOG level.

> I just got another option in my mind.
>
> Telling both two numbers of WAL files, from archive and
> pg_xlog directory, those have been applied during archive recovery
> would make sense?
>
> How about this one?
>
> LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully
> applied.

What's the use case of this message?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-12 Thread Greg Smith

On 1/11/12 9:25 AM, Andres Freund wrote:
> The heavy pressure of putting it directly in the writeback queue leads to
> less efficient I/O, because quite often it won't reorder sensibly with other
> I/O anymore, and the like. At least that was my experience using it in
> another application.


Sure, this is one of the things I was cautioning about in the Double 
Writes thread, with VACUUM being the worst such case I've measured.


The thing to realize here is that the data we're talking about must be
flushed to disk in the near future.  And Linux will happily cache gigabytes
of it.  Right now, the database asks for that to be forced to disk via fsync,
which means in chunks that can be as large as a gigabyte.

Let's say we have a traditional storage array and there's competing activity.
10MB/s would be a good random I/O write rate in that situation.  A single
fsync that forces 1GB out at that rate will take *100 seconds*.  And I've
seen close to exactly that in practice: about 80 seconds is my current worst
checkpoint stall ever.


And we don't have a latency vs. throughput knob any finer than that.  If one
is added, and you turn it too far toward latency, throughput is going to tank
for the reasons you've also seen: less reordering, elevator sorting, and
write combining.  If the database isn't going to micro-manage the writes, it
needs to give the OS room to do that work for it.


The most popular OS-level approach to adjusting for this trade-off seems to
be "limit the cache size".  That hasn't worked out very well when I've tried
it, again because there isn't enough working room for queued writes to be
reorganized usefully.  One theory I've considered is that we might improve
the VACUUM side of that using the same auto-tuning approach that's been
applied to two other areas now: scale the maximum size of the ring buffers
based on shared_buffers.  I'm not really confident in that idea though,
because ultimately it won't change the rate at which dirty buffers from
VACUUM are evicted, and that's the source of the bottleneck in that area.


There is one piece of information the database knows but isn't communicating
well to the OS yet.  It could do a better job of advising the OS how to
prioritize the writes that must happen soon, but not necessarily right now.
Yes, forcing them into write-back will be counterproductive from a throughput
perspective.  The longer they sit at the "Dirty" cache level above that, the
better the odds they'll be written out efficiently.  But this is the
checkpoint process we're talking about here.  It's going to force the
information to disk soon regardless.  An intermediate step pushing to
write-back should give the OS a bit more room to move around than fsync does,
making the potential for a latency gain here seem quite real.  We'll see how
the benchmarking goes.
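
For reference, the OS-level call under discussion looks roughly like this on
Linux (a sketch of the interface only, not the checkpointer patch; fd, offset,
and nbytes are placeholders supplied by the caller):

/*
 * Sketch of asking Linux to start write-back on a dirty file range without
 * waiting for completion (and without the durability guarantee of fsync).
 * This only illustrates the syscall being discussed, not the patch itself.
 */
#define _GNU_SOURCE
#include <fcntl.h>

static int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
	/*
	 * SYNC_FILE_RANGE_WRITE initiates write-out of dirty pages in the range
	 * and returns immediately, leaving the I/O scheduler free to reorder and
	 * combine the writes before the eventual fsync().
	 */
	return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}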


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.

2012-01-12 Thread Alex Hunsaker
On Fri, Jan 6, 2012 at 14:05, Tom Lane  wrote:
> Alex Hunsaker  writes:
>> Oh my... I dunno exactly what I was smoking last night, but it's a good
>> thing I didn't share :-). Uh, so my test program was also completely
>> wrong; I'll have to redo it all. I've narrowed it down to:
>>         if ((type == SVt_PVGV || SvREADONLY(sv)))
>>         {
>>             if (type != SVt_PV &&
>>                 type != SVt_NV)
>>             {
>>                 sv = newSVsv(sv);
>>             }
>>        }
>
> Has anyone tried looking at the source code for SvPVutf8 to see exactly
> what cases it fails on?  The fact that there's an explicit croak() call
> makes me think it might not be terribly hard to tell.

Well, it's easy to find the message; it's not so easy to trace it back up
:-). It is Perl source code, after all. It *looks* like it's just:
sv.c:
Perl_sv_pvn_force_flags(SV *sv, STRLEN, I32 flags)
{
 [ Flags is SV_GMAGIC ]
if (SvREADONLY(sv) && !(flags & SV_MUTABLE_RETURN))
   // more or less...
   Perl_croak(aTHX_ "Can't coerce readonly %s to string", ref)

if ((SvTYPE(sv) > SVt_PVLV && SvTYPE(sv) != SVt_PVFM)
|| isGV_with_GP(sv))
Perl_croak(aTHX_ "Can't coerce %s to string in %s",
sv_reftype(sv,0),
}

Given that I added this hunk:
+
+   if (SvREADONLY(sv) ||
+       isGV_with_GP(sv) ||
+       (SvTYPE(sv) > SVt_PVLV && SvTYPE(sv) != SVt_PVFM))
+       sv = newSVsv(sv);
+   else
+       /* increase the reference count so we can just SvREFCNT_dec() it
+        * when we are done */
+       SvREFCNT_inc(sv);

And voila, all of these work (both in 5.14 and 5.8.9, although 5.8.9
gives different notices...)

do language plperl $$ elog(NOTICE, *foo); $$;
NOTICE:  *main::foo
CONTEXT:  PL/Perl anonymous code block

do language plperl $$ elog(NOTICE, $^V); $$;
NOTICE:  v5.14.2
CONTEXT:  PL/Perl anonymous code block

do language plperl $$ elog(NOTICE, ${^TAINT}); $$;
NOTICE:  0
CONTEXT:  PL/Perl anonymous code block

So I've done that in the attached patch. ${^TAINT} seemed to be the
only case that gave consistent notices in 5.8.9 and up so I added it
to the regression tests.

Util.c/.o not depending on plperl_helpers.h was also throwing me for a
loop, so I fixed that, and SPI.c as well...

Thoughts?
*** a/src/pl/plperl/GNUmakefile
--- b/src/pl/plperl/GNUmakefile
***
*** 72,82  perlchunks.h: $(PERLCHUNKS)
  
  all: all-lib
  
! SPI.c: SPI.xs
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
  	$(PERL) $(XSUBPPDIR)/ExtUtils/xsubpp -typemap $(perl_privlibexp)/ExtUtils/typemap $< >$@
  
! Util.c: Util.xs
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
  	$(PERL) $(XSUBPPDIR)/ExtUtils/xsubpp -typemap $(perl_privlibexp)/ExtUtils/typemap $< >$@
  
--- 72,82 
  
  all: all-lib
  
! SPI.c: SPI.xs plperl_helpers.h
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
  	$(PERL) $(XSUBPPDIR)/ExtUtils/xsubpp -typemap $(perl_privlibexp)/ExtUtils/typemap $< >$@
  
! Util.c: Util.xs plperl_helpers.h
  	@if [ x"$(perl_privlibexp)" = x"" ]; then echo "configure switch --with-perl was not specified."; exit 1; fi
  	$(PERL) $(XSUBPPDIR)/ExtUtils/xsubpp -typemap $(perl_privlibexp)/ExtUtils/typemap $< >$@
  
*** a/src/pl/plperl/expected/plperl_elog.out
--- b/src/pl/plperl/expected/plperl_elog.out
***
*** 58,60  select uses_global();
--- 58,62 
   uses_global worked
  (1 row)
  
+ -- make sure we don't choke on readonly values
+ do language plperl $$ elog('NOTICE', ${^TAINT}); $$;
*** a/src/pl/plperl/plperl_helpers.h
--- b/src/pl/plperl/plperl_helpers.h
***
*** 47,74  sv2cstr(SV *sv)
  {
  	char	   *val, *res;
  	STRLEN		len;
- 	SV *nsv;
  
  	/*
  	 * get a utf8 encoded char * out of perl. *note* it may not be valid utf8!
  	 *
  	 * SvPVutf8() croaks nastily on certain things, like typeglobs and
  	 * readonly objects such as $^V. That's a perl bug - it's not supposed to
! 	 * happen. To avoid crashing the backend, we make a copy of the
! 	 * sv before passing it to SvPVutf8(). The copy is garbage collected 
  	 * when we're done with it.
  	 */
! 	nsv = newSVsv(sv);
! 	val = SvPVutf8(nsv, len);
  
  	/*
  	 * we use perl's length in the event we had an embedded null byte to ensure
  	 * we error out properly
  	 */
! 	res =  utf_u2e(val, len);
  
  	/* safe now to garbage collect the new SV */
! 	SvREFCNT_dec(nsv);
  
  	return res;
  }
--- 47,81 
  {
  	char	   *val, *res;
  	STRLEN		len;
  
  	/*
  	 * get a utf8 encoded char * out of perl. *note* it may not be valid utf8!
  	 *
  	 * SvPVutf8() croaks nastily on certain things, like typeglobs and
  	 * readonly objects such as $^V. That's a perl bug - it's not supposed to
! 	 * happen. To avoid crashing the backend, we make a copy of the sv before
! 	 * passing it to SvPVutf8().

Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> Hmm, I would think you'd get assertion failures from calling
> HeapTupleHeaderGetCmax when xmax isn't the current transaction.
> (But I'm not sure that the regression tests really exercise such
> cases ... did you try the isolation tests with this?)  I was
> thinking we should probably define the cmax as being returned only
> in SelfUpdated cases.
 
You were right.  One of the isolation tests failed an assertion;
things pass with the attached change, setting the cmax
conditionally.  Some comments updated.  I hope this is helpful.  I
can't take more time right now, because we're getting major snowfall
and I've got to leave now to get home safely.
 
-Kevin

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 2317,2324  simple_heap_insert(Relation relation, HeapTuple tup)
   *
   *relation - table to be modified (caller must hold suitable lock)
   *tid - TID of tuple to be deleted
!  *ctid - output parameter, used only for failure case (see below)
!  *update_xmax - output parameter, used only for failure case (see below)
   *cid - delete command ID (used for visibility test, and stored into
   *cmax if successful)
   *crosscheck - if not InvalidSnapshot, also check tuple against this
--- 2317,2323 
   *
   *relation - table to be modified (caller must hold suitable lock)
   *tid - TID of tuple to be deleted
!  *hufd - output structure, used only for failure case (see below)
   *cid - delete command ID (used for visibility test, and stored into
   *cmax if successful)
   *crosscheck - if not InvalidSnapshot, also check tuple against this
***
*** 2330,2343  simple_heap_insert(Relation relation, HeapTuple tup)
   * (the last only possible if wait == false).
   *
   * In the failure cases, the routine returns the tuple's t_ctid and t_xmax.
   * If t_ctid is the same as tid, the tuple was deleted; if different, the
   * tuple was updated, and t_ctid is the location of the replacement tuple.
   * (t_xmax is needed to verify that the replacement tuple matches.)
   */
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
!   ItemPointer ctid, TransactionId *update_xmax,
!   CommandId cid, Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2329,2343 
   * (the last only possible if wait == false).
   *
   * In the failure cases, the routine returns the tuple's t_ctid and t_xmax.
+  * If HeapTupleSelfUpdated, it also returns the cmax from the tuple.
   * If t_ctid is the same as tid, the tuple was deleted; if different, the
   * tuple was updated, and t_ctid is the location of the replacement tuple.
   * (t_xmax is needed to verify that the replacement tuple matches.)
   */
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
!   HeapUpdateFailureData *hufd, CommandId cid,
!   Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***
*** 2498,2505  l1:
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
!   *ctid = tp.t_data->t_ctid;
!   *update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTuple(relation, &(tp.t_self), ExclusiveLock);
--- 2498,2507 
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
!   hufd->update_ctid = tp.t_data->t_ctid;
!   hufd->update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
!   if (result == HeapTupleSelfUpdated)
!   hufd->update_cmax = HeapTupleHeaderGetCmax(tp.t_data);
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTuple(relation, &(tp.t_self), ExclusiveLock);
***
*** 2631,2643  void
  simple_heap_delete(Relation relation, ItemPointer tid)
  {
HTSU_Result result;
!   ItemPointerData update_ctid;
!   TransactionId update_xmax;
  
result = heap_delete(relation, tid,
!&update_ctid, &update_xmax,
!GetCurrentCommandId(true), 
InvalidSnapshot,
!true /* wait for commit */ );
switch (result)
{
case HeapTupleSelfUpdated:
--- 2633,2643 
  simple_heap_delete(Relation relation, ItemPointer tid)
  {
HTSU_Result result;
!   HeapUpdateFailureData hu

Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
"Kevin Grittner"  wrote:
>>  I was thinking we should probably define the cmax as being
>> returned only in SelfUpdated cases.
>  
> I thought about that, but didn't see how it could be other than
> self-updated.  If you do, I guess I missed something.
 
Oh, I see it now.  Oops.
 
I can fix that and the comments if you like.
 
-Kevin



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
> "Kevin Grittner"  writes:
 
>> Am I getting closer?
> 
> Hmm, I would think you'd get assertion failures from calling
> HeapTupleHeaderGetCmax when xmax isn't the current transaction.
> (But I'm not sure that the regression tests really exercise such
> cases ... did you try the isolation tests with this?)
 
I didn't go farther than a `make check`, but I'm doing a more
thorough set of tests now.
 
> I was thinking we should probably define the cmax as being
> returned only in SelfUpdated cases.
 
I thought about that, but didn't see how it could be other than
self-updated.  If you do, I guess I missed something.
 
> You failed to update assorted relevant comments, too.  But I can
> take it from here.
 
I can take another pass to polish it if you'd like.  I didn't expect
this to be the last version I posted; I was afraid I might still
have some restructuring to do.
 
-Kevin



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
"Kevin Grittner"  writes:
> OK, I got rid of the parrots and candles and added a structure to
> hold the data returned only on failure.
 
> Am I getting closer?

Hmm, I would think you'd get assertion failures from calling
HeapTupleHeaderGetCmax when xmax isn't the current transaction.
(But I'm not sure that the regression tests really exercise such
cases ... did you try the isolation tests with this?)  I was thinking
we should probably define the cmax as being returned only in SelfUpdated
cases.

You failed to update assorted relevant comments, too.  But I can take
it from here.

regards, tom lane



Re: [HACKERS] ERRCODE_READ_ONLY_SQL_TRANSACTION

2012-01-12 Thread Tom Lane
Simon Riggs  writes:
> Hot Standby returns ERRCODE_READ_ONLY_SQL_TRANSACTION in most cases
> for illegal actions on a standby.

> There are two possible but not normally seen cases that give errors,
> but don't set the correct sqlstate, which makes it difficult to
> diagnose misdirected SQL from more normal SQL problems.

> *Patch corrects this. Thanks to Dimitri for the report.

I don't think I like this patch: you are promoting what are and ought to
be very low-level internal sanity checks into user-facing errors (which
among other things will require translation effort for the messages).
I note that you didn't bother even to adjust the adjacent comments
saying this should never happen.

What I want to know is what code path led to these and why the read-only
ereport did not occur at a far higher level.  If we have gaps in the RO
checking we should be fixing them somewhere else, not band-aiding here.

regards, tom lane



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> You forgot to attach the patch, but the approach seems totally
> Rube Goldberg to me anyway.  Why not just fix heap_update/
> heap_delete to return the additional information?  It's not like
> we don't whack their parameter lists around regularly.
> 
> Rather than having three output parameters to support the case,
> I'm a bit inclined to merge them into a single-purpose struct
> type.  But that's mostly cosmetic.
 
OK, I got rid of the parrots and candles and added a structure to
hold the data returned only on failure.
 
Am I getting closer?
 
-Kevin

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 2336,2343  simple_heap_insert(Relation relation, HeapTuple tup)
   */
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
!   ItemPointer ctid, TransactionId *update_xmax,
!   CommandId cid, Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2336,2343 
   */
  HTSU_Result
  heap_delete(Relation relation, ItemPointer tid,
!   HeapUpdateFailureData *hufd, CommandId cid,
!   Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***
*** 2498,2505  l1:
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
!   *ctid = tp.t_data->t_ctid;
!   *update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTuple(relation, &(tp.t_self), ExclusiveLock);
--- 2498,2506 
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(tp.t_data->t_infomask & HEAP_XMAX_INVALID));
!   hufd->update_ctid = tp.t_data->t_ctid;
!   hufd->update_xmax = HeapTupleHeaderGetXmax(tp.t_data);
!   hufd->update_cmax = HeapTupleHeaderGetCmax(tp.t_data);
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTuple(relation, &(tp.t_self), ExclusiveLock);
***
*** 2631,2643  void
  simple_heap_delete(Relation relation, ItemPointer tid)
  {
HTSU_Result result;
!   ItemPointerData update_ctid;
!   TransactionId update_xmax;
  
result = heap_delete(relation, tid,
!&update_ctid, &update_xmax,
!GetCurrentCommandId(true), 
InvalidSnapshot,
!true /* wait for commit */ );
switch (result)
{
case HeapTupleSelfUpdated:
--- 2632,2642 
  simple_heap_delete(Relation relation, ItemPointer tid)
  {
HTSU_Result result;
!   HeapUpdateFailureData hufd;
  
	result = heap_delete(relation, tid,
!						 &hufd,
!						 GetCurrentCommandId(true),
!						 InvalidSnapshot,
!						 true /* wait for commit */ );
switch (result)
{
case HeapTupleSelfUpdated:
***
*** 2693,2700  simple_heap_delete(Relation relation, ItemPointer tid)
   */
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
!   ItemPointer ctid, TransactionId *update_xmax,
!   CommandId cid, Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2692,2699 
   */
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
!   HeapUpdateFailureData *hufd, CommandId cid,
!   Snapshot crosscheck, bool wait)
  {
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***
*** 2873,2880  l2:
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
!   *ctid = oldtup.t_data->t_ctid;
!   *update_xmax = HeapTupleHeaderGetXmax(oldtup.t_data);
UnlockReleaseBuffer(buffer);
if (have_tuple_lock)
UnlockTuple(relation, &(oldtup.t_self), ExclusiveLock);
--- 2872,2880 
   result == HeapTupleUpdated ||
   result == HeapTupleBeingUpdated);
Assert(!(oldtup.t_data->t_infomask & HEAP_XMAX_INVALID));
!   hufd->update_ctid = oldtup.t_data->t_ctid;
!   hufd->update_xmax = HeapTupleHeaderGetXmax(oldtup.t_d

[HACKERS] ERRCODE_READ_ONLY_SQL_TRANSACTION

2012-01-12 Thread Simon Riggs
Hot Standby returns ERRCODE_READ_ONLY_SQL_TRANSACTION in most cases
for illegal actions on a standby.

There are two possible, though not normally seen, cases that give errors but
don't set the correct sqlstate, which makes it difficult to distinguish
misdirected SQL from more ordinary SQL problems.

Patch corrects this. Thanks to Dimitri for the report.

Backpatching to 9.0

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 8c045fb..62031eb 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -60,7 +60,9 @@ GetNewTransactionId(bool isSubXact)
 
 	/* safety check, we should never get this far in a HS slave */
 	if (RecoveryInProgress())
-		elog(ERROR, "cannot assign TransactionIds during recovery");
+		ereport(ERROR,
+(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+ errmsg("cannot assign TransactionIds during recovery")));
 
 	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..8ac194a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -713,7 +713,9 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
-		elog(ERROR, "cannot make new WAL entries during recovery");
+		ereport(ERROR,
+(errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+ errmsg("cannot make new WAL entries during recovery")));
 
 	/* info's high bits are reserved for use by me */
 	if (info & XLR_INFO_MASK)



[HACKERS] rewriteheap.c bug: toast rows don't get XIDs matching their parents

2012-01-12 Thread Tom Lane
While working on bug #6393 I was reminded of the truth of $SUBJECT: any
rows inserted into the new toast table will have the xmin of the CLUSTER
or VACUUM FULL operation, and invalid xmax, whereas their parent heap
rows will have xmin/xmax copied from the previous instance of the table.
This does not matter much for ordinary live heap rows, but it's also
necessary for CLUSTER/VACUUM FULL to copy recently-dead,
insert-in-progress, and delete-in-progress rows.  In such cases, a later
plain VACUUM might reap the parent heap rows and not the toast rows,
leading to a storage leak that won't be recovered short of another
CLUSTER/VACUUM FULL.

I can't remember if we discussed this risk when the heap rewrite code
was written.  I'm not sure it's worth fixing, but at the least it ought
to be documented in the comments in rewriteheap.c.

regards, tom lane



Re: [HACKERS] WIP -- renaming implicit sequences

2012-01-12 Thread Thomas Munro
On 12 January 2012 00:58, Tom Lane  wrote:
> Thomas Munro  writes:
>> Here is an unfinished patch to implement something which appears on
>> the TODO list under ALTER: automatic renaming of sequences created
>> with serial when the table and column names change.  I've often wanted
>> this feature and it seemed like a good starter project.
>
> Hmm ... this seems a bit inconsistent with the fact that we got rid of
> automatic renaming of indexes a year or three back.  Won't renaming of
> serials have all the same problems that caused us to give up on renaming
> indexes?

Ah.  I assume this is it:

http://archives.postgresql.org/pgsql-committers/2009-12/msg00209.php

I used ChooseRelationName to generate a new unique name during
transformation.  Presumably the implicit ALTER SEQUENCE will fail at
execution if someone manages to nab the name after
ChooseRelationName's get_relname_relid check.

So with the patch, ALTER TABLE RENAME with serial columns has the same
concurrency caveat as CREATE TABLE with serial columns, which I hoped
was reasonable or at least symmetrical.



Re: [HACKERS] measuring spinning

2012-01-12 Thread Robert Haas
On Thu, Jan 12, 2012 at 4:00 AM, Simon Riggs  wrote:
> Please can you repeat the test, focusing on minutes 10-30 of a 30
> minute test run. That removes much of the noise induced during cache
> priming.
>
> My suggested size of database is one that is 80% size of RAM, with
> shared_buffers set to 40% of RAM or 2GB whichever is higher. That
> represents the common case where people know how big their data is and
> purchase RAM accordingly, then set shared_buffers in line with current
> wisdom.

OK, first, let me remember to attach the patch this time, since as
Thom pointed out, I didn't do that before.

Second, those guidelines sound quite different from what I've heard
before, which is 25% of RAM up to a *maximum* of 8GB.  This machine
has 383GB of RAM so 40% of memory would be shared_buffers = 153GB,
which is a lot bigger than I've heard recommended in the past.  So I
could potentially do that test, but I'm a bit surprised by the
proposed configuration.  So far I have been focusing mostly on pgbench
at scale factor 100, because that fits into 8GB of shared_buffers.
Using a larger working set that does not fit in shared_buffers will
expose new bottlenecks, but I've found it's most useful to me to try
to isolate the effect of different bottlenecks, which means running
the simplest workload that demonstrates at least one meaningful
problem.  Anyway, please let me know your thoughts on this.

On x86, we are apparently, getting to the point where things scale
pretty well.  On Nate Boley's 32-core machine (which is still busy
with other work, so I can't do any more tests there right now),
pgbench on  unlogged tables at scale factor 100 scales nearly linearly
all the way out to 32 clients.  On permanent tables, it is
bottlenecked by WALInsertLock and throughput is roughly halved.  On
the other hand, on Itanium, we don't scale even on a read-only
workload.  It is not yet clear to me why, but part of the problem may
be that spinlock contention seems to be much more damaging on Itanium
than it is on x86, and we have way too much of it.

I ran some 30-minute tests and here are the top spinners.  I have not
isolated the first 10 minutes from the remainder of the tests,
although that could possibly be done.  I think I'd need to start the
server, run pgbench for 10 minutes, add a marker line to the log,
restart pgbench for another 20 minutes, and then only process the log
data after the marker point.  Doable, but I haven't done it yet, and
I'm not convinced the results would be much different.  I think
latency would be a more interesting thing to graph against time, but I
haven't gotten there yet either.

SELECT-only test:

lwlock 34: shacq 29192894 exacq 34021 blk 47 spin 6468
lwlock 45: shacq 29701770 exacq 34235 blk 42 spin 7393
lwlock 40: shacq 30353555 exacq 35257 blk 45 spin 7585
lwlock 33: shacq 34585153 exacq 35175 blk 61 spin 9450
lwlock 41: shacq 36603119 exacq 35135 blk 50 spin 10231
lwlock 48: shacq 41095856 exacq 35196 blk 55 spin 10723
lwlock 42: shacq 36566360 exacq 34619 blk 65 spin 11171
lwlock 47: shacq 40809540 exacq 34976 blk 77 spin 12168
lwlock 43: shacq 38972147 exacq 35256 blk 48 spin 12253
lwlock 35: shacq 41278265 exacq 35171 blk 71 spin 13023
lwlock 44: shacq 42915161 exacq 35296 blk 54 spin 13704
lwlock 46: shacq 45259281 exacq 35113 blk 69 spin 14633
lwlock 39: shacq 40758171 exacq 35174 blk 52 spin 15125
lwlock 38: shacq 44954521 exacq 34629 blk 70 spin 17291
lwlock 37: shacq 45250155 exacq 34834 blk 69 spin 17422
lwlock 328934: shacq 205772443 exacq 0 blk 0 spin 372594
lwlock 4: shacq 205775132 exacq 356 blk 162 spin 439076
lwlock 36: shacq 246651001 exacq 34587 blk 392 spin 508843

Unlogged tables:

lwlock 397680: shacq 27253037 exacq 18 blk 6 spin 7130
lwlock 328724: shacq 28593420 exacq 136 blk 17 spin 8417
lwlock 7: shacq 0 exacq 27803235 blk 713207 spin 10802
lwlock 3: shacq 430 exacq 27801696 blk 519648 spin 11206
lwlock 328942: shacq 56285945 exacq 0 blk 0 spin 39198
lwlock 46: shacq 6309 exacq 47931 blk 1677 spin 39419
lwlock 45: shacq 63003036 exacq 48607 blk 1669 spin 39541
lwlock 40: shacq 65456415 exacq 48770 blk 1751 spin 44276
lwlock 33: shacq 68438336 exacq 48371 blk 1882 spin 45955
lwlock 35: shacq 68170010 exacq 48193 blk 1869 spin 47024
lwlock 39: shacq 65045489 exacq 47748 blk 1798 spin 47246
lwlock 41: shacq 67985476 exacq 48075 blk 1778 spin 48409
lwlock 37: shacq 69042800 exacq 48529 blk 1854 spin 49465
lwlock 34: shacq 71838042 exacq 48021 blk 1816 spin 51078
lwlock 38: shacq 69249910 exacq 48673 blk 1861 spin 51443
lwlock 36: shacq 84116654 exacq 48646 blk 1832 spin 52778
lwlock 44: shacq 74827084 exacq 48039 blk 1934 spin 54133
lwlock 42: shacq 76580702 exacq 47829 blk 2045 spin 60652
lwlock 48: shacq 95540241 exacq 48460 blk 2295 spin 69160
lwlock 47: shacq 104968331 exacq 48252 blk 2998 spin 101368
lwlock 11: shacq 103820073 exacq 28055301 blk 5097966 spin 138136
lwlock 43: shacq 128230139 exacq 48940 blk 3785 spin 168432
lwlock 4: shacq 264853747 exacq 2780

Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Heikki Linnakangas

On 12.01.2012 22:04, Tom Lane wrote:
> Simon Riggs  writes:
>> On Thu, Jan 12, 2012 at 5:32 PM, Tom Lane  wrote:
>>> More like "\once ... any SQL command or meta command here ..."
>>> if we want to extend the scripting language.  But I'd be perfectly happy
>>> with a command-line switch that specifies a script file to be run once.
>
>> Once per connection, yes?
>
> Hmmm ... good question.  Heikki was speculating about doing CREATE TABLE
> or similar, which you'd want done only once period.

I was creating a separate table for each connection to work with...

> But I see no very
> strong reason why cases like that couldn't be handled outside of
> pgbench.  So yeah, once per connection.

... so that is exactly what I was thinking too.

For things that only need to run once, period, you can just do:

psql -c "CREATE TABLE"; pgbench ...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> You forgot to attach the patch, but the approach seems totally
> Rube Goldberg to me anyway.  Why not just fix heap_update/
> heap_delete to return the additional information?  It's not like
> we don't whack their parameter lists around regularly.
 
OK.
 
> Rather than having three output parameters to support the case,
> I'm a bit inclined to merge them into a single-purpose struct
> type.  But that's mostly cosmetic.
 
OK.
 
Working on v5.
 
-Kevin



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> it needs to check the tuple's cmax [...]  And that means the patch
>> will be a bit more invasive than this, because heap_update and
>> heap_delete don't return that information at present.
 
> I'm thinking that I could keep the test for:
 
>   GetCurrentCommandId(false) != estate->es_output_cid
 
> as a "first pass".  If that's true, I could use EvalPlanQualFetch()
> to find the last version of the tuple, and generate the error if the
> tuple's cmax != estate->es_output_cid.  I think, although I'm not
> entirely sure, that EvalPlanQualFetch needs a little work to support
> this usage.
 
> Attached is a patch based on these thoughts.  Is it on the right
> track?  I suspect I haven't got everything covered, but thought a
> reality check was in order at this point.

You forgot to attach the patch, but the approach seems totally Rube
Goldberg to me anyway.  Why not just fix heap_update/heap_delete to
return the additional information?  It's not like we don't whack their
parameter lists around regularly.

Rather than having three output parameters to support the case, I'm
a bit inclined to merge them into a single-purpose struct type.
But that's mostly cosmetic.
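
For concreteness, such a single-purpose struct might look like the sketch
below, based on the fields already passed back in the patch upthread (the
name and exact layout are illustrative, not final):

#include "postgres.h"

#include "storage/itemptr.h"

/*
 * Sketch only: one struct bundling the values that heap_update()/
 * heap_delete() currently hand back through separate output parameters in
 * their failure cases.  The field names follow the patch upthread; the
 * actual definition is up to the final patch.
 */
typedef struct HeapUpdateFailureData
{
	ItemPointerData update_ctid;	/* t_ctid of the tuple that blocked us */
	TransactionId	update_xmax;	/* xmax of that tuple */
	CommandId		update_cmax;	/* cmax, for the self-updated case */
} HeapUpdateFailureData;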

regards, tom lane



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
"Kevin Grittner"  wrote:
 
> Attached is a patch based on these thoughts.
 
OK, really attached this time.
 
-Kevin

*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***
*** 1921,1928  EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
- 				/* treat it as deleted; do not process */
  				ReleaseBuffer(buffer);
  				return NULL;
  
  			case HeapTupleMayBeUpdated:
--- 1921,1936 
  		switch (test)
  		{
  			case HeapTupleSelfUpdated:
  				ReleaseBuffer(buffer);
+ 				if (!ItemPointerEquals(&update_ctid, &tuple.t_self))
+ 				{
+ 					/* it was updated, so look at the updated version */
+ 					tuple.t_self = update_ctid;
+ 					/* updated row should have xmin matching this xmax */
+ 					priorXmax = update_xmax;
+ 					continue;
+ 				}
+ 				/* treat it as deleted; do not process */
  				return NULL;
  
  			case HeapTupleMayBeUpdated:
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***
*** 354,359  ldelete:;
--- 354,398 
  		switch (result)
  		{
  			case HeapTupleSelfUpdated:
+ 				/*
+ 				 * There is no sensible action to take if the BEFORE DELETE
+ 				 * trigger for a row issues an UPDATE for the same row, either
+ 				 * directly or by performing DML which fires other triggers
+ 				 * which do the update.  We don't want to discard the original
+ 				 * DELETE while keeping the triggered actions based on its
+ 				 * deletion; and it would be no better to allow the original
+ 				 * DELETE while discarding some of its triggered actions.
+ 				 * Updating the row which is being deleted risks losing some
+ 				 * information which might be important according to business
+ 				 * rules; so throwing an error is the only safe course.
+ 				 *
+ 				 * We want to be careful not to trigger this for join/updates
+ 				 * which join to the same row more than once, so we check
+ 				 * whether the tuple was expired by a command other than the
+ 				 * one which initially fired the trigger.  If it is, the
+ 				 * current command ID must have changed, so do that check
+ 				 * first, because it is fast.  If it has changed, chase down
+ 				 * to the last version of the row to see if it was changed by
+ 				 * a subsequent command.
+ 				 */
+ 				if (GetCurrentCommandId(false) != estate->es_output_cid)
+ 				{
+ 					HeapTuple	copyTuple;
+ 
+ 					copyTuple = EvalPlanQualFetch(estate,
+ 												  resultRelationDesc,
+ 												  LockTupleExclusive,
+ 												  &update_ctid,
+ 												  update_xmax);
+ 					if (copyTuple == NULL ||
+ 						HeapTupleHeaderGetCmax(copyTuple->t_data) != estate->es_output_cid)
+ 					{
+ 						ereport(ERROR,
+ 								(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+ 								 errmsg("cannot update or delete a row from its BEFORE DELETE trigger"),

Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> it needs to check the tuple's cmax [...]  And that means the patch
> will be a bit more invasive than this, because heap_update and
> heap_delete don't return that information at present.
 
I'm thinking that I could keep the test for:
 
  GetCurrentCommandId(false) != estate->es_output_cid
 
as a "first pass".  If that's true, I could use EvalPlanQualFetch()
to find the last version of the tuple, and generate the error if the
tuple's cmax != estate->es_output_cid.  I think, although I'm not
entirely sure, that EvalPlanQualFetch needs a little work to support
this usage.
 
Attached is a patch based on these thoughts.  Is it on the right
track?  I suspect I haven't got everything covered, but thought a
reality check was in order at this point.
 
It does pass regression tests, including the new ones I added.
 
-Kevin



Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Tom Lane
Simon Riggs  writes:
> On Thu, Jan 12, 2012 at 5:32 PM, Tom Lane  wrote:
>> More like "\once ... any SQL command or meta command here ..."
>> if we want to extend the scripting language.  But I'd be perfectly happy
>> with a command-line switch that specifies a script file to be run once.

> Once per connection, yes?

Hmmm ... good question.  Heikki was speculating about doing CREATE TABLE
or similar, which you'd want done only once period.  But I see no very
strong reason why cases like that couldn't be handled outside of
pgbench.  So yeah, once per connection.

regards, tom lane



Re: [HACKERS] create index regression fail

2012-01-12 Thread Alvaro Herrera

Excerpts from Jaime Casanova's message of Thu Jan 12 16:22:17 -0300 2012:
> On Thu, Jan 12, 2012 at 1:59 PM, Tom Lane  wrote:
> > Jaime Casanova  writes:
> >> the query where the regression fails is:
> >
> >> SELECT count(*) FROM dupindexcols
> >>   WHERE f1 > 'LX' and id < 1000 and f1 ~<~ 'YX';
> >
> >> my first theory was that it was because of some locale issue, since mine
> >> is es_EC.UTF-8, but the content of the table doesn't justify that,
> >
> > [ experiments... ]  Looks like you're wrong about that.  In es_EC locale
> > on my machine, the test accepts these rows that are not accepted in C
> > locale:
> >
> >  223 | LLKAAA
> >  738 | LLEAAA
> >
> 
> Oh! I remember now a thread where we talked about glibc not doing the
> right thing about this:
> http://archives.postgresql.org/message-id/20081215160148.gf4...@alvh.no-ip.org
> 
> An update to that post: at that time CH and LL were independent letters
> but were sorted as if they weren't; now they have both been demoted to
> combinations of two letters.

Yeah, some of these are pretty recent developments by the Real Academia de la
Lengua; I guess glibc hasn't gotten word about it 100% yet.  On the other
hand, I wonder if es_EC is different from other Spanish locales for some
reason ...

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
> Florian Pflug  writes:
>> On Jan12, 2012, at 17:30 , Tom Lane wrote:
>>> Actually, on reflection there might be a reason for checking
>>> update_ctid, with a view to allowing "harmless" cases.
> 
>> I've argued against that in the past, and I still think it's a
>> bad idea.
> 
> Well, the main argument for it in my mind is that we could avoid
> breaking existing behavior in cases where it's not clearly
> essential to do so.  Perhaps that's less compelling than keeping
> the behavior simple, since it's true that the specific cases we
> could preserve the old behavior in are pretty infrequent.  I'm not
> wedded to the idea, but would like to hear a few other peoples'
> opinions.
 
FWIW, I started from the position that the "harmless" cases should
be quietly handled.  But Florian made what, to me, were persuasive
arguments against that.  A DELETE trigger might fire twice for one
delete, mucking up data integrity, for example.  His proposal seemed
to make my use case hard to handle, but then he pointed out how
triggers could be coded to allow that -- it's just that you would
need to go out of your way to do so, reducing the chance of falling
into bad behavior accidentally.
 
So count me as a convert to Florian's POV, even if I didn't succeed
in ripping out all the code from my former POV from the v3 patch.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] create index regression fail

2012-01-12 Thread Jaime Casanova
On Thu, Jan 12, 2012 at 1:59 PM, Tom Lane  wrote:
> Jaime Casanova  writes:
>> the query where the regression fails is:
>
>> SELECT count(*) FROM dupindexcols
>>   WHERE f1 > 'LX' and id < 1000 and f1 ~<~ 'YX';
>
>> my first theory was that it was because of some locale issue, because mine is
>> es_EC.UTF-8, but the content of the table doesn't justify that,
>
> [ experiments... ]  Looks like you're wrong about that.  In es_EC locale
> on my machine, the test accepts these rows that are not accepted in C
> locale:
>
>  223 | LLKAAA
>  738 | LLEAAA
>

oh! I remember now a thread where we talked about glibc not doing the
right thing about this:
http://archives.postgresql.org/message-id/20081215160148.gf4...@alvh.no-ip.org

an update to that post: at that time CH and LL were independent letters
but sorted as if they weren't; now they have both been demoted to
combinations of two letters.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 5:32 PM, Tom Lane  wrote:

> More like "\once ... any SQL command or meta command here ..."
> if we want to extend the scripting language.  But I'd be perfectly happy
> with a command-line switch that specifies a script file to be run once.

Once per connection, yes?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 10:37 AM, Fujii Masao  wrote:
> On Thu, Jan 12, 2012 at 5:53 PM, Simon Riggs  wrote:
>> PrimaryKeepaliveMessage is a message type that uses WalSndrMessage.
>> That message type is only sent when the WalSndr is quiet, so what is
>> the difference, in that case?
>
> Oh, you are right. There is no difference.

Here are the changes we discussed. Further comments before commit?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ee59571..590e4a1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -750,7 +750,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
-	elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d",
+	elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d ms transfer latency %d ms",
 	timestamptz_to_str(sendTime),
 	timestamptz_to_str(lastMsgReceiptTime),
 	GetReplicationApplyDelay(),
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3598e56..3611713 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -834,7 +834,12 @@ WalSndLoop(void)
 			if (pq_is_send_pending())
 				wakeEvents |= WL_SOCKET_WRITEABLE;
 			else
+			{
 				WalSndKeepalive(output_message);
+				/* Try to flush pending output to the client */
+				if (pq_flush_if_writable() != 0)
+					break;
+			}
 
 			/* Determine time until replication timeout */
 			if (replication_timeout > 0)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] create index regression fail

2012-01-12 Thread Tom Lane
Jaime Casanova  writes:
> the query where the regression fails is:

> SELECT count(*) FROM dupindexcols
>   WHERE f1 > 'LX' and id < 1000 and f1 ~<~ 'YX';

> my first theory was that it was because of some locale issue, because mine is
> es_EC.UTF-8, but the content of the table doesn't justify that,

[ experiments... ]  Looks like you're wrong about that.  In es_EC locale
on my machine, the test accepts these rows that are not accepted in C
locale:

  223 | LLKAAA
  738 | LLEAAA

I'll change the lower bound to 'MA' and see if that's any more portable.
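For illustration, the difference is visible directly with COLLATE (just a
sketch; the exact collation name, e.g. "es_EC.utf8", is platform-dependent
and assumed to be installed):

    SELECT 'LLKAAA' > 'LX' COLLATE "C"          AS in_c,
           'LLKAAA' > 'LX' COLLATE "es_EC.utf8" AS in_es_ec;

Per the behaviour observed above, the first comparison is false and the
second is true.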

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
Florian Pflug  writes:
> On Jan12, 2012, at 17:30 , Tom Lane wrote:
>> Actually, on reflection there might be a reason for checking
>> update_ctid, with a view to allowing "harmless" cases.

> I've argued against that in the past, and I still think it's a bad idea.

Well, the main argument for it in my mind is that we could avoid
breaking existing behavior in cases where it's not clearly essential
to do so.  Perhaps that's less compelling than keeping the behavior
simple, since it's true that the specific cases we could preserve the old
behavior in are pretty infrequent.  I'm not wedded to the idea, but
would like to hear a few other peoples' opinions.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] create index regression fail

2012-01-12 Thread Jaime Casanova
Hi,

In current HEAD, create index regression is failing (at least here).
Is anyone else seeing this?
Attached regression.diffs

the query where the regression fails is:

SELECT count(*) FROM dupindexcols
  WHERE f1 > 'LX' and id < 1000 and f1 ~<~ 'YX';

my first theory was that it was because of some locale issue, because mine is
es_EC.UTF-8, but the content of the table doesn't justify that;
my second theory was that it was because of the index-only scan it does,
but turning off index-only scans didn't give a different answer.

this test case was added in commit
e2c2c2e8b1df7dfdb01e7e6f6191a569ce3c3195 to "Improve planner's
handling of duplicated index column expressions", so either this commit
broke something or the count in the expected result is wrong.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


regression.diffs
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-12 Thread Josh Berkus

> Many people clearly do think this is useful.

It also comes under the heading of "avoiding surprising behavior".  That
is, users instinctively expect to be able to LISTEN on standbys, and are
surprised when they can't.

> I personally don't think it will be that complex. I'm willing to
> review and maintain it if the patch works the way we want it to.
> 

I think we need some performance testing for the review for it to be valid.

1) How does this patch affect the speed and throughput of LISTEN/NOTIFY
on a standalone server?

2) Can we actually attach more LISTENers to multiple standbys than we
could to a single Master?

Unfortunately, I don't have an application which can LISTEN in a way
which doesn't eclipse any differences in throughput or response time we
would see on the DB side.  Does anyone?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Tom Lane
Simon Riggs  writes:
> On Thu, Jan 12, 2012 at 4:24 PM, Tom Lane  wrote:
>> I don't believe that works for multiple \set commands, which is the
>> more likely use-case for this; as noted upthread, executing SET here
>> is quite unnecessary since you can get that behavior with
>> "export PGOPTIONS".

> OK, so you want...

> \setonce 

More like "\once ... any SQL command or meta command here ..."
if we want to extend the scripting language.  But I'd be perfectly happy
with a command-line switch that specifies a script file to be run once.

A difficulty with \once is that it's not very clear what should happen
in the case of multiple scripts (multiple -f switches).  Should we go
through all of them at startup looking for \once switches?  Or should
it happen the first time a given script is selected for execution?
And do conflicting \once \sets in different scripts affect each other?
If we go with a command-line switch then the confusion goes away, as it
is then clear that the start script's settings should be inherited by
all the client tasks.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Florian Pflug
On Jan12, 2012, at 17:30 , Tom Lane wrote:
> Actually, on reflection there might be a reason for checking
> update_ctid, with a view to allowing "harmless" cases.  I see
> these cases:
> 
> * UPDATE finds a trigger already updated the row: must throw error
> since we can't apply the update.
> 
> * UPDATE finds a trigger already deleted the row: arguably, we could
> let the deletion stand and ignore the update action.

I've argued against that in the past, and I still think it's a bad idea.
The BEFORE UPDATE trigger might have done some actions which aren't valid
now that the row has been deleted. If it actually *is* safe to let a
self-delete take precedence, the BEFORE UPDATE trigger ought to check whether
the row still exists, and return NULL if it doesn't.

> * DELETE finds a trigger already updated the row: must throw error
> since we can't apply the delete.
> 
> * DELETE finds a trigger already deleted the row: arguably, there's
> no reason to complain.

I'm not convinced that is a good idea, either. If we do that, there will
essentially be two BEFORE DELETE trigger invocations for a single deleted
row, which seems wrong. And again - if a BEFORE DELETE trigger *does* deal
with this case nicely, it simply has to check whether the row still exists
before returning non-NULL.

Also, without these exceptions, the behaviour (post-patch) is simply
explained by

  Either don't cause recursive same-row updates from a BEFORE trigger,
  or have your trigger return NULL.

and except for the case of multiple BEFORE triggers on the same table
you can always assume that

  A BEFORE trigger's view of a tuple modification always reflects the
  actual modification that will eventually happen (or an error is
  thrown).

Adding a list of "buts" and "ifs" to these rules has a higher chance of
adding confusion and causing bugs than of actually helping people, IMHO.

Especially since (judging from the number of years the current behaviour
has stood undisputed) these cases seem to arise quite infrequently in
practice.
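
To make that rule concrete, a BEFORE UPDATE trigger that deliberately lets a
self-delete win might look roughly like this (a sketch only; the table
"parent" and its "id" column are hypothetical):

    CREATE OR REPLACE FUNCTION parent_before_update() RETURNS trigger AS $$
    BEGIN
        -- ... actions that may, directly or indirectly, delete this very row ...

        -- Re-check whether the row still exists before returning non-NULL.
        PERFORM 1 FROM parent WHERE id = OLD.id;
        IF NOT FOUND THEN
            RETURN NULL;   -- row already deleted by our own actions; let that stand
        END IF;
        RETURN NEW;        -- row untouched; apply the UPDATE normally
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER parent_before_update
        BEFORE UPDATE ON parent
        FOR EACH ROW EXECUTE PROCEDURE parent_before_update();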

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 4:24 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> On Thu, Jan 12, 2012 at 3:26 PM, Tom Lane  wrote:
>>> This seems rather poorly designed, mainly because there's no reason to
>>> think that a single command would be sufficient.
>
>> It supports multiple commands via multi-statement requests
>> e.g.
>> -x "SET this = on; SET that = off"
>
> I don't believe that works for multiple \set commands, which is the
> more likely use-case for this; as noted upthread, executing SET here
> is quite unnecessary since you can get that behavior with
> "export PGOPTIONS".

OK, so you want...

\setonce 

or?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
I wrote:
>> ... It ought to be comparing the tuple's xmax to
>> es_output_cid.

Sigh, need to go find some caffeine.  Obviously, it needs to check the
tuple's cmax, not xmax.  And that means the patch will be a bit more
invasive than this, because heap_update and heap_delete don't return
that information at present.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> Also, what's the point of testing update_ctid?  I don't see that
>> it matters whether the outdate was a delete or an update.
 
> The update_ctid code was a carry-over from my old, slightly
> different approach, which I failed to change as I should have.  I'll
> fix that along with the other.

Actually, on reflection there might be a reason for checking
update_ctid, with a view to allowing "harmless" cases.  I see
these cases:

* UPDATE finds a trigger already updated the row: must throw error
since we can't apply the update.

* UPDATE finds a trigger already deleted the row: arguably, we could
let the deletion stand and ignore the update action.

* DELETE finds a trigger already updated the row: must throw error
since we can't apply the delete.

* DELETE finds a trigger already deleted the row: arguably, there's
no reason to complain.

Don't know if that was your reasoning as well.  But if it is, then again
the comment needs to cover that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Andrew Dunstan



On 01/12/2012 11:21 AM, Merlin Moncure wrote:

On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan  wrote:

On 01/12/2012 10:44 AM, Pavel Stehule wrote:

this should be enhanced a little bit more to support row arrays - it
can be merged with some routines from the pst tool
http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html

I will be covering composites.

curious, will your function work on unregistered composites?  What
would this do?

select array_to_json(array[row('a', 1), row('b', 2)]);




Expected behaviour is something like this:

   andrew=# select q2json('
  select $$a$$ || x as b,
 y as c,
 array[row(x.*,array[1,2,3]),
   row(y.*,array[4,5,6])] as z
  from generate_series(1,2) x,
   generate_series(4,5) y');

 [{"b":"a1","c":4,"z":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]},
  {"b":"a1","c":5,"z":[{"f1":1,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]},
  {"b":"a2","c":4,"z":[{"f1":2,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]},
  {"b":"a2","c":5,"z":[{"f1":2,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}]

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Pavel Stehule
2012/1/12 Merlin Moncure :
> On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan  wrote:
>> On 01/12/2012 10:44 AM, Pavel Stehule wrote:
>>> this should be enhanced a little bit more to support row arrays - it
>>> can be merged with some routines from the pst tool
>>> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html
>>
>> I will be covering composites.
>
> curious, will your function work on unregistered composites?  What
> would this do?
>
> select array_to_json(array[row('a', 1), row('b', 2)]);

it can do it - but it has to use some defaults for names.
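
Presumably something like this (a sketch of the expected behaviour only, not
of the committed function; f1, f2 are the usual default names for the columns
of an anonymous record):

    SELECT array_to_json(ARRAY[ROW('a', 1), ROW('b', 2)]);
    -- hypothetically: [{"f1":"a","f2":1},{"f1":"b","f2":2}]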

Pavel

>
> merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Tom Lane
Simon Riggs  writes:
> On Thu, Jan 12, 2012 at 3:26 PM, Tom Lane  wrote:
>> This seems rather poorly designed, mainly because there's no reason to
>> think that a single command would be sufficient.

> It supports multiple commands via multi-statement requests
> e.g.
> -x "SET this = on; SET that = off"

I don't believe that works for multiple \set commands, which is the
more likely use-case for this; as noted upthread, executing SET here
is quite unnecessary since you can get that behavior with
"export PGOPTIONS".

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Merlin Moncure
On Thu, Jan 12, 2012 at 9:51 AM, Andrew Dunstan  wrote:
> On 01/12/2012 10:44 AM, Pavel Stehule wrote:
>> this should be enhanced a little bit more to support row arrays - it
>> can be merged with some routines from the pst tool
>> http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html
>
> I will be covering composites.

curious, will your function work on unregistered composites?  What
would this do?

select array_to_json(array[row('a', 1), row('b', 2)]);

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
Florian Pflug  writes:
> So, I guess my question is, if we add safeguards against these sorts of
> bugs for triggers, should we also add them to FOR UPDATE? Historically,
> we seem to have taken the stand that modifications of self-updated tuples
> should be ignored. If we're going to reverse that decision (which I think
> Kevin has argued for quite convincingly), it seems consistent to complain
> about all modifications to self-updated tuples, not only to those involving
> triggers.

I'm not very convinced, except for the specific case of updates caused
by cascaded triggers.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> So what we need to do is check whether the outdate was done by a
> later CommandId than current.  I see that your patch is attempting
> to deal with these issues by testing GetCurrentCommandId(false) !=
> estate->es_output_cid, but that seems completely wrong to me, as
> what it does is complain if *any* additional command has been
> executed in the transaction, regardless of what changed the target
> tuple.  It ought to be comparing the tuple's xmax to
> es_output_cid.  And the comment needs to cover why it's worrying
> about that.
 
OK.  I'll rework based on your comments.
 
> Also, what's the point of testing update_ctid?  I don't see that
> it matters whether the outdate was a delete or an update.
 
The update_ctid code was a carry-over from my old, slightly
different approach, which I failed to change as I should have.  I'll
fix that along with the other.
 
Thanks,
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 3:26 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> New -x option for pgbench executes the given command once after
>> connection of each session.
>
> This seems rather poorly designed, mainly because there's no reason to
> think that a single command would be sufficient.

It supports multiple commands via multi-statement requests
e.g.

-x "SET this = on; SET that = off"


> What would make more sense to me is to add an option for a one-time
> script, or possibly extend the script language to allow marking some
> commands as "do once".

That seems a little overcooked. Any long preparatory script can be
executed before pgbench. So this command is only for options that need
to be set on each database session. If you really do have a long
script that needs to be run for each session, you can set the command
to "SELECT execute_my_initial_function()" and then write your script
as a function.
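
For example, something along these lines (a sketch; the function name and the
settings are hypothetical):

    CREATE OR REPLACE FUNCTION execute_my_initial_function() RETURNS void AS $$
    BEGIN
        -- per-session GUCs
        PERFORM set_config('synchronous_commit', 'off', false);
        -- ... any other per-session preparation (temp tables etc.) ...
    END;
    $$ LANGUAGE plpgsql;

and then run pgbench with -x "SELECT execute_my_initial_function()".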

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Tom Lane
Heikki Linnakangas  writes:
> On 12.01.2012 17:26, Tom Lane wrote:
>> Simon Riggs  writes:
>>> New -x option for pgbench executes the given command once after
>>> connection of each session.

> If it's just for SET, you could just put the GUCs in postgresql.conf.

Or use PGOPTIONS.  I think there might be a use-case for this sort of
thing, but it's going to be more complicated than a SET or two; that's
why I think the feature ought to provide for a script not just a
command.

One particular use-case that comes to mind is executing \set commands
that only need to be done once, not once per iteration.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> ...  All we have to do is start treating
>> HeapTupleSelfUpdated result from heap_update or heap_delete as an
>> error case instead of an okay-do-nothing case.  There doesn't even
>> need to be an explicit check that this was caused by a trigger,
>> because AFAICS there isn't any other possibility.
 
> I think that's pretty much what my previously posted patches did. 
> Here's a slightly modified one, based on Florian's feedback.  Is
> this what you had in mind, or am I misunderstanding?
 
Actually, I was just about to follow up with a comment that that was too
simplistic: it would break the current behavior of join updates/deletes
that join to the same target row more than once.  (There might be an
argument for restricting those, but this discussion isn't about that.)
So what we need to do is check whether the outdate was done by a later
CommandId than current.  I see that your patch is attempting to deal
with these issues by testing GetCurrentCommandId(false) !=
estate->es_output_cid, but that seems completely wrong to me, as what it
does is complain if *any* additional command has been executed in the
transaction, regardless of what changed the target tuple.  It ought to
be comparing the tuple's xmax to es_output_cid.  And the comment needs
to cover why it's worrying about that.

Also, what's the point of testing update_ctid?  I don't see that it
matters whether the outdate was a delete or an update.
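
For reference, the existing behaviour I mean is what you get from a join
update like this (hypothetical tables; several source rows may point at the
same target row):

    UPDATE target t
       SET total = t.total + s.amount
      FROM source s
     WHERE s.target_id = t.id;

The second and later matches for a given target row currently come back as
HeapTupleSelfUpdated and are quietly ignored, so treating that result as an
error unconditionally would start rejecting such statements.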

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-12 Thread Satoshi Nagayasu

2012/01/13 0:13, Robert Haas wrote:

On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs  wrote:

On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas  wrote:

On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs  wrote:

On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu  wrote:


However, I'm a bit afraid that it will confuse DBA if we use
"restored" under the pg_xlog replay context, because we have
already used "restored" that means a WAL file as successfully
"copied" (not "replayed") from archive directory into pg_xlog
directory under the archive recovery context.

So, to determine the status of copying WAL files from
archive directory, I think we can use "restored", or
"could not restore" on failure.

And to determine the status of replaying WAL files
in pg_xlog directory (even if a WAL is copied from archive),
we have to use "recover" or "replay".


Agreed. I can change "restored" to "using", so we have two message types

LOG:  restored log file "00080047" from archive
LOG:  using pre-existing log file "00080047" from pg_xlog


using seems pretty fuzzy to me.  replaying?


That was my first thought, but the message relates to which file has
been selected, and how. Once it has been selected it will be replayed.
The idea is to have the two messages look similar.

The original message was "restored log file..." and says nothing about
replaying.

We could change the old message (ugh! backwards compatibility alert)

  LOG:  replaying log file "00080047" after restore from archive
  LOG:  replaying log file "00080047" already in pg_xlog

which doesn't sound much stronger to me... not sure.


Hmm, I don't know.  But that phrasing does at least have the advantage
of being clearly parallel, which I like.


It seems difficult to keep backward compatibility. :)

Anyway, how about this one?

If we have 47 in archive, and 48 in pg_xlog,

(1) LOG: restored log file "00080047" from archive
(2) LOG: replaying log file "00080047"
(3) LOG: could not restore file "00080048" from archive
(4) LOG: replaying log file "00080048"

In this case, "(4) replying" after "(3) could not restore from archive"
would means that it has started replaying a WAL from pg_xlog.

And if we have 47 in archive, and do not have 48 in pg_xlog,

(5) LOG: restored log file "00080047" from archive
(6) LOG: replaying log file "00080047"
(7) LOG: could not restore file "00080048" from archive
(8) LOG: could not replay file "00080048"

In this case, "(8) could not replay file" after "(7) could not restore
from archive" would means that "48" is not found both archive and
pg_xlog, so that the latest WAL would gone.

Another option just came to mind.

Would it make sense to report both counts of applied WAL files during
archive recovery, i.e. how many came from the archive and how many from
the pg_xlog directory?

How about this one?

LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully applied.


Thanks,
--
Satoshi Nagayasu 
Uptime Technologies, LLC. http://www.uptime.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Heikki Linnakangas

On 12.01.2012 17:26, Tom Lane wrote:

Simon Riggs  writes:

New -x option for pgbench executes the given command once after
connection of each session.


If it's just for SET, you could just put the GUCs in postgresql.conf.


This seems rather poorly designed, mainly because there's no reason to
think that a single command would be sufficient.

What would make more sense to me is to add an option for a one-time
script, or possibly extend the script language to allow marking some
commands as "do once".


Hmm, that might be handy. I wanted to write some tests a while ago 
where each session operated on a separate table. With this, I could've 
put the CREATE TABLE statement in the startup script.
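
Even something as small as this in the startup script would cover that case,
since temp tables are per-session anyway (a sketch; the table layout is
hypothetical):

    CREATE TEMPORARY TABLE scratch (id int, payload text);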


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Andrew Dunstan



On 01/12/2012 10:44 AM, Pavel Stehule wrote:

2012/1/12 Andrew Dunstan:


On 01/12/2012 09:00 AM, Joey Adams wrote:

I wrote an array_to_json function during GSoC 2010:


http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289

It's not exposed as a procedure called array_to_json: it's part of the
to_json function, which decides what to do based on the argument type.



Excellent, this is just the point at which I stopped work last night, so
with your permission I'll steal this and it will save me a good chunk of
time.


this should be enhanced a little bit more to support row arrays - it
can be merged with some routines from the pst tool
http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html




I will be covering composites.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Alvaro Herrera

Excerpts from Tom Lane's message of jue ene 12 12:26:49 -0300 2012:
> Simon Riggs  writes:
> > New -x option for pgbench executes the given command once after
> > connection of each session.
> 
> This seems rather poorly designed, mainly because there's no reason to
> think that a single command would be sufficient.
> 
> What would make more sense to me is to add an option for a one-time
> script, or possibly extend the script language to allow marking some
> commands as "do once".

Maybe use isolation tester spec files as examples; those have blocks for
"setup" as well as "teardown", in addition to whatever commands are to
be run.  It's possible that teardown is never necessary -- who knows?

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Florian Pflug
On Jan12, 2012, at 00:32 , Kevin Grittner wrote:
> Going back through the patches we had to make to 9.0 to move to
> PostgreSQL triggers, I noticed that I let the issues raised as bug
> #6123 lie untouched during the 9.2 development cycle.  In my view,
> the best suggestion for a solution was proposed by Florian here:
> 
> http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php

I've recently encountered a related issue, this time not related to
triggers but to FOR UPDATE.

My code declared a cursor which iterated over pairs of parent and
child rows, and updated the parent with values from the child. I
declared the cursor "FOR UPDATE OF parent", and used "WHERE CURRENT OF"
to update the parent row while iterating over the cursor. The code
looked something like this

DECLARE
  c_parent_child CURSOR FOR
SELECT ... FROM parent JOIN child ON ...
FOR UPDATE OF parent;

BEGIN
  FOR v_row IN c_parent_child LOOP
...
UPDATE parent SET ... WHERE CURRENT OF c_parent_child
  END LOOP

I figured that was all safe and sound, since the cursor's results
should be unaffected by the later UPDATEs - after all, its cmax is
smaller than any of the UPDATEs' command ids. What I didn't take into
account, however, is that "FOR UPDATE OF" will silently skip rows which
have been updated (by the same transaction) after the cursor's snapshot
was established.

Thus, what happened was that things worked fine for parents with only
one child, but for parents with multiple children only the first child
got processed. Once I realized the source of the problem, the fix was
easy: use the parent table's primary key to do the update and drop
the "FOR UPDATE OF" clause.

So, I guess my question is, if we add safeguards against these sorts of
bugs for triggers, should we also add them to FOR UPDATE? Historically,
we seem to have taken the stand that modifications of self-updated tuples
should be ignored. If we're going to reverse that decision (which I think
Kevin has argued for quite convincingly), it seems consistent to complain
about all modifications to self-updated tuples, not only to those involving
triggers.

OTOH, the more cases in which we complain, the larger the chance that we
cause serious issues for people who upgrade to 9.2 (or 9.3, whatever).

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Pavel Stehule
2012/1/12 Andrew Dunstan :
>
>
> On 01/12/2012 09:00 AM, Joey Adams wrote:
>>
>> I wrote an array_to_json function during GSoC 2010:
>>
>>
>> http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289
>>
>> It's not exposed as a procedure called array_to_json: it's part of the
>> to_json function, which decides what to do based on the argument type.
>>
>
>
> Excellent, this is just the point at which I stopped work last night, so
> with your permission I'll steal this and it will save me a good chunk of
> time.
>

this should be enhanced a little bit more to support row arrays - it
can be merged with some routines from the pst tool
http://okbob.blogspot.com/2010/11/new-version-of-pst-collection-is.html

Regards

Pavel

> cheers
>
> andrew
>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
 
> I suggest that the current behavior was designed for the case of
> independent concurrent updates, and you have not made a good
> argument for changing that.  What does make sense to me, in light
> of these examples, is to complain if a BEFORE trigger modifies the
> row "itself" (including indirect updates).  IOW, at conclusion of
> trigger firing (I see no need to do it for each trigger), check to
> see if the target row has been outdated *by the current
> transaction*, and throw error if not.
> 
> And, if you accept the statement of the fix like that, then
> actually there is no performance hit because there is no need to
> introduce new tests.  All we have to do is start treating
> HeapTupleSelfUpdated result from heap_update or heap_delete as an
> error case instead of an okay-do-nothing case.  There doesn't even
> need to be an explicit check that this was caused by a trigger,
> because AFAICS there isn't any other possibility.
 
I think that's pretty much what my previously posted patches did. 
Here's a slightly modified one, based on Florian's feedback.  Is
this what you had in mind, or am I misunderstanding?
 
-Kevin

*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***
*** 354,359  ldelete:;
--- 354,379 
  		switch (result)
  		{
  			case HeapTupleSelfUpdated:
+ 				/*
+ 				 * There is no sensible action to take if the BEFORE DELETE
+ 				 * trigger for a row issues an UPDATE for the same row, either
+ 				 * directly or by performing DML which fires other triggers
+ 				 * which do the update.  We don't want to discard the original
+ 				 * DELETE while keeping the triggered actions based on its
+ 				 * deletion; and it would be no better to allow the original
+ 				 * DELETE while discarding some of its triggered actions.
+ 				 * Updating the row which is being deleted risks losing some
+ 				 * information which might be important according to business
+ 				 * rules; so throwing an error is the only safe course.
+ 				 */
+ 				if (!ItemPointerEquals(tupleid, &update_ctid)
+ 					&& GetCurrentCommandId(false) != estate->es_output_cid)
+ 				{
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+ 							 errmsg("cannot update or delete a row from its BEFORE DELETE trigger"),
+ 							 errhint("Consider moving code to an AFTER DELETE trigger.")));
+ 				}
  				/* already deleted by self; nothing to do */
  				return NULL;
  
***
*** 570,575  lreplace:;
--- 590,613 
  		switch (result)
  		{
  			case HeapTupleSelfUpdated:
+ 				/*
+ 				 * There is no sensible action to take if the BEFORE UPDATE
+ 				 * trigger for a row issues another UPDATE for the same row,
+ 				 * either directly or by performing DML which fires other
+ 				 * triggers which do the update.  We don't want to discard the
+ 				 * original UPDATE while keeping the triggered actions based
+ 				 * on its update; and it would be no better to allow the
+ 				 * original UPDATE while discarding some of its triggered
+ 				 * actions.
+ 				 */
+ 				if (!ItemPointerEquals(tupleid, &update_ctid)
+ 					&& GetCurrentCommandId(false) != estate->es_output_cid)
+ 				{
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+ 							 errmsg("cannot update or delete a row from its BEFORE UPDATE trigger"),
+ 							 errhint("Consider moving code to an AFTER UPDATE trigger.")));
+ 				}
  				/* already deleted by self; nothing to do */
  				return NULL;
  
*** a/src/test/regress/expected/triggers.out
--- b/src/test/regress/expected/triggers.out

Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> While that sounds relatively safe, if possibly performance-
>> impacting, it's not apparent to me how it fixes the problem you
>> complained of.  The triggers you were using were modifying rows
>> other than the one being targeted by the triggering action, so a
>> test like the above would not notice that they'd done anything.
 
> My initial use-case was that a BEFORE DELETE trigger was deleting
> related "child" rows, and the BEFORE DELETE trigger at the child
> level was updating counts on the original (parent) row.  The proposed
> change would cause an error to be thrown when the parent level
> returned a non-NULL value from its BEFORE DELETE trigger.

Ah, I see.

I suggest that the current behavior was designed for the case of
independent concurrent updates, and you have not made a good argument
for changing that.  What does make sense to me, in light of these
examples, is to complain if a BEFORE trigger modifies the row "itself"
(including indirect updates).  IOW, at conclusion of trigger firing
(I see no need to do it for each trigger), check to see if the target
row has been outdated *by the current transaction*, and throw error if
not.

And, if you accept the statement of the fix like that, then actually
there is no performance hit because there is no need to introduce new
tests.  All we have to do is start treating HeapTupleSelfUpdated result
from heap_update or heap_delete as an error case instead of an
okay-do-nothing case.  There doesn't even need to be an explicit check
that this was caused by a trigger, because AFAICS there isn't any other
possibility.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench post-connection command

2012-01-12 Thread Tom Lane
Simon Riggs  writes:
> New -x option for pgbench executes the given command once after
> connection of each session.

This seems rather poorly designed, mainly because there's no reason to
think that a single command would be sufficient.

What would make more sense to me is to add an option for a one-time
script, or possibly extend the script language to allow marking some
commands as "do once".

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Andrew Dunstan



On 01/12/2012 09:00 AM, Joey Adams wrote:

I wrote an array_to_json function during GSoC 2010:

 
http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289

It's not exposed as a procedure called array_to_json: it's part of the
to_json function, which decides what to do based on the argument type.




Excellent, this is just the point at which I stopped work last night, so 
with your permission I'll steal this and it will save me a good chunk of 
time.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-12 Thread Simon Riggs
On Wed, Jan 11, 2012 at 11:38 PM, Tom Lane  wrote:

>> The obvious first use case for this is for cache invalidation.
>
> Yeah, upthread Simon pointed out that propagating notifies would be
> useful for flushing caches in applications that watch the database in a
> read-only fashion.  I grant that such a use-case is technically possible
> within the limitations of a slave server; I'm just dubious that it's a
> sufficiently attractive use-case to justify the complexity and future
> maintenance costs of the sort of designs we are talking about.  Or in
> other words: so far, cache invalidation is not the "first" use-case,
> it's the ONLY POSSIBLE use-case.  That's not useful enough.

Many people clearly do think this is useful.

I personally don't think it will be that complex. I'm willing to
review and maintain it if the patch works the way we want it to.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-12 Thread Robert Haas
On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs  wrote:
> On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas  wrote:
>> On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs  wrote:
>>> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu  wrote:
>>>
 However, I'm a bit afraid that it will confuse DBA if we use
 "restored" under the pg_xlog replay context, because we have
 already used "restored" that means a WAL file as successfully
 "copied" (not "replayed") from archive directory into pg_xlog
 directory under the archive recovery context.

 So, to determine the status of copying WAL files from
 archive directory, I think we can use "restored", or
 "could not restore" on failure.

 And to determine the status of replaying WAL files
 in pg_xlog directory (even if a WAL is copied from archive),
 we have to use "recover" or "replay".
>>>
>>> Agreed. I can change "restored" to "using", so we have two message types
>>>
>>> LOG:  restored log file "00080047" from archive
>>> LOG:  using pre-existing log file "00080047" from pg_xlog
>>
>> using seems pretty fuzzy to me.  replaying?
>
> That was my first thought, but the message relates to which file has
> been selected, and how. Once it has been selected it will be replayed.
> The idea is to have the two messages look similar.
>
> The original message was "restored log file..." and says nothing about
> replaying.
>
> We could change the old message (ugh! backwards compatibility alert)
>
>  LOG:  replaying log file "00080047" after restore from 
> archive
>  LOG:  replaying log file "00080047" already in pg_xlog
>
> which doesn't sound much stronger to me... not sure.

Hmm, I don't know.  But that phrasing does at least have the advantage
of being clearly parallel, which I like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas  wrote:
> On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs  wrote:
>> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu  wrote:
>>
>>> However, I'm a bit afraid that it will confuse DBA if we use
>>> "restored" under the pg_xlog replay context, because we have
>>> already used "restored" that means a WAL file as successfully
>>> "copied" (not "replayed") from archive directory into pg_xlog
>>> directory under the archive recovery context.
>>>
>>> So, to determine the status of copying WAL files from
>>> archive directory, I think we can use "restored", or
>>> "could not restore" on failure.
>>>
>>> And to determine the status of replaying WAL files
>>> in pg_xlog directory (even if a WAL is copied from archive),
>>> we have to use "recover" or "replay".
>>
>> Agreed. I can change "restored" to "using", so we have two message types
>>
>> LOG:  restored log file "00080047" from archive
>> LOG:  using pre-existing log file "00080047" from pg_xlog
>
> using seems pretty fuzzy to me.  replaying?

That was my first thought, but the message relates to which file has
been selected, and how. Once it has been selected it will be replayed.
The idea is to have the two messages look similar.

The original message was "restored log file..." and says nothing about
replaying.

We could change the old message (ugh! backwards compatibility alert)

 LOG:  replaying log file "00080047" after restore from archive
 LOG:  replaying log file "00080047" already in pg_xlog

which doesn't sound much stronger to me... not sure.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pgbench post-connection command

2012-01-12 Thread Simon Riggs
New -x option for pgbench executes the given command once after
connection of each session.

e.g.
pgbench -x "SET synchronous_commit = off"
pgbench -x "SET foo_extension.enabled = on"

Allows easier testing of user parameters. Command listed in final report

$ pgbench -c 2 -t 4 -x "set synchronous_commit = off"
starting vacuum...end.
transaction type: TPC-B (sort of)
post connect command: [set synchronous_commit = off]
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 4
number of transactions actually processed: 8/8
tps = 122.897304 (including connections establishing)
tps = 179.953212 (excluding connections establishing)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 9081f09..1a975d8 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -150,6 +150,7 @@ char	   *pgoptions = NULL;
 char	   *pgtty = NULL;
 char	   *login = NULL;
 char	   *dbName;
+char	   *postConnectCommand = NULL;
 
 volatile bool timer_exceeded = false;	/* flag from signal handler */
 
@@ -367,6 +368,7 @@ usage(const char *progname)
 	 "  -t NUM   number of transactions each client runs (default: 10)\n"
 		   "  -T NUM   duration of benchmark test in seconds\n"
 		   "  -v   vacuum all four standard tables before tests\n"
+		   "  -x SQLCMD    execute SQLCMD once after connection of each db session\n"
 		   "\nCommon options:\n"
 		   "  -d   print debugging output\n"
 		   "  -h HOSTNAME  database server host or socket directory\n"
@@ -454,6 +456,9 @@ doConnect(void)
 		return NULL;
 	}
 
+	if (postConnectCommand)
+		executeStatement(conn, postConnectCommand);
+
 	return conn;
 }
 
@@ -1769,6 +1774,7 @@ printResults(int ttype, int normal_xacts, int nclients,
 		s = "Custom query";
 
 	printf("transaction type: %s\n", s);
+	printf("post connect command: [%s]\n", (postConnectCommand ? postConnectCommand : "not set"));
 	printf("scaling factor: %d\n", scale);
 	printf("query mode: %s\n", QUERYMODE[querymode]);
 	printf("number of clients: %d\n", nclients);
@@ -1910,7 +1916,7 @@ main(int argc, char **argv)
 	state = (CState *) xmalloc(sizeof(CState));
 	memset(state, 0, sizeof(CState));
 
-	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:x:", long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -2062,6 +2068,9 @@ main(int argc, char **argv)
 					exit(1);
 				}
 				break;
+			case 'x':
+				postConnectCommand = optarg;
+				break;
 			case 0:
 				/* This covers long options which take no argument. */
 				break;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-12 Thread Robert Haas
On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs  wrote:
> On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu  wrote:
>
>> However, I'm a bit afraid that it will confuse DBA if we use
>> "restored" under the pg_xlog replay context, because we have
>> already used "restored" that means a WAL file as successfully
>> "copied" (not "replayed") from archive directory into pg_xlog
>> directory under the archive recovery context.
>>
>> So, to determine the status of copying WAL files from
>> archive directory, I think we can use "restored", or
>> "could not restore" on failure.
>>
>> And to determine the status of replaying WAL files
>> in pg_xlog directory (even if a WAL is copied from archive),
>> we have to use "recover" or "replay".
>
> Agreed. I can change "restored" to "using", so we have two message types
>
> LOG:  restored log file "00080047" from archive
> LOG:  using pre-existing log file "00080047" from pg_xlog

using seems pretty fuzzy to me.  replaying?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-12 Thread Joey Adams
I wrote an array_to_json function during GSoC 2010:


http://git.postgresql.org/gitweb/?p=json-datatype.git;a=blob;f=json_io.c#l289

It's not exposed as a procedure called array_to_json: it's part of the
to_json function, which decides what to do based on the argument type.

- Joey

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remembering bug #6123

2012-01-12 Thread Kevin Grittner
Tom Lane  wrote:
> "Kevin Grittner"  writes:
 
>> Going back through the patches we had to make to 9.0 to move to
>> PostgreSQL triggers, I noticed that I let the issues raised as bug
>> #6123 lie untouched during the 9.2 development cycle. In my view,
>> the best suggestion for a solution was proposed by Florian here:
> 
>> http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php
> 
> Do you mean this:
> 
> After every BEFORE trigger invocation, if the trigger returned
> non-NULL, check if latest row version is still the same as when
> the trigger started. If not, complain.
 
That is the concise statement of it, yes.
 
> While that sounds relatively safe, if possibly performance-
> impacting, it's not apparent to me how it fixes the problem you
> complained of.  The triggers you were using were modifying rows
> other than the one being targeted by the triggering action, so a
> test like the above would not notice that they'd done anything.
 
My initial use-case was that a BEFORE DELETE trigger was deleting
related "child" rows, and the BEFORE DELETE trigger at the child
level was updating counts on the original (parent) row.  The proposed
change would cause an error to be thrown when the parent level
returned a non-NULL value from its BEFORE DELETE trigger.  That would
prevent the silent corruption of the data, so it's a big step forward
in my view; but it's not the behavior we most want in our shop for
this particular case.  In the messages later in the thread, Florian
pointed out that this pattern would allow us to get the desired
behavior:
 
| BEFORE DELETE ON <parent>:
|   DELETE FROM <child> WHERE parent_id = OLD.id;
|   IF FOUND THEN
|     -- Removing children might have modified our row,
|     -- so returning non-NULL is not an option
|     DELETE FROM <parent> WHERE id = OLD.id;
|     RETURN NULL;
|   ELSE
|     -- No children removed, so our row should be unmodified
|     RETURN OLD;
|   END IF;
 
The advantage of Florian's approach is that it changes the default
behavior to something very safe, while allowing arbitrarily complex
behavior through correspondingly more complex code.
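
Spelled out as an actual trigger, Florian's pattern would look roughly like
this (a sketch; the parent/child table and column names are hypothetical):

    CREATE OR REPLACE FUNCTION parent_before_delete() RETURNS trigger AS $$
    BEGIN
        DELETE FROM child WHERE parent_id = OLD.id;
        IF FOUND THEN
            -- Deleting children fired triggers which updated counts on our
            -- row, so returning non-NULL is not an option.  Re-issue the
            -- delete and suppress the original one.
            DELETE FROM parent WHERE id = OLD.id;
            RETURN NULL;
        ELSE
            -- No children removed, so our row should be unmodified.
            RETURN OLD;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER parent_before_delete
        BEFORE DELETE ON parent
        FOR EACH ROW EXECUTE PROCEDURE parent_before_delete();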
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention

2012-01-12 Thread Simon Riggs
On Thu, Jan 5, 2012 at 6:26 PM, Simon Riggs  wrote:

> Patch to remove clog contention caused by dirty clog LRU.

v2, minor changes, updated for recent commits

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 69b6ef3..f3e08e6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -594,6 +594,26 @@ CheckPointCLOG(void)
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
+/*
+ * Conditionally flush the CLOG LRU.
+ *
+ * When a backend does ExtendCLOG we need to write the CLOG LRU if it is
+ * dirty. Performing I/O while holding XidGenLock prevents new write
+ * transactions from starting. To avoid that we flush the CLOG LRU, if
+ * we think that a page write is due soon, according to a heuristic.
+ *
+ * Note that we're reading ShmemVariableCache->nextXid without a lock
+ * since the exact value doesn't matter as input into our heuristic.
+ */
+void
+CLOGBackgroundFlushLRU(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int		threshold = (CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4));
+
+	if (TransactionIdToPgIndex(xid) > threshold)
+		SlruBackgroundFlushLRUPage(ClogCtl);
+}
 
 /*
  * Make sure that CLOG has room for a newly-allocated XID.
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 30538ff..aea6c09 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -885,6 +885,82 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
 }
 
 /*
+ * Identify the LRU slot but just leave it as it is.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+SlruIdentifyLRUSlot(SlruCtl ctl)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	int			cur_count;
+	int			bestslot;
+	int			best_delta;
+	int			best_page_number;
+
+	/*
+	 * If we find any EMPTY slot, just select that one. Else locate the
+	 * least-recently-used slot.
+	 *
+	 * Normally the page_lru_count values will all be different and so
+	 * there will be a well-defined LRU page.  But since we allow
+	 * concurrent execution of SlruRecentlyUsed() within
+	 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
+	 * acquire the same lru_count values.  In that case we break ties by
+	 * choosing the furthest-back page.
+	 *
+	 * In no case will we select the slot containing latest_page_number
+	 * for replacement, even if it appears least recently used.
+	 *
+	 * Notice that this next line forcibly advances cur_lru_count to a
+	 * value that is certainly beyond any value that will be in the
+	 * page_lru_count array after the loop finishes.  This ensures that
+	 * the next execution of SlruRecentlyUsed will mark the page newly
+	 * used, even if it's for a page that has the current counter value.
+	 * That gets us back on the path to having good data when there are
+	 * multiple pages with the same lru_count.
+	 */
+	cur_count = (shared->cur_lru_count)++;
+	best_delta = -1;
+	bestslot = 0;			/* no-op, just keeps compiler quiet */
+	best_page_number = 0;	/* ditto */
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			this_delta;
+		int			this_page_number;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			return slotno;
+		this_delta = cur_count - shared->page_lru_count[slotno];
+		if (this_delta < 0)
+		{
+			/*
+			 * Clean up in case shared updates have caused cur_count
+			 * increments to get "lost".  We back off the page counts,
+			 * rather than trying to increase cur_count, to avoid any
+			 * question of infinite loops or failure in the presence of
+			 * wrapped-around counts.
+			 */
+			shared->page_lru_count[slotno] = cur_count;
+			this_delta = 0;
+		}
+		this_page_number = shared->page_number[slotno];
+		if ((this_delta > best_delta ||
+			 (this_delta == best_delta &&
+			  ctl->PagePrecedes(this_page_number, best_page_number))) &&
+			this_page_number != shared->latest_page_number)
+		{
+			bestslot = slotno;
+			best_delta = this_delta;
+			best_page_number = this_page_number;
+		}
+	}
+
+	return bestslot;
+}
+
+/*
  * Select the slot to re-use when we need a free slot.
  *
  * The target page number is passed because we need to consider the
@@ -905,11 +981,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 	/* Outer loop handles restart after I/O */
 	for (;;)
 	{
-		int			slotno;
-		int			cur_count;
 		int			bestslot;
-		int			best_delta;
-		int			best_page_number;
+		int			slotno;
 
 		/* See if page already has a buffer assigned */
 		for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -919,69 +992,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 return slotno;
 		}
 
-		/*
-		 * If we find any EMPTY slot, just select that one. Else locate the
-		 * least-recently-used slot to replace.
-		 *
-		 * Normally the page_lru_count values will all be different and so
-		

[HACKERS] Simulating Clog Contention

2012-01-12 Thread Simon Riggs
In order to simulate real-world clog contention, we need to use
benchmarks that deal with real-world situations.

Currently, pgbench pre-loads data using COPY and executes a VACUUM so
that all hint bits are set on every row of every page of every table.
Thus, as pgbench runs it sees zero clog accesses from historical data.
As a result, clog access is minimised and the effects of clog
contention in the real world go unnoticed.

The following patch adds a pgbench option -I to load data using
INSERTs, so that we can begin benchmark testing with rows that carry
large numbers of distinct un-hinted transaction ids. With a database
pre-created this way, we will be better able to simulate, and thus
more easily measure, clog contention. Note that the current clog cache
has space for about 1 million xids, so a scale factor greater than 10
is required to really stress the clog.
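
As a rough way to confirm that an INSERT-based load really does leave
behind many distinct, un-hinted transaction ids, one could run
something like the following against the freshly loaded table (a
sketch only; exact counts depend on how the load was batched):

-- Each row loaded by its own INSERT should carry its own xmin, so
-- distinct_xids should be close to total_rows (unlike a COPY load,
-- where every row shares a single xmin).
SELECT count(DISTINCT xmin::text) AS distinct_xids,
       count(*)                   AS total_rows
FROM   pgbench_accounts;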

The patch uses multiple connections to load data using a predefined
script similar to the -N or -S logic.

$ pgbench --help
pgbench is a benchmarking tool for PostgreSQL.

Usage:
  pgbench [OPTIONS]... [DBNAME]

Initialization options:
  -i   invokes initialization mode using COPY
  -I   invokes initialization mode using INSERTs
...


$ pgbench -I -c 4 -t 1
creating tables...
filling accounts table with 100000 rows using inserts
set primary key...
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index
"pgbench_branches_pkey" for table "pgbench_branches"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index
"pgbench_tellers_pkey" for table "pgbench_tellers"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index
"pgbench_accounts_pkey" for table "pgbench_accounts"
done.
transactions option ignored
transaction type: Load pgbench_accounts using INSERTs
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
number of transactions per client: 25000
number of transactions actually processed: 100000/100000
tps = 828.194854 (including connections establishing)
tps = 828.440330 (excluding connections establishing)

Yes, my laptop really is that slow. Contributions to improve that
situation gratefully received.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 9081f09..7423532 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -282,6 +282,12 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* -I case */
+static char *insert_accounts = {
+	"\\set naccounts " CppAsString2(naccounts) "\n"
+	"INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (nextval('pgbench_accounts_load_seq'), 1 + (lastval()/(:naccounts)), 0);\n"
+};
+
 /* Function prototypes */
 static void setalarm(int seconds);
 static void *threadRun(void *arg);
@@ -340,7 +346,8 @@ usage(const char *progname)
 		   "Usage:\n"
 		   "  %s [OPTIONS]... [DBNAME]\n"
 		   "\nInitialization options:\n"
-		   "  -i   invokes initialization mode\n"
+		   "  -i   invokes initialization mode using COPY\n"
+		   "  -I   invokes initialization mode using INSERTs\n"
 		   "  -F NUM   fill factor\n"
 		   "  -s NUM   scaling factor\n"
 		   "  --index-tablespace=TABLESPACE\n"
@@ -1256,7 +1263,7 @@ disconnect_all(CState *state, int length)
 
 /* create tables and setup data */
 static void
-init(void)
+init(bool init_uses_copy)
 {
 	/*
 	 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
@@ -1308,6 +1315,8 @@ init(void)
 	if ((con = doConnect()) == NULL)
 		exit(1);
 
+	fprintf(stderr, "creating tables...\n");
+
 	for (i = 0; i < lengthof(DDLs); i++)
 	{
 		char		opts[256];
@@ -1356,50 +1365,73 @@ init(void)
 
 	executeStatement(con, "commit");
 
-	/*
-	 * fill the pgbench_accounts table with some data
-	 */
-	fprintf(stderr, "creating tables...\n");
+	if (init_uses_copy)
+	{
+		/*
+		 * fill the pgbench_accounts table with some data
+		 */
+		fprintf(stderr, "loading accounts table with %d rows using COPY\n",
+			naccounts * scale);
 
-	executeStatement(con, "begin");
-	executeStatement(con, "truncate pgbench_accounts");
+		executeStatement(con, "begin");
+		executeStatement(con, "truncate pgbench_accounts");
 
-	res = PQexec(con, "copy pgbench_accounts from stdin");
-	if (PQresultStatus(res) != PGRES_COPY_IN)
-	{
-		fprintf(stderr, "%s", PQerrorMessage(con));
-		exit(1);
-	}
-	PQclear(res);
+		res = PQexec(con, "copy pgbench_accounts from stdin");
+		if (PQresultStatus(res) != PGRES_COPY_IN)
+		{
+			fprintf(stderr, "%s", PQerrorMessage(con));
+			exit(1);
+		}
+		PQclear(res);
 
-	for (i = 0; i < naccounts * scale; i++)
-	{
-		int			j = i + 1;
+		for (i = 0; i < naccounts * scale; i++)
+		{
+			int			j = i + 1;
+
+			snprintf(sql, 256, "%d\t%d\t%d\t\n", j, i / naccounts + 1, 0);
+			if (PQputline(con, sql))
+			{
+fprintf(stderr, "PQputline failed\n");
+exit(1);
+	

Re: [HACKERS] CLOG contention, part 2

2012-01-12 Thread Simon Riggs
On Sun, Jan 8, 2012 at 2:25 PM, Simon Riggs  wrote:

> I've taken that idea and used it to build a second Clog cache, known
> as ClogHistory which allows access to the read-only tail of pages in
> the clog. Once a page has been written to for the last time, it will
> be accessed via the ClogHistory Slru in preference to the normal Clog
> Slru. This separates historical accesses by readers from current write
> access by committers. Historical access doesn't force dirty writes,
> nor are commits made to wait when historical access occurs.

Why do we need this in 9.2?

We now have clog_buffers = 32 and we have write rates ~16,000 tps. At
those write rates we fill a clog buffer every 2 seconds, so the clog
cache completely churns every 64 seconds.
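
For reference, the arithmetic behind those figures, assuming the
standard 8 kB clog pages (32,768 transaction statuses per page) -- a
rough sketch that matches the ~2 s / ~64 s numbers above, give or take
rounding:

SELECT 32768 / 16000.0      AS secs_per_page,   -- ~2 seconds per page
       32 * 32768 / 16000.0 AS secs_to_churn;   -- ~65 seconds for 32 buffers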

If we wish to achieve those rates in the real world, any access to
data that was written by a transaction more than a minute ago will
cause clog cache page faults, leading to stalls in new transactions.

To avoid those problems we need
* background writing of the clog LRU (already posted as a separate patch)
* a way of separating access to historical data from the main commit
path (this patch)

And to evaluate such situations, we need a way to simulate data that
contains many transactions. 32 buffers can hold just over 1 million
transaction ids, so benchmarks against databases containing > 10
million separate transactions are recommended (remembering that this
is just 10 mins of data on high TPS systems). A pgbench patch is
provided separately to aid in the evaluation.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-12 Thread Fujii Masao
On Thu, Jan 12, 2012 at 5:53 PM, Simon Riggs  wrote:
> PrimaryKeepaliveMessage is a message type that uses WalSndrMessage.
> That message type is only sent when the WalSndr is quiet, so what is
> the difference, in that case?

Oh, you are right. There is no difference.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 12:09 AM, Tom Lane  wrote:
> Simon Riggs  writes:
>> On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus  wrote:
>>> On 1/11/12 1:25 PM, Dan Scales wrote:
>>>> And just wanted to reiterate one other benefit of double writes -- it 
>>>> greatly reduces the size of the WAL logs.
>
>>> Even if you're replicating?
>
>> Yes, but it will increase random I/O on the standby when we replay if
>> we don't have FPWs.
>
> The question is how you prevent torn pages when a slave server crashes
> during replay.  Right now, the presence of FPIs in the WAL stream,
> together with the requirement that replay restart from a checkpoint,
> is sufficient to guarantee that any torn pages will be fixed up.  If
> you remove FPIs from WAL and don't transmit some substitute information,
> ISTM you've lost protection against slave server crashes.

Sure, you need either FPW or DW to protect you. Whatever is used on
the primary must also be used on the standbys.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 3:09 AM, Fujii Masao  wrote:
> On Thu, Jan 12, 2012 at 12:20 AM, Simon Riggs  wrote:
>>> +static void
>>> +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
>>>
>>> walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it?
>>> If yes, walEnd field in WalSndrMessage is also not used anywhere, so ISTM
>>> we can remove it.
>>
>> It's there to allow extension of the message processing to be more
>> complex than it currently is. Changing the protocol is much harder
>> than changing a function call.
>>
>> I'd like to keep it since it doesn't have any negative effects.
>
> OK. Another problem with walEnd is that WalDataMessageHeader.walEnd is not
> the same kind of location as WalSndrMessage.walEnd. The former indicates the
> location up to which WAL has already been flushed (though perhaps not yet
> sent), i.e., the "send request location". OTOH, the latter indicates the
> location up to which WAL has already been sent.
> Is this inconsistency intentional?

WalSndrMessage isn't set to anything; it's just a definition.

PrimaryKeepaliveMessage is a message type that uses WalSndrMessage.
That message type is only sent when the WalSndr is quiet, so what is
the difference, in that case?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] measuring spinning

2012-01-12 Thread Simon Riggs
On Thu, Jan 12, 2012 at 1:48 AM, Robert Haas  wrote:

> Just to whet your appetite, here are the top spinners on a 32-client
> SELECT-only test on a 32-core Itanium server.  All the locks not shown
> below have two orders of magnitude less of a problem than these do.

Please can you repeat the test, focusing on minutes 10-30 of a
30-minute test run. That removes much of the noise induced during
cache priming.

My suggested database size is one that is 80% of the size of RAM, with
shared_buffers set to 40% of RAM or 2GB, whichever is higher. That
represents the common case where people know how big their data is,
purchase RAM accordingly, and then set shared_buffers in line with
current wisdom.
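
Purely to illustrate that rule of thumb (a hypothetical 64 GB server,
not a measurement from this thread):

-- database sized at 80% of RAM, shared_buffers at the greater of
-- 40% of RAM and 2 GB
SELECT 64 * 0.80                AS target_db_size_gb,   -- 51.2
       greatest(64 * 0.40, 2.0) AS shared_buffers_gb;   -- 25.6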

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
