Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-01-29 Thread Matthew Draper
On 25/01/12 18:37, Hitoshi Harada wrote:
 I'm still not sure whether to just revise (almost) all the SQL function
 examples to use parameter names, and declare them the right choice; as
 it's currently written, named parameters still seem rather second-class.
 
 Agreed. 

I'll try a more comprehensive revision of the examples.

 The patch seems ok, except an example I've just found.
 
 db1=# create function t(a int, t t) returns int as $$ select t.a $$
 language sql;
 CREATE FUNCTION
 db1=# select t(0, row(1, 2)::t);
  t
 ---
  1
 (1 row)
 
 Should we throw an error for such ambiguity? Or did you make it happen
 intentionally? If the latter, we should also mention the rule in the
 manual.


I did consider it, and felt it was the most consistent:

# select t.x, t, z from (select 1) t(x), (select 2) z(t);
 x | t |  z
---+---+-
 1 | 2 | (2)
(1 row)


I haven't yet managed to find the above behaviour described in the
documentation either, though. To me, it feels like an obscure corner
case, whose description would leave the rules seeming more complicated
than they generally are.

Maybe it'd be better suited to be explicitly discussed in the comments?


Thanks,

Matthew

-- 
matt...@trebex.net

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-01-29 Thread Kohei KaiGai
2012/1/28 Kohei KaiGai kai...@kaigai.gr.jp:
 2012/1/26 Robert Haas robertmh...@gmail.com:
 On Thu, Jan 26, 2012 at 2:07 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2012/1/26 Robert Haas robertmh...@gmail.com:
 I'm wondering if a function would be a better fit than a GUC.  I don't
 think you can really restrict the ability to revert a GUC change -
 i.e. if someone does a SET and then a RESET, you pretty much have to
 allow that.  I think.  But if you expose a function then it can work
 however you like.

 One benefit of using a GUC is that we can utilize the existing mechanism to
 revert a value set within a transaction block on error.
 If we implement the same functionality with functions, would XactCallback
 allow sepgsql to get control at the appropriate timing?

 Not sure, but I thought the use case was to set this at connection
 startup time and then hand the connection off to a client.  What keeps
 the client from just issuing RESET?

 In the use case of Joshua, the original security label has no privileges
 to reference any tables, and it cannot transition to any other domain without
 the credential within IC-cards. Thus, RESET is harmless.

 However, I also assume one other possible use-case: the originator
 has privileges to transition to 10 different domains, but is not allowed to
 revert. In this case, a RESET without permission checks is harmful.
 (This scenario will be suitable with a multi-category model.)

 Apart from this issue, I found a problem in the implementation using GUC
 options. It causes a backend crash if it tries to reset sepgsql.client_label
 without permission inside an error handler, because of double errors.

I have an idea to avoid this scenario.

The problematic case is resetting the GUC variable because of a transaction
rollback after a permission violation; we don't intend to apply permission
checks there, since the value being reverted was never committed.
Thus, I considered checking the state of the transaction using
IsTransactionState() in the assign_hook of the GUC variable.

Unlike a function-based implementation, this lets us utilize the existing
infrastructure to make client_label transaction-aware.

It shall be implemented as:

void
sepgsql_assign_client_label(const char *newval, void *extra)
{
	if (!IsTransactionState())
		return;

	/* check whether newval is a valid security context */

	/* check permission to switch the current security context */
}

How about such an implementation?
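For reference, a minimal sketch of how such an assign hook could be wired up,
assuming the GUC is registered with DefineCustomStringVariable() in the
module's _PG_init(); the boot value, GUC context and flags below are purely
illustrative, not the actual sepgsql code:

#include "postgres.h"
#include "utils/guc.h"

/* assign hook as sketched above */
extern void sepgsql_assign_client_label(const char *newval, void *extra);

static char *client_label = NULL;

void
_PG_init(void)
{
	DefineCustomStringVariable("sepgsql.client_label",
							   "Security label of the client",
							   NULL,
							   &client_label,
							   "",				/* boot value (illustrative) */
							   PGC_USERSET,		/* context (illustrative) */
							   0,				/* flags */
							   NULL,			/* check_hook */
							   sepgsql_assign_client_label,
							   NULL);			/* show_hook */
}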

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-01-29 Thread Joachim Wieland
On Fri, Jan 27, 2012 at 4:57 PM, Robert Haas robertmh...@gmail.com wrote:
 But even if you do know that subclassing
 is intended, that doesn't prove that the particular Archive object is
 always going to be an ArchiveHandle under the hood.  If it is, why not
 just pass it as an ArchiveHandle to begin with?

I know that you took back some of your comments, but I'm with you
here. Archive is allocated as an ArchiveHandle and then cast back to
Archive*, so you always know that an Archive is an ArchiveHandle. I'm
all for getting rid of Archive and just using ArchiveHandle throughout
pg_dump, which would get rid of these useless casts. I admit that I
might have made it a bit worse by adding a few more of these casts, but
the fundamental issue was already there, and there is precedent for
casting between them in both directions :-)

Joachim

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-01-29 Thread Tom Lane
Joachim Wieland j...@mcknight.de writes:
 I know that you took back some of your comments, but I'm with you
 here. Archive is allocated as an ArchiveHandle and then cast back to
 Archive*, so you always know that an Archive is an ArchiveHandle. I'm
 all for getting rid of Archive and just using ArchiveHandle throughout
 pg_dump which would get rid of these useless casts.

I'd like to see a more thoroughgoing look at the basic structure of
pg_dump.  Everybody who's ever looked at that code has found it
confusing, with the possible exception of the original author who is
long gone from the project anyway.  I don't know exactly what would make
it better, but the useless distinction between Archive and ArchiveHandle
seems like a minor annoyance, not the core disease.

Not that there'd be anything wrong with starting with that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-01-29 Thread Simon Riggs
On Sat, Jan 28, 2012 at 1:52 PM, Simon Riggs si...@2ndquadrant.com wrote:

 Also, I think the general approach is wrong.  The only reason to have
 these pages in shared memory is that we can control access to them to
 prevent write/write and read/write corruption.  Since these pages are
 never written, they don't need to be in shared memory.   Just read
 each page into backend-local memory as it is needed, either
 palloc/pfree each time or using a single reserved block for the
 lifetime of the session.  Let the kernel worry about caching them so
 that the above mentioned reads are cheap.

 Will think on that.

For me, there are arguments both ways as to whether it should be in
shared or local memory.

The one factor that makes the answer "shared" for me is that it's much
easier to reuse existing SLRU code. We don't need to invent a new way
of caching/access etc. We just rewire what we already have. So
overall, the local/shared debate is much less important than the
robustness/code reuse angle. That's what makes this patch fairly
simple.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-01-29 Thread Kohei KaiGai
Hi Harada-san,

I checked the fdw_helper_funcs_v3.patch, pgsql_fdw_v5.patch and
pgsql_fdw_pushdown_v1.patch. My comments are below.

[BUG]
Even though pgsql_fdw tries to push down qualifiers that are executable
on the remote side in deparseSql(), it does not remove the pushed-down
qualifiers from baserel->baserestrictinfo, so these qualifiers are
eventually executed twice.

See the result of this EXPLAIN.
  postgres=# EXPLAIN SELECT * FROM ft1 WHERE a > 2 AND f_leak(b);
                             QUERY PLAN
  --------------------------------------------------------------
   Foreign Scan on ft1  (cost=107.43..122.55 rows=410 width=36)
     Filter: (f_leak(b) AND (a > 2))
     Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR SELECT
  a, b FROM public.t1 WHERE (a > 2)
  (3 rows)

My expectation is that (a > 2) is executed on the remote side and f_leak(b)
on the local side. But the filter of the foreign scan on ft1 contains both
qualifiers. A RestrictInfo has to be removed from the local filter once it
gets pushed down.
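For illustration only, the fix could look something like the following
fragment at plan time; is_foreign_expr() here is an assumed helper standing
in for whatever test deparseSql() uses to decide push-down safety, so this
is a sketch rather than the patch's actual code:

	/* Keep only the quals that still need to be checked locally. */
	List	   *local_quals = NIL;
	ListCell   *lc;

	foreach(lc, baserel->baserestrictinfo)
	{
		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

		/* assumed helper mirroring the remote-safety decision */
		if (!is_foreign_expr(root, baserel, rinfo->clause))
			local_quals = lappend(local_quals, rinfo);
	}
	baserel->baserestrictinfo = local_quals;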

[Design comment]
I'm not sure why store_result() uses MessageContext to save the
Tuplestorestate within PgsqlFdwExecutionState.
The source code comment says it is used to avoid memory leaks in error
cases. I have had a similar experience implementing my own FDW module,
so I understand that the per-scan context has already been cleared by the
time the resource-release callback runs; thus, handles to external
resources have to be saved in a separate memory context.
In my personal opinion, the right design is to declare a memory context
for pgsql_fdw itself, instead of abusing an existing memory context.
(An even better design is to define a sub-memory-context for each foreign
scan, and then delete the sub-context after releasing the handles.)
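As a concrete illustration of the above (a sketch only; the parent context,
names and sizes are assumptions, not the actual pgsql_fdw code):

#include "postgres.h"
#include "utils/memutils.h"

static MemoryContext pgsql_fdw_context = NULL;

static MemoryContext
pgsql_fdw_get_scan_context(void)
{
	/* module-level context, created once and never reset */
	if (pgsql_fdw_context == NULL)
		pgsql_fdw_context = AllocSetContextCreate(TopMemoryContext,
												  "pgsql_fdw data",
												  ALLOCSET_DEFAULT_MINSIZE,
												  ALLOCSET_DEFAULT_INITSIZE,
												  ALLOCSET_DEFAULT_MAXSIZE);

	/* one child per foreign scan, deleted after releasing the handles */
	return AllocSetContextCreate(pgsql_fdw_context,
								 "pgsql_fdw scan data",
								 ALLOCSET_SMALL_MINSIZE,
								 ALLOCSET_SMALL_INITSIZE,
								 ALLOCSET_SMALL_MAXSIZE);
}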

[Design comment]
When should BEGIN be issued on the remote side?
connect_pg_server() is the only chance to issue a BEGIN command on the
remote side, when the connection is opened. However, the connection is
kept open unless an error is raised, so the remote side will keep working
within a single transaction block even while the local side iterates
through pairs of BEGIN and COMMIT.
I'd like to suggest closing the remote transaction block at the end of
the scan, transaction, or sub-transaction.
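One possible shape for that, sketched with an XactCallback (the connection
variable and tracking flag are assumed module-level state, and this is not
the actual pgsql_fdw code):

#include "postgres.h"
#include "access/xact.h"
#include "libpq-fe.h"

static PGconn *conn = NULL;				/* assumed module-level connection */
static bool remote_xact_open = false;	/* assumed tracking flag */

static void
pgsql_fdw_xact_callback(XactEvent event, void *arg)
{
	PGresult   *res;

	if (conn == NULL || !remote_xact_open)
		return;

	if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
	{
		/* close the remote transaction block when the local one ends */
		res = PQexec(conn, event == XACT_EVENT_COMMIT ? "COMMIT" : "ROLLBACK");
		PQclear(res);
		remote_xact_open = false;
	}
}

/* registered once, e.g. right after the connection issues its BEGIN:
 *     RegisterXactCallback(pgsql_fdw_xact_callback, NULL);
 */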

[Comment to improve]
Also, which transaction isolation level should be specified in this case?
One idea is to choose the remote isolation level according to the current
isolation level on the local side.
(Of course, it is your choice if this is not supported right now.)

[Comment to improve]
It seems to me the design of exprFunction is not future-proof: if we add
a new node type that contains two or more function calls, it can only
return a single function OID.
I think the right design is to handle individual node types within the
large switch statement in foreign_expr_walker().
Of course, that is just my feeling.
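For what it's worth, a rough sketch of that per-node-type approach;
is_builtin() is an assumed helper, and returning true here means the
expression is NOT safe to push down (matching expression_tree_walker()
semantics):

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

extern bool is_builtin(Oid oid);		/* assumed helper */

static bool
foreign_expr_walker(Node *node, void *context)
{
	if (node == NULL)
		return false;

	switch (nodeTag(node))
	{
		case T_OpExpr:
			if (!is_builtin(((OpExpr *) node)->opno))
				return true;	/* operator not known on the remote side */
			break;
		case T_FuncExpr:
			if (!is_builtin(((FuncExpr *) node)->funcid))
				return true;	/* function not known on the remote side */
			break;
		default:
			break;
	}
	return expression_tree_walker(node, foreign_expr_walker, context);
}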

[Comment to improve]
pgsql_fdw_handler() allocates an FdwRoutine using makeNode() and then
sets the function pointers field by field.
Why don't you return a pointer to a statically declared FdwRoutine
variable initialized at compile time, like:

  static FdwRoutine pgsql_fdw_routine = {
      .type               = T_FdwRoutine,
      .PlanForeignScan    = pgsqlPlanForeignScan,
      .ExplainForeignScan = pgsqlExplainForeignScan,
      .BeginForeignScan   = pgsqlBeginForeignScan,
      .IterateForeignScan = pgsqlIterateForeignScan,
      .ReScanForeignScan  = pgsqlReScanForeignScan,
      .EndForeignScan     = pgsqlEndForeignScan,
  };

  Datum
  pgsql_fdw_handler(PG_FUNCTION_ARGS)
  {
      PG_RETURN_POINTER(&pgsql_fdw_routine);
  }

[Question to implementation]
At pgsqlIterateForeignScan(), it applies a NULL check on festate->tuples
and a boolean check on festate->cursor_opened.
Do we have a possible scenario where festate->tuples is not NULL but
festate->cursor_opened is false, or the opposite combination?
If the NULL check on festate->tuples is enough to detect the first call of
the iterate callback, I would prefer not to keep a redundant flag.

Thanks,

2011/12/14 15:02 Shigeru Hanada shigeru.han...@gmail.com:
 (2011/12/13 14:46), Tom Lane wrote:
 Shigeru Hanada shigeru.han...@gmail.com writes:
 Agreed.  How about adding a per-column boolean FDW option, say
 pushdown, to pgsql_fdw?  Users can tell pgsql_fdw that the column can
 be pushed down safely by setting this option to true.

 [ itch... ] That doesn't seem like the right level of granularity.
 ISTM the problem is with whether specific operators have the same
 meaning at the far end as they do locally.  If you try to attach the
 flag to columns, you have to promise that *every* operator on that
 column means what it does locally, which is likely to not be the
 case ever if you look hard enough.  Plus, having to set the flag on
 each individual column of the same datatype seems pretty tedious.

 Indeed, I too think that labeling on each columns is not the best way,
 but 

Re: [HACKERS] Group commit, revised

2012-01-29 Thread Jesper Krogh

On 2012-01-29 01:48, Jeff Janes wrote:

I ran three modes, head, head with commit_delay, and the group_commit patch

shared_buffers = 600MB
wal_sync_method=fsync

optionally with:
commit_delay=5
commit_siblings=1

pgbench -i -s40

for clients in 1 5 10 15 20 25 30
pgbench -T 30 -M prepared -c $clients -j $clients

ran 5 times each, taking maximum tps from the repeat runs.

The results are impressive.

clients   head   head_commit_delay   group_commit
1         23.9          23.0                23.9
5         31.0          51.3                59.9
10        35.0          56.5                95.7
15        35.6          64.9               136.4
20        34.3          68.7               159.3
25        36.5          64.1               168.8
30        37.2          83.8                71.5

I haven't inspected that deep fall off at 30 clients for the patch.

By way of reference, if I turn off synchronous commit, I get
tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.

If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?


Adding to the list of tests that aren't exactly a real-world system, I decided
to repeat Jeff's tests on an Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
with 4GB of memory and an Intel X25-M 160GB SSD drive underneath.


        Baseline    Commit delay    Group commit
1   1168.67 1233.33 1212.67
5   2611.33 3022.00 2647.67
10  3044.67 .33 3296.33
15  3153.33 3177.00 3456.67
20  3087.33 3126.33 3618.67
25  2715.00 2359.00 3309.33
30  2736.33 2831.67 2737.67


Numbers are average over 3 runs.

I have set checkpoint_segments to 30; otherwise the same configuration as Jeff.

Attached is a graph.

The nice conclusion is: group commit outperforms baseline in all runs (on
this system).


My purpose was actually more to quantify the difference between a single SSD
and a single rotating disk.

--
Jesper
attachment: pgbench.png
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PGCon 2012 Call for Papers - extension

2012-01-29 Thread Dan Langille
We apologize that http://www.bsdcan.org/ was offline for 12 hours from early 
Sunday morning.

The deadline for submissions has been extended to Tuesday 31 January.

PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of
Ottawa.  It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks.  Proposals can be quite 
simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump-start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

8 Jan 2012 Proposal acceptance begins
31 Jan 2012 Proposal acceptance ends
19 Feb 2012 Confirmation of accepted proposals

See also http://www.pgcon.org/2012/papers.php

Instructions for submitting a proposal to PGCon 2012 are available
from: http://www.pgcon.org/2012/submissions.php

-- 
Dan Langille - http://langille.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-01-29 Thread Simon Riggs
On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it was. Sorry about that. New version attached, retesting while
 you read this.

 In my hands I could never get this patch to do anything.  The new
 cache was never used.

 I think that that was because RecentXminPageno never budged from -1.

 I think that that, in turn, is because the comparison below can never
 return true, because the comparison is casting both sides to uint, and
 -1 cast to uint is very large

        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
                 ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Thanks for looking at the patch.

The patch works fine. RecentXminPageno does move forwards as it is
supposed to and there are no uints anywhere in that calculation.

The pageno only moves forwards every 32,000 transactions, so I'm
guessing that your testing didn't go on for long enough to show it
working correctly.

As regards effectiveness, you need to execute more than 1 million
transactions before the main clog cache fills, which might sound like a
lot, but it's approximately 1 minute of heavy transactions at the
highest rate Robert has published.

I've specifically designed the pgbench changes required to simulate
conditions of clog contention to help in the evaluation of this patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-29 Thread Greg Smith

On 01/28/2012 07:48 PM, Jeff Janes wrote:

Others are going to test this out on high-end systems. I wanted to
try it out on the other end of the scale.  I've used a Pentium 4,
3.2GHz,
with 2GB of RAM and with a single IDE drive running ext4.  ext4 is
amazingly bad on IDE, giving about 25 fsync's per second (and it lies
about fdatasync, but apparently not about fsync)


Fantastic, I had to stop for a minute to check the date on your message 
for a second there, make sure it hadn't come from some mail server 
that's been backed up on delivery the last five years.  I'm cleaning 
house toward testing this out here, and I was going to test on the same 
system using both fast and horribly slow drives.  Both ends of the scale 
are important, and they benefit in a very different way from these changes.



I haven't inspected that deep fall off at 30 clients for the patch.
By way of reference, if I turn off synchronous commit, I get
tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
bound on what could be achieved by the best possible group committing
method.


This sort of thing is why I suspect that to completely isolate some 
results, we're going to need a moderately high end server--with lots of 
cores--combined with an intentionally mismatched slow drive.  It's too 
easy to get pgbench and/or PostgreSQL to choke on something other than 
I/O when using smaller core counts.  I don't think I have anything where 
the floor is 24 TPS per client though.  Hmmm...I think I can connect an 
IDE drive to my MythTV box and format it with ext4.  Thanks for the test 
idea.


One thing you could try on this system is the -N option ("Do not update
pgbench_tellers and pgbench_branches").  That eliminates a lot of the
contention that might be pulling down your higher core count tests,
while still giving a completely valid test of whether the group commit
mechanism works.  Not sure whether that will push up the top end
usefully for you; worth a try if you have time to test again.



If the group_commit patch goes in, would we then rip out commit_delay
and commit_siblings?


The main reason those are still hanging around at all is to allow
pushing on the latency vs. throughput trade-off on really busy systems.
The use case is that you expect, say, 10 clients to constantly be
committing at a high rate.  So if there's just one committing so far,
assume it's the leading edge of a wave and pause a bit for the rest to
come in.  I don't think the cases where this is useful behavior--people
both want it and the current mechanism provides it--are very common in
the real world.  It can be useful for throughput-oriented benchmarks
though, which is why I'd say it hasn't been killed off yet.


We'll have to see whether the final form this makes sense in will 
usefully replace that sort of thing.  I'd certainly be in favor of 
nuking commit_delay and commit_siblings with a better option; it would 
be nice if we don't eliminate this tuning option in the process though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-29 Thread Jeff Davis
On Tue, 2012-01-24 at 16:07 +0400, Alexander Korotkov wrote:
 Hi!
 
 
 New version of patch is attached.

Thank you for the updates. I have a small patch attached.

The only code change I made was very minor: I changed the constants used
in the penalty function because your version used INFINITE_BOUND_PENALTY
when adding an empty range, and that didn't quite make sense to me. If
I'm mistaken you can leave it as-is.

I also attached range-gist-test.sql, which I used for a performance
test. I mix various types of ranges together in a larger table of 1.1M
tuples. And then I create a smaller table that only contains normal
ranges and empty ranges. There are two tests:
  1. Create an index on the big table
  2. Do a range join (using overlaps rather than equals) where the
smaller table is on the outer side of a nested loop join and an index
scan over the larger table on the inner.

The index creation time reduces by a small amount with the patch, from
around 16s without the patch to around 13s with the patch. The query
time, however, dropped from around 26s to around 14s! Almost 2x speedup
with the patch!

Moreover, looking at the loop timing in the explain analyze output, it
goes from about 7..24 ms per loop down to about 1.5..13 ms per loop.
That seems to indicate that the index distribution is better, with more
queries returning quickly.

So, great work Alexander! Very convincing results.

Marking ready for committer, but please apply my comment fixes at your
discretion.

Regards,
Jeff Davis

PS: the test was run on my workstation (Intel(R) Core(TM) i7-2600 CPU @
3.40GHz) with work_mem=512MB, shared_buffers=512MB, and
checkpoint_segments=32. The rest of the settings were default.




range-gist-comments.patch.gz
Description: GNU Zip compressed data

\timing on

drop table big;
drop table small;

create temp table tmp_foo(i int, ir int8range);
insert into tmp_foo select g % 100, 'empty'::int8range from generate_series(1,5) g;
insert into tmp_foo select g % 100, int8range(NULL,NULL) from generate_series(1,1) g;
insert into tmp_foo select g % 100, int8range(NULL,((random()-0.5)*g*10)::int8) from generate_series(1,2) g;
insert into tmp_foo select g % 100, int8range(((random()-0.5)*g*10)::int8,NULL) from generate_series(1,2) g;
insert into tmp_foo select g % 100,
  int8range(
(g*10 + 10*(random()-0.5))::int8,
(g*10 + 10 + 10*(random()-0.5))::int8 )
  from generate_series(1,100) g;

create table big as select * from tmp_foo order by random();
drop table tmp_foo;

create table tmp_foo(i int, ir int8range);
insert into tmp_foo select g*1000 % 100, 'empty'::int8range from generate_series(1,50) g;
insert into tmp_foo select g*1000 % 100,
  int8range(
(g*10*1000 + 10*(random()-0.5))::int8,
(g*10*1000 + 10 + 10*(random()-0.5))::int8 )
  from generate_series(1,1000) g;

create table small as select * from tmp_foo order by random();
drop table tmp_foo;

vacuum;
vacuum;
vacuum;

create index big_idx on big using gist (ir);

analyze;

set enable_bitmapscan=false;

explain analyze select sum(upper(intersection) - lower(intersection))
from (
  select small.ir * big.ir as intersection from small,big
  where small.ir && big.ir and not lower_inf(big.ir) and not upper_inf(big.ir)
) s;

set enable_bitmapscan to default;
set enable_indexscan=false;

explain analyze select sum(upper(intersection) - lower(intersection))
from (
  select small.ir * big.ir as intersection from small,big
  where small.ir && big.ir and not lower_inf(big.ir) and not upper_inf(big.ir)
) s;

set enable_indexscan to default;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-01-29 Thread Jeff Janes
On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it was. Sorry about that. New version attached, retesting while
 you read this.

 In my hands I could never get this patch to do anything.  The new
 cache was never used.

 I think that that was because RecentXminPageno never budged from -1.

 I think that that, in turn, is because the comparison below can never
 return true, because the comparison is casting both sides to uint, and
 -1 cast to uint is very large

        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
                 ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

 Thanks for looking at the patch.

 The patch works fine. RecentXminPageno does move forwards as it is
 supposed to and there are no uints anywhere in that calculation.

Maybe it is system dependent.  Or, are you running this patch on top
of some other uncommitted patch (other than the pgbench one)?

RecentXmin is a TransactionID, which is a uint32.
I think the TransactionIdToPage macro preserves that.

If I cast to a int, then I see advancement:

if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin))
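To make the suspected mechanism concrete, here is a tiny standalone demo
(plain C, not PostgreSQL code) of how a signed -1 compared against an
unsigned 32-bit value never tests as smaller unless one side is cast:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	int32_t		pageno = -1;	/* like the initial RecentXminPageno */
	uint32_t	newpage = 149;	/* like TransactionIdToPage(RecentXmin) */

	/* pageno is converted to unsigned here: -1 becomes 4294967295 */
	if (pageno < newpage)
		printf("advances\n");
	else
		printf("never advances: -1 compares as %u\n", (uint32_t) pageno);

	/* with an explicit cast the comparison stays signed */
	if (pageno < (int32_t) newpage)
		printf("advances with the cast\n");
	return 0;
}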


...
 I've specifically designed the pgbench changes required to simulate
 conditions of clog contention to help in the evaluation of this patch.

Yep, I've used that one for the testing.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?

2012-01-29 Thread Simon Riggs
On Thu, Jan 26, 2012 at 2:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jan 21, 2012 at 10:11 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Your caution is wise. All users of an index have already checked
 whether the index is usable at plan time, so although there is much
 code that assumes they can look at the index itself, that is not
 executed until after the correct checks.

 I'll look at VACUUM and other utilities, so thanks for that.

 I don't see much point in having the higher level lock, except perhaps
 as a test this code works.

 I thought of another way this can cause a problem: suppose that while
 we're dropping the relation with only ShareUpdateExclusiveLock, we get
 as far as calling DropRelFileNodeBuffers.  Just after we finish, some
 other process that holds AccessShareLock or RowShareLock or
 RowExclusiveLock reads and perhaps even dirties a page in the
 relation.

I can't see any way that situation can occur. The patch *explicitly*
waits until all people that can see the index as usable have dropped
their lock. So I don't think this is necessary. Having said that,
since we are talking about the index and not the whole table, if I
believe the above statement then I can't have any reasonable objection
to doing as you suggest.

Patch now locks index in AccessExclusiveLock in final stage of drop.

v3 attached.

If you have suggestions to improve the grammar issues, they're most
welcome. Otherwise this seems good to go.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/ref/drop_index.sgml b/doc/src/sgml/ref/drop_index.sgml
index 7177ef2..aeb1531 100644
--- a/doc/src/sgml/ref/drop_index.sgml
+++ b/doc/src/sgml/ref/drop_index.sgml
@@ -21,7 +21,9 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-DROP INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ...] [ CASCADE | RESTRICT ]
+DROP INDEX
+	[ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ...] [ CASCADE | RESTRICT ]
+	CONCURRENTLY <replaceable class="PARAMETER">name</replaceable>
 </synopsis>
  </refsynopsisdiv>
 
@@ -50,6 +52,30 @@ DROP INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> [, ..
    </varlistentry>
 
    <varlistentry>
+    <term><literal>CONCURRENTLY</literal></term>
+    <listitem>
+     <para>
+      When this option is used, <productname>PostgreSQL</> will drop the
+      index without taking any locks that prevent concurrent selects, inserts,
+      updates, or deletes on the table; whereas a standard index drop
+      waits for a lock that locks out everything on the table until it's done.
+      Concurrent drop index is a two stage process. First, we mark the index
+      both invalid and not ready then commit the change. Next we wait until
+      there are no users locking the table who can see the index.
+     </para>
+     <para>
+      There are several caveats to be aware of when using this option.
+      Only one index name can be specified if the <literal>CONCURRENTLY</literal>
+      parameter is specified. Only one concurrent index drop can occur on a
+      table at a time and no modifications on the table are allowed meanwhile.
+      Regular <command>DROP INDEX</> command can be performed within
+      a transaction block, but <command>DROP INDEX CONCURRENTLY</> cannot.
+      There is no CASCADE option when dropping an index concurrently.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><replaceable class="PARAMETER">name</replaceable></term>
     <listitem>
      <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index db86262..58d8f5c 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -173,8 +173,8 @@ static void reportDependentObjects(const ObjectAddresses *targetObjects,
 	   const ObjectAddress *origObject);
 static void deleteOneObject(const ObjectAddress *object,
 			Relation depRel, int32 flags);
-static void doDeletion(const ObjectAddress *object);
-static void AcquireDeletionLock(const ObjectAddress *object);
+static void doDeletion(const ObjectAddress *object, int flags);
+static void AcquireDeletionLock(const ObjectAddress *object, int flags);
 static void ReleaseDeletionLock(const ObjectAddress *object);
 static bool find_expr_references_walker(Node *node,
 			find_expr_references_context *context);
@@ -232,7 +232,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.	(Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object);
+	AcquireDeletionLock(object, 0);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
@@ -316,7 +316,7 @@ performMultipleDeletions(const ObjectAddresses *objects,
 		 * Acquire deletion lock on each target object.  (Ideally the caller
 		 * has done this already, but many places are sloppy about it.)
 		 */
-		AcquireDeletionLock(thisobj);
+	

Re: [HACKERS] CLOG contention, part 2

2012-01-29 Thread Jeff Janes
On Sun, Jan 29, 2012 at 1:41 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Yes, it was. Sorry about that. New version attached, retesting while
 you read this.

 In my hands I could never get this patch to do anything.  The new
 cache was never used.

 I think that that was because RecentXminPageno never budged from -1.

 I think that that, in turn, is because the comparison below can never
 return true, because the comparison is casting both sides to uint, and
 -1 cast to uint is very large

        /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
        if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
                 ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

 Thanks for looking at the patch.

 The patch works fine. RecentXminPageno does move forwards as it is
 supposed to and there are no uints anywhere in that calculation.

 Maybe it is system dependent.  Or, are you running this patch on top
 of some other uncommitted patch (other than the pgbench one)?

 RecentXmin is a TransactionID, which is a uint32.
 I think the TransactionIdToPage macro preserves that.

 If I cast to a int, then I see advancement:

 if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin))

And to clarify, if I don't do the cast, I don't see advancement, using
this code:

elog(LOG, "JJJ RecentXminPageno %d, %d",
     ClogCtl->shared->RecentXminPageno, TransactionIdToPage(RecentXmin));
if (ClogCtl->shared->RecentXminPageno <
    TransactionIdToPage(RecentXmin))
        ClogCtl->shared->RecentXminPageno =
            TransactionIdToPage(RecentXmin);

Then using your pgbench -I -s 100 -c 8 -j8, I get tons of log entries like:

LOG:  JJJ RecentXminPageno -1, 149
STATEMENT:  INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES
(nextval('pgbench_accounts_load_seq'), 1 + (lastval()/(10)), 0);

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-01-29 Thread Simon Riggs
On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 If I cast to a int, then I see advancement:

I'll initialise it as 0, rather than -1 and then we don't have a
problem in any circumstance.


 I've specifically designed the pgbench changes required to simulate
 conditions of clog contention to help in the evaluation of this patch.

 Yep, I've used that one for the testing.

Most of the current patch is just bookkeeping to keep track of the
point from which we can look at history in a read-only manner.

I've isolated the code better to allow you to explore various
implementation options. I don't see any performance difference between
any of them really, but you're welcome to look.

Please everybody note that the clog history doesn't even become active
until the first checkpoint, so this is dead code until we've hit the
first checkpoint cycle and completed a million transactions since
startup. So it's designed to tune for real-world situations, and is not
easy to benchmark. (Maybe we could start earlier, but having extra
code just for the first few minutes seems a waste of energy, especially
since we must hit a million XIDs anyway.)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 69b6ef3..8ab1b3c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,7 @@
 #include "access/transam.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "utils/snapmgr.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -70,12 +71,19 @@
 
 /*
  * Link to shared-memory data structures for CLOG control
+ *
+ * As of 9.2, we have 2 structures for commit log data.
+ * ClogCtl manages the main read/write part of the commit log, while
+ * the ClogHistoryCtl manages the now read-only, older part. ClogHistory
+ * removes contention from the path of transaction commits.
  */
 static SlruCtlData ClogCtlData;
+static SlruCtlData ClogHistoryCtlData;
 
-#define ClogCtl (&ClogCtlData)
-
+#define ClogCtl 		(&ClogCtlData)
+#define ClogHistoryCtl	(&ClogHistoryCtlData)
 
+static XidStatus TransactionIdGetStatusHistory(TransactionId xid);
 static int	ZeroCLOGPage(int pageno, bool writeXlog);
 static bool CLOGPagePrecedes(int page1, int page2);
 static void WriteZeroPageXlogRec(int pageno);
@@ -296,6 +304,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
 
 		/* ... then the main transaction */
 		TransactionIdSetStatusBit(xid, status, lsn, slotno);
+
+		/* When we commit advance ClogCtl's shared RecentXminPageno if needed */
+		if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
+			ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);
 	}
 
 	/* Set the subtransactions */
@@ -387,6 +399,7 @@ TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i
 XidStatus
 TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 {
+	bool		useClogHistory = true;
 	int			pageno = TransactionIdToPage(xid);
 	int			byteno = TransactionIdToByte(xid);
 	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -397,15 +410,64 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 
 	/* lock is acquired by SimpleLruReadPage_ReadOnly */
 
-	slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
-	byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
+	/*
+	 * Decide whether to use main Clog or read-only ClogHistory.
+	 *
+	 * Our knowledge of the boundary between the two may be a little out
+	 * of date, so if we try Clog and can't find it we need to try again
+	 * against ClogHistory.
+	 */
+	if (pageno >= ClogCtl->recent_oldest_active_page_number)
+	{
+		slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
+		if (slotno >= 0)
+			useClogHistory = false;
+	}
+
+	if (useClogHistory)
+		return TransactionIdGetStatusHistory(xid);
+
+	byteptr = clog->shared->page_buffer[slotno] + byteno;
 
 	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
 
 	lsnindex = GetLSNIndex(slotno, xid);
-	*lsn = ClogCtl->shared->group_lsn[lsnindex];
+	*lsn = clog->shared->group_lsn[lsnindex];
 
-	LWLockRelease(CLogControlLock);
+	LWLockRelease(clog->shared->ControlLock);
+
+	return status;
+}
+
+/*
+ * Get state of a transaction from the read-only portion of the clog,
+ * which we refer to as the clog history.
+ *
+ * Code isolated here to more easily allow various implementation options.
+ */
+static XidStatus
+TransactionIdGetStatusHistory(TransactionId xid)
+{
+	SlruCtl		clog = ClogHistoryCtl;
+	int			pageno = TransactionIdToPage(xid);
+	int			byteno = TransactionIdToByte(xid);
+	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
+	int			slotno;
+	char	   *byteptr;
+	XidStatus	status;
+
+	slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid);
+
+	byteptr = clog->shared->page_buffer[slotno] + byteno;
+	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
+
+	/*
+	

Re: [HACKERS] Vacuum rate limit in KBps

2012-01-29 Thread Jeff Janes
On Sun, Jan 15, 2012 at 12:24 AM, Greg Smith g...@2ndquadrant.com wrote:

 If you then turn that equation around, making the maximum write rate the
 input, for any given cost delay and dirty page cost you can solve for the
 cost limit--the parameter in fictitious units everyone hates.  It works like
 this, with the computation internals logged every time they run for now:

 #vacuum_cost_rate_limit = 4000      # maximum write rate in kilobytes/second
 LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20

The computation seems to be suffering from some kind of overflow:

cost limit=50 based on rate limit=1000 KB/s delay=20 dirty cost=20
cost limit=100 based on rate limit=2000 KB/s delay=20 dirty cost=20
cost limit=150 based on rate limit=3000 KB/s delay=20 dirty cost=20
cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
cost limit=250 based on rate limit=5000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=6000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=7000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=8000 KB/s delay=20 dirty cost=20
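For what it's worth, working backwards from the logged values (and assuming
8 kB blocks), the intended relationship seems reproducible with the sketch
below, which also suggests 6000 KB/s should have produced 300 rather than 1,
pointing at an overflow or clamp somewhere upstream. This is a reconstruction,
not the patch's actual code:

#include <stdio.h>

/* rate_kbps / 8 pages per second must equal
 * (1000 / delay_ms) * cost_limit / dirty_cost, solved for cost_limit */
static int
cost_limit_from_rate(int rate_kbps, int delay_ms, int dirty_cost)
{
	return (int) (((double) rate_kbps / 8.0) * dirty_cost * delay_ms / 1000.0);
}

int
main(void)
{
	int			rate;

	/* prints 50, 100, 150, 200, 250, 300, 350, 400 for the rates above */
	for (rate = 1000; rate <= 8000; rate += 1000)
		printf("rate=%d KB/s -> cost limit=%d\n",
			   rate, cost_limit_from_rate(rate, 20, 20));
	return 0;
}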

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cursors FOR UPDATE don't return most recent row

2012-01-29 Thread Alvaro Herrera

Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
  I expected the FETCH to return one row, with the latest data, i.e.
  (1, 3), but instead it's returning empty.
 
 This is the same thing I was complaining about in the bug #6123 thread,
 http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us
 
 It looks a bit ticklish to fix.

Hm.  Okay, I hadn't read that.

In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple that
makes heap_lock_tuple follow the update chain forward when the tuple
being locked is being updated by a concurrent transaction.  I haven't
traced through FETCH to see if it makes sense to apply some of that to
it.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cursors FOR UPDATE don't return most recent row

2012-01-29 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012:
 This is the same thing I was complaining about in the bug #6123 thread,
 http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us

 Hm.  Okay, I hadn't read that.
 In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple that
 makes heap_lock_tuple follow the update chain forward when the tuple
 being locked is being updated by a concurrent transaction.

Um, we do that already, no?  Certainly in READ COMMITTED queries, we
will do so, though it happens at a higher level than heap_lock_tuple.

 I haven't traced through FETCH to see if it makes sense to apply some
 of that to it.

The issue here is what to do when the update came from our *own*
transaction.  In particular I'm a bit worried about avoiding what the
code calls the Halloween problem, namely an infinite loop of re-updating
the same tuple if the scan keeps coming across newer versions.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cursors FOR UPDATE don't return most recent row

2012-01-29 Thread Alvaro Herrera

Excerpts from Tom Lane's message of dom ene 29 22:13:43 -0300 2012:
 
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012:
  This is the same thing I was complaining about in the bug #6123 thread,
  http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us
 
  Hm.  Okay, I hadn't read that.
  In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple that
  makes heap_lock_tuple follow the update chain forward when the tuple
  being locked is being updated by a concurrent transaction.
 
 Um, we do that already, no?  Certainly in READ COMMITTED queries, we
 will do so, though it happens at a higher level than heap_lock_tuple.

Well, it's not quite the same thing.  Consider this isolation spec file:

# When a tuple that has been updated is locked, the locking command
# should traverse the update chain; thus, a DELETE should not be able
# to proceed until the lock has been released.

setup
{
  CREATE TABLE foo (
key int PRIMARY KEY,
value   int
  );

  INSERT INTO foo VALUES (1, 1);
}

teardown
{
  DROP TABLE foo;
}

session s1
step s1b  { BEGIN ISOLATION LEVEL REPEATABLE READ; }
step s1s  { SELECT * FROM foo; }  # obtain snapshot
step s1l  { SELECT * FROM foo FOR KEY SHARE; } # obtain lock
step s1c  { COMMIT; }

session s2
step s2b  { BEGIN; }
step s2u  { UPDATE foo SET value = 2 WHERE key = 1; }
step s2c  { COMMIT; }
step s2d  { DELETE FROM foo WHERE key = 1; }

permutation s1b s2b s1s s2u s1l s2c s2d s1c


Note that session s1 is using repeatable read isolation level, and the
snapshot is older than the update in session s2, so the row it sees is
correctly the old one; however, in order for the delete to honour the
lock (which is necessary for correctness), it has to be propagated up to
tuples that the lock doesn't see itself.  Only the old row is returned;
newer rows are locked too, but not returned.  So they don't get back to
the executor at all.

  I haven't traced through FETCH to see if it makes sense to apply some
  of that to it.
 
 The issue here is what to do when the update came from our *own*
 transaction.  In particular I'm a bit worried about avoiding what the
 code calls the Halloween problem, namely an infinite loop of re-updating
 the same tuple if the scan keeps coming across newer versions.

Hmm.  Since locking rows does not create new versions, I don't quite see
how we could get into such a problem.  A scan should only see each
version once, and will discard all but one due to visibility.  This new
routine of mine only follows the ctids to future versions on updated
tuples; there's no new scan.

If I'm wrong about this, I'd sure like to be aware :-)

The fact that SELECT FOR UPDATE returns empty means that so far I've
been unable to exercise the SelfUpdate case in the new routine.  The
test case I pasted above started working as I intended once I wrote it;
previously, the DELETE would just be allowed to continue immediately
without blocking.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-29 Thread Alexander Korotkov
On Mon, Jan 30, 2012 at 1:39 AM, Jeff Davis pg...@j-davis.com wrote:

 Thank you for the updates. I have a small patch attached.

 The only code change I made was very minor: I changed the constants used
 in the penalty function because your version used INFINITE_BOUND_PENALTY
 when adding an empty range, and that didn't quite make sense to me. If
 I'm mistaken you can leave it as-is.

 I also attached range-gist-test.sql, which I used for a performance
 test. I mix various types of ranges together in a larger table of 1.1M
 tuples. And then I create a smaller table that only contains normal
 ranges and empty ranges. There are two tests:
  1. Create an index on the big table
  2. Do a range join (using overlaps rather than equals) where the
 smaller table is on the outer side of a nested loop join and an index
 scan over the larger table on the inner.

 The index creation time reduces by a small amount with the patch, from
 around 16s without the patch to around 13s with the patch. The query
 time, however, dropped from around 26s to around 14s! Almost 2x speedup
 with the patch!

 Moreover, looking at the loop timing in the explain analyze output, it
 goes from about 7..24 ms per loop down to about 1.5..13 ms per loop.
 That seems to indicate that the index distribution is better, with more
 queries returning quickly.

 So, great work Alexander! Very convincing results.

Great! Thank you for reviewing this patch!


 Marking ready for committer, but please apply my comment fixes at your
 discretion.

Patch with your comment fixes is attached.

-
With best regards,
Alexander Korotkov.


rangetypegist-0.7.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers