Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name
On 25/01/12 18:37, Hitoshi Harada wrote:
>> I'm still not sure whether to just revise (almost) all the SQL function examples to use parameter names, and declare them the right choice; as it's currently written, named parameters still seem rather second-class.
> Agreed.

I'll try a more comprehensive revision of the examples.

> The patch seems ok, except an example I've just found.
>
> db1=# create function t(a int, t t) returns int as $$ select t.a $$ language sql;
> CREATE FUNCTION
> db1=# select t(0, row(1, 2)::t);
>  t
> ---
>  1
> (1 row)
>
> Should we throw an error in such ambiguity? Or did you make it happen intentionally? If the latter, we should also mention the rule in the manual.

I did consider it, and felt it was the most consistent behaviour:

# select t.x, t, z from (select 1) t(x), (select 2) z(t);
 x | t |  z
---+---+-----
 1 | 2 | (2)
(1 row)

I haven't yet managed to find the above behaviour described in the documentation either, though. To me, it feels like an obscure corner case, whose description would leave the rules seeming more complicated than they generally are. Maybe it'd be better suited to an explicit discussion in the code comments?

Thanks, Matthew

-- matt...@trebex.net

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label
2012/1/28 Kohei KaiGai <kai...@kaigai.gr.jp>:
> 2012/1/26 Robert Haas <robertmh...@gmail.com>:
>> On Thu, Jan 26, 2012 at 2:07 PM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>>> 2012/1/26 Robert Haas <robertmh...@gmail.com>:
>>>> I'm wondering if a function would be a better fit than a GUC. I don't think you can really restrict the ability to revert a GUC change - i.e. if someone does a SET and then a RESET, you pretty much have to allow that. I think. But if you expose a function then it can work however you like.
>>> One benefit of using a GUC is that we can utilize the existing mechanism to revert a value set within a transaction block on error. If we implement the same functionality with functions, would XactCallback allow sepgsql to get control at the appropriate timing?
>> Not sure, but I thought the use case was to set this at connection startup time and then hand the connection off to a client. What keeps the client from just issuing RESET?
> In Joshua's use case, the original security label does not have privileges to reference any tables, and it cannot translate to any domain without the credential within the IC-card. Thus, RESET is harmless. However, I also assume one other possible use-case: the originator has privileges to translate to 10 different domains, but must not be allowed to revert the change. In this case, RESET without permission checks is harmful. (This scenario will be suitable for a multi-category model.) Apart from this issue, I found a problem in the implementation using GUC options. It makes the backend crash if it tries to reset sepgsql.client_label without permissions within the error handler, because of double errors.

I found an idea to avoid this scenario. The problematic case is resetting the GUC variable during transaction rollback, which raises a permission violation; however, we don't intend to apply permission checks there, since the value was never committed. Thus, I considered checking the state of the transaction using IsTransactionState() in the assign_hook of the GUC variable.
Unlike a function-based implementation, this makes it possible to utilize the existing infrastructure to make client_label transaction-aware. It would be implemented as:

void
sepgsql_assign_client_label(const char *newval, void *extra)
{
    if (!IsTransactionState())
        return;

    /* check whether newval is a valid security context */
    /* check permission to switch to the new context */
}

How about such an implementation?

Thanks,
-- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 4:57 PM, Robert Haas <robertmh...@gmail.com> wrote:
> But even if you do know that subclassing is intended, that doesn't prove that the particular Archive object is always going to be an ArchiveHandle under the hood. If it is, why not just pass it as an ArchiveHandle to begin with?

I know that you took back some of your comments, but I'm with you here. The Archive is allocated as an ArchiveHandle and then cast back to Archive *, so you always know that an Archive is an ArchiveHandle. I'm all for getting rid of Archive and just using ArchiveHandle throughout pg_dump, which would get rid of these useless casts. I admit that I might have made it a bit worse by adding a few more of these casts, but the fundamental issue was already there, and there is precedent for casting between them in both directions :-)

Joachim
Re: [HACKERS] patch for parallel pg_dump
Joachim Wieland <j...@mcknight.de> writes:
> I know that you took back some of your comments, but I'm with you here. Archive is allocated as an ArchiveHandle and then cast back to Archive *, so you always know that an Archive is an ArchiveHandle. I'm all for getting rid of Archive and just using ArchiveHandle throughout pg_dump, which would get rid of these useless casts.

I'd like to see a more thoroughgoing look at the basic structure of pg_dump. Everybody who's ever looked at that code has found it confusing, with the possible exception of the original author, who is long gone from the project anyway. I don't know exactly what would make it better, but the useless distinction between Archive and ArchiveHandle seems like a minor annoyance, not the core disease. Not that there'd be anything wrong with starting with that.

regards, tom lane
Re: [HACKERS] CLOG contention, part 2
On Sat, Jan 28, 2012 at 1:52 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap.

Will think on that. For me, there are arguments both ways as to whether it should be in shared or local memory. The one factor that makes the answer "shared" for me is that it's much easier to reuse the existing SLRU code. We don't need to invent a new way of caching/access etc.; we just rewire what we already have. So overall, the local/shared debate is much less important than the robustness/code-reuse angle. That's what makes this patch fairly simple.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
Hi Harada-san,

I checked the fdw_helper_funcs_v3.patch, pgsql_fdw_v5.patch and pgsql_fdw_pushdown_v1.patch. My comments are below.

[BUG]
Even though pgsql_fdw tries to push down qualifiers that are executable on the remote side at deparseSql(), it does not remove the pushed-down qualifiers from baserel->baserestrictinfo, so these qualifiers are eventually executed twice. See the result of this EXPLAIN:

postgres=# EXPLAIN SELECT * FROM ft1 WHERE a > 2 AND f_leak(b);
                           QUERY PLAN
----------------------------------------------------------------
 Foreign Scan on ft1  (cost=107.43..122.55 rows=410 width=36)
   Filter: (f_leak(b) AND (a > 2))
   Remote SQL: DECLARE pgsql_fdw_cursor_0 SCROLL CURSOR FOR SELECT a, b FROM public.t1 WHERE (a > 2)
(3 rows)

My expectation is (a > 2) being executed on the remote side and f_leak(b) being executed on the local side. But the filter of the foreign scan on ft1 has both qualifiers. A RestrictInfo has to be removed from the filter once it gets pushed down.

[Design comment]
I'm not sure why store_result() uses MessageContext to save the Tuplestorestate within PgsqlFdwExecutionState. The source code comment says it is used to avoid memory leaks in error cases. I have a similar experience from the implementation of my own fdw module, so I understand that the per-scan context is already cleared by the time of the resource-release callback; thus, handles to external resources have to be saved in a separate memory context. In my personal opinion, the right design is to declare a memory context for pgsql_fdw itself, instead of reusing an existing memory context for an unintended purpose. (A wiser design would be to define a sub-memory-context for each foreign scan, then delete the sub-memory-context after releasing the handles.)

[Design comment]
When should BEGIN be issued on the remote side? connect_pg_server() is the only chance to issue a BEGIN command on the remote side, when the connection is opened. However, the connection is kept open unless an error is raised.
Thus, the remote side will continue to work within a single transaction block, even if the local side iterates over pairs of BEGIN and COMMIT. I'd like to suggest closing the transaction block at the end of the scan, transaction, or sub-transaction.

[Comment to improve]
Also, which transaction isolation level should be specified in this case? One idea is to choose the isolation level according to the current isolation level on the local side. (Of course, it is your choice if it is not supported right now.)

[Comment to improve]
It seems to me the design of exprFunction is not future-proof, if we add a new node type that contains two or more function calls, because it can return only one oid of a function. I think the right design is to handle individual node types within the large switch statement in foreign_expr_walker(). Of course, that is just my sense.

[Comment to improve]
The pgsql_fdw_handler() allocates an FdwRoutine using makeNode(), then sets function pointers on each of its fields. Why don't you return a pointer to a statically declared FdwRoutine variable initialized at compile time, like:

static FdwRoutine pgsql_fdw_routine = {
    .type               = T_FdwRoutine,
    .PlanForeignScan    = pgsqlPlanForeignScan,
    .ExplainForeignScan = pgsqlExplainForeignScan,
    .BeginForeignScan   = pgsqlBeginForeignScan,
    .IterateForeignScan = pgsqlIterateForeignScan,
    .ReScanForeignScan  = pgsqlReScanForeignScan,
    .EndForeignScan     = pgsqlEndForeignScan,
};

Datum
pgsql_fdw_handler(PG_FUNCTION_ARGS)
{
    PG_RETURN_POINTER(&pgsql_fdw_routine);
}

[Question about implementation]
In pgsqlIterateForeignScan(), it applies a null check on festate->tuples and a boolean check on festate->cursor_opened. Do we have a possible scenario in which festate->tuples is not null but festate->cursor_opened is false, or the opposite combination? If the null check on festate->tuples is enough to detect the first call of the iterate callback, it is not my preference to have a redundant flag.
Thanks,

On 14 Dec 2011 15:02, Shigeru Hanada <shigeru.han...@gmail.com> wrote:
> (2011/12/13 14:46), Tom Lane wrote:
>> Shigeru Hanada <shigeru.han...@gmail.com> writes:
>>> Agreed. How about adding a per-column boolean FDW option, say "pushdown", to pgsql_fdw? Users can tell pgsql_fdw that the column can be pushed down safely by setting this option to true.
>> [ itch... ] That doesn't seem like the right level of granularity. ISTM the problem is with whether specific operators have the same meaning at the far end as they do locally. If you try to attach the flag to columns, you have to promise that *every* operator on that column means what it does locally, which is likely to not be the case ever if you look hard enough. Plus, having to set the flag on each individual column of the same datatype seems pretty tedious.
> Indeed, I too think that labeling each column is not the best way, but
Re: [HACKERS] Group commit, revised
On 2012-01-29 01:48, Jeff Janes wrote:
> I ran three modes: head, head with commit_delay, and the group_commit patch.
>
> shared_buffers = 600MB
> wal_sync_method = fsync
> optionally with: commit_delay=5 commit_siblings=1
>
> pgbench -i -s40
> for clients in 1 5 10 15 20 25 30: pgbench -T 30 -M prepared -c $clients -j $clients
>
> Ran 5 times each, taking maximum tps from the repeat runs. The results are impressive:
>
> clients  head  head_commit_delay  group_commit
>  1       23.9   23.0               23.9
>  5       31.0   51.3               59.9
> 10       35.0   56.5               95.7
> 15       35.6   64.9              136.4
> 20       34.3   68.7              159.3
> 25       36.5   64.1              168.8
> 30       37.2   83.8               71.5
>
> I haven't inspected that deep fall-off at 30 clients for the patch.
>
> By way of reference, if I turn off synchronous commit, I get tps=1245.8, which is 100% CPU limited. This sets a theoretical upper bound on what could be achieved by the best possible group-committing method.
>
> If the group_commit patch goes in, would we then rip out commit_delay and commit_siblings?

Adding to the list of tests that aren't exactly a real-world system, I decided to repeat Jeff's tests on an Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz with 4GB of memory and an Intel X25-M 160GB SSD drive underneath.

     Baseline  Commit delay  Group commit
 1    1168.67   1233.33       1212.67
 5    2611.33   3022.00       2647.67
10    3044.67       .33       3296.33
15    3153.33   3177.00       3456.67
20    3087.33   3126.33       3618.67
25    2715.00   2359.00       3309.33
30    2736.33   2831.67       2737.67

Numbers are the average over 3 runs. I have set checkpoint_segments to 30; otherwise the same configuration as Jeff. Attached is a graph. A nice conclusion is: group commit outperforms baseline in all runs (on this system). My purpose was actually more to try to quantify the difference between a single SSD and a single rotating disk.

-- Jesper

attachment: pgbench.png
[HACKERS] PGCon 2012 Call for Papers - extension
We apologize that http://www.bsdcan.org/ was offline for 12 hours from early Sunday morning. The deadline for submissions has been extended to Tuesday 31 January.

PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of Ottawa. It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks. Proposals can be quite simple. We do not require academic-style papers. If you are doing something interesting with PostgreSQL, please submit a proposal. You might be one of the backend hackers or work on a PostgreSQL related project and want to share your know-how with others. You might be developing an interesting system using PostgreSQL as the foundation. Perhaps you migrated from another database to PostgreSQL and would like to share details. These, and other stories are welcome. Both users and developers are encouraged to share their experiences.

Here are some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- the latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software
The schedule is:

 8 Jan 2012  Proposal acceptance begins
31 Jan 2012  Proposal acceptance ends
19 Feb 2012  Confirmation of accepted proposals

See also http://www.pgcon.org/2012/papers.php

Instructions for submitting a proposal to PGCon 2012 are available from: http://www.pgcon.org/2012/submissions.php

-- Dan Langille - http://langille.org
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> Yes, it was. Sorry about that. New version attached, retesting while you read this.
> In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large:
>
> /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
> if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>     ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Thanks for looking at the patch.

The patch works fine. RecentXminPageno does move forwards as it is supposed to, and there are no uints anywhere in that calculation. The pageno only moves forwards every 32,000 transactions, so I'm guessing that your testing didn't go on for long enough to show it working correctly.

As regards effectiveness, you need to execute more than 1 million transactions before the main clog cache fills, which might sound like a lot, but it is approximately 1 minute of heavy transactions at the highest rate Robert has published. I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Group commit, revised
On 01/28/2012 07:48 PM, Jeff Janes wrote:
> Others are going to test this out on high-end systems. I wanted to try it out on the other end of the scale. I've used a Pentium 4, 3.2GHz, with 2GB of RAM and with a single IDE drive running ext4. ext4 is amazingly bad on IDE, giving about 25 fsyncs per second (and it lies about fdatasync, but apparently not about fsync).

Fantastic. I had to stop and check the date on your message for a second there, to make sure it hadn't come from some mail server that's been backed up on delivery for the last five years. I'm cleaning house toward testing this out here, and I was going to test on the same system using both fast and horribly slow drives. Both ends of the scale are important, and they benefit in a very different way from these changes.

> I haven't inspected that deep fall-off at 30 clients for the patch.
>
> By way of reference, if I turn off synchronous commit, I get tps=1245.8, which is 100% CPU limited. This sets a theoretical upper bound on what could be achieved by the best possible group-committing method.

This sort of thing is why I suspect that to completely isolate some results, we're going to need a moderately high-end server--with lots of cores--combined with an intentionally mismatched slow drive. It's too easy to get pgbench and/or PostgreSQL to choke on something other than I/O when using smaller core counts. I don't think I have anything where the floor is 24 TPS per client though. Hmmm...I think I can connect an IDE drive to my MythTV box and format it with ext4. Thanks for the test idea.

One thing you could try on this system is using -N ("Do not update pgbench_tellers and pgbench_branches"). That eliminates a lot of the contention that might be pulling down your higher client count tests, while still giving a completely valid test of whether the group commit mechanism works. Not sure whether that will push up the top end usefully for you; worth a try if you have time to test again.
> If the group_commit patch goes in, would we then rip out commit_delay and commit_siblings?

The main reason those are still hanging around at all is to allow pushing on the latency vs. throughput trade-off on really busy systems. The use case is that you expect, say, 10 clients to constantly be committing at a high rate. So if there's just one committing so far, assume it's the leading edge of a wave and pause a bit for the rest to come in. I don't think the cases where this is useful behavior--people both want it and the current mechanism provides it--are very common in the real world. It can be useful for throughput-oriented benchmarks though, which is why I'd say it hasn't been killed off yet. We'll have to see whether the final form this makes sense in will usefully replace that sort of thing. I'd certainly be in favor of nuking commit_delay and commit_siblings with a better option; it would be nice if we don't eliminate this tuning option in the process though.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
On Tue, 2012-01-24 at 16:07 +0400, Alexander Korotkov wrote: Hi! New version of patch is attached. Thank you for the updates. I have a small patch attached. The only code change I made was very minor: I changed the constants used in the penalty function because your version used INFINITE_BOUND_PENALTY when adding an empty range, and that didn't quite make sense to me. If I'm mistaken you can leave it as-is. I also attached range-gist-test.sql, which I used for a performance test. I mix various types of ranges together in a larger table of 1.1M tuples. And then I create a smaller table that only contains normal ranges and empty ranges. There are two tests: 1. Create an index on the big table 2. Do a range join (using overlaps rather than equals) where the smaller table is on the outer side of a nested loop join and an index scan over the larger table on the inner. The index creation time reduces by a small amount with the patch, from around 16s without the patch to around 13s with the patch. The query time, however, dropped from around 26s to around 14s! Almost 2x speedup with the patch! Moreover, looking at the loop timing in the explain analyze output, it goes from about 7..24 ms per loop down to about 1.5..13 ms per loop. That seems to indicate that the index distribution is better, with more queries returning quickly. So, great work Alexander! Very convincing results. Marking ready for committer, but please apply my comment fixes at your discretion. Regards, Jeff Davis PS: the test was run on my workstation (Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz) with work_mem=512MB, shared_buffers=512MB, and checkpoint_segments=32. The rest of the settings were default. 
range-gist-comments.patch.gz Description: GNU Zip compressed data

\timing on
drop table big;
drop table small;

create temp table tmp_foo(i int, ir int8range);
insert into tmp_foo select g % 100, 'empty'::int8range from generate_series(1,5) g;
insert into tmp_foo select g % 100, int8range(NULL,NULL) from generate_series(1,1) g;
insert into tmp_foo select g % 100, int8range(NULL,((random()-0.5)*g*10)::int8) from generate_series(1,2) g;
insert into tmp_foo select g % 100, int8range(((random()-0.5)*g*10)::int8,NULL) from generate_series(1,2) g;
insert into tmp_foo select g % 100, int8range( (g*10 + 10*(random()-0.5))::int8, (g*10 + 10 + 10*(random()-0.5))::int8 ) from generate_series(1,100) g;
create table big as select * from tmp_foo order by random();
drop table tmp_foo;

create table tmp_foo(i int, ir int8range);
insert into tmp_foo select g*1000 % 100, 'empty'::int8range from generate_series(1,50) g;
insert into tmp_foo select g*1000 % 100, int8range( (g*10*1000 + 10*(random()-0.5))::int8, (g*10*1000 + 10 + 10*(random()-0.5))::int8 ) from generate_series(1,1000) g;
create table small as select * from tmp_foo order by random();
drop table tmp_foo;

vacuum;
vacuum;
vacuum;

create index big_idx on big using gist (ir);
analyze;

set enable_bitmapscan=false;
explain analyze select sum(upper(intersection) - lower(intersection)) from ( select small.ir * big.ir as intersection from small, big where small.ir && big.ir and not lower_inf(big.ir) and not upper_inf(big.ir) ) s;
set enable_bitmapscan to default;

set enable_indexscan=false;
explain analyze select sum(upper(intersection) - lower(intersection)) from ( select small.ir * big.ir as intersection from small, big where small.ir && big.ir and not lower_inf(big.ir) and not upper_inf(big.ir) ) s;
set enable_indexscan to default;
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>> Yes, it was. Sorry about that. New version attached, retesting while you read this.
>> In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large:
>>
>> /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>> if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>>     ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);
> Thanks for looking at the patch. The patch works fine. RecentXminPageno does move forwards as it is supposed to and there are no uints anywhere in that calculation.

Maybe it is system dependent. Or, are you running this patch on top of some other uncommitted patch (other than the pgbench one)?

RecentXmin is a TransactionId, which is a uint32. I think the TransactionIdToPage macro preserves that.

If I cast to an int, then I see advancement:

if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin)) ...

> I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch.

Yep, I've used that one for the testing.

Cheers, Jeff
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Thu, Jan 26, 2012 at 2:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Jan 21, 2012 at 10:11 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> Your caution is wise. All users of an index have already checked whether the index is usable at plan time, so although there is much code that assumes they can look at the index itself, that is not executed until after the correct checks. I'll look at VACUUM and other utilities, so thanks for that. I don't see much point in having the higher level lock, except perhaps as a test that this code works.
> I thought of another way this can cause a problem: suppose that while we're dropping the relation with only ShareUpdateExclusiveLock, we get as far as calling DropRelFileNodeBuffers. Just after we finish, some other process that holds AccessShareLock or RowShareLock or RowExclusiveLock reads and perhaps even dirties a page in the relation.

I can't see any way that situation can occur. The patch *explicitly* waits until all people that can see the index as usable have dropped their lock. So I don't think this is necessary. Having said that, since we are talking about the index and not the whole table, if I believe the above statement then I can't have any reasonable objection to doing as you suggest.

The patch now locks the index in AccessExclusiveLock in the final stage of the drop. v3 attached. If you have suggestions to improve the grammar issues, they're most welcome. Otherwise this seems good to go.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/doc/src/sgml/ref/drop_index.sgml b/doc/src/sgml/ref/drop_index.sgml index 7177ef2..aeb1531 100644 --- a/doc/src/sgml/ref/drop_index.sgml +++ b/doc/src/sgml/ref/drop_index.sgml @@ -21,7 +21,9 @@ PostgreSQL documentation refsynopsisdiv synopsis -DROP INDEX [ IF EXISTS ] replaceable class=PARAMETERname/replaceable [, ...] [ CASCADE | RESTRICT ] +DROP INDEX + [ IF EXISTS ] replaceable class=PARAMETERname/replaceable [, ...] 
[ CASCADE | RESTRICT ] + CONCURRENTLY replaceable class=PARAMETERname/replaceable /synopsis /refsynopsisdiv @@ -50,6 +52,30 @@ DROP INDEX [ IF EXISTS ] replaceable class=PARAMETERname/replaceable [, .. /varlistentry varlistentry +termliteralCONCURRENTLY/literal/term +listitem + para + When this option is used, productnamePostgreSQL/ will drop the + index without taking any locks that prevent concurrent selects, inserts, + updates, or deletes on the table; whereas a standard index drop + waits for a lock that locks out everything on the table until it's done. + Concurrent drop index is a two stage process. First, we mark the index + both invalid and not ready then commit the change. Next we wait until + there are no users locking the table who can see the index. + /para + para + There are several caveats to be aware of when using this option. + Only one index name can be specified if the literalCONCURRENTLY/literal + parameter is specified. Only one concurrent index drop can occur on a + table at a time and no modifications on the table are allowed meanwhile. + Regular commandDROP INDEX/ command can be performed within + a transaction block, but commandDROP INDEX CONCURRENTLY/ cannot. + There is no CASCADE option when dropping an index concurrently. 
+ /para +/listitem + /varlistentry + + varlistentry termreplaceable class=PARAMETERname/replaceable/term listitem para diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index db86262..58d8f5c 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -173,8 +173,8 @@ static void reportDependentObjects(const ObjectAddresses *targetObjects, const ObjectAddress *origObject); static void deleteOneObject(const ObjectAddress *object, Relation depRel, int32 flags); -static void doDeletion(const ObjectAddress *object); -static void AcquireDeletionLock(const ObjectAddress *object); +static void doDeletion(const ObjectAddress *object, int flags); +static void AcquireDeletionLock(const ObjectAddress *object, int flags); static void ReleaseDeletionLock(const ObjectAddress *object); static bool find_expr_references_walker(Node *node, find_expr_references_context *context); @@ -232,7 +232,7 @@ performDeletion(const ObjectAddress *object, * Acquire deletion lock on the target object. (Ideally the caller has * done this already, but many places are sloppy about it.) */ - AcquireDeletionLock(object); + AcquireDeletionLock(object, 0); /* * Construct a list of objects to delete (ie, the given object plus @@ -316,7 +316,7 @@ performMultipleDeletions(const ObjectAddresses *objects, * Acquire deletion lock on each target object. (Ideally the caller * has done this already, but many places are sloppy about it.) */ - AcquireDeletionLock(thisobj); +
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 1:41 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>> On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>>> Yes, it was. Sorry about that. New version attached, retesting while you read this.
>>> In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large:
>>>
>>> /* When we commit advance ClogCtl's shared RecentXminPageno if needed */
>>> if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
>>>     ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);
>> Thanks for looking at the patch. The patch works fine. RecentXminPageno does move forwards as it is supposed to and there are no uints anywhere in that calculation.
> Maybe it is system dependent. Or, are you running this patch on top of some other uncommitted patch (other than the pgbench one)?
>
> RecentXmin is a TransactionId, which is a uint32. I think the TransactionIdToPage macro preserves that.
If I cast to an int, then I see advancement:

	if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin))

And to clarify, if I don't do the cast, I don't see advancement, using this code:

	elog(LOG, "JJJ RecentXminPageno %d, %d",
		 ClogCtl->shared->RecentXminPageno,
		 TransactionIdToPage(RecentXmin));
	if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
		ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Then using your pgbench -I -s 100 -c 8 -j 8, I get tons of log entries like:

LOG: JJJ RecentXminPageno -1, 149
STATEMENT: INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (nextval('pgbench_accounts_load_seq'), 1 + (lastval()/(10)), 0);

Cheers, Jeff

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: If I cast to an int, then I see advancement:

I'll initialise it as 0 rather than -1, and then we don't have a problem in any circumstance.

I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch.

Yep, I've used that one for the testing.

Most of the current patch is just bookkeeping to keep track of the point when we can look at history in a read-only manner. I've isolated the code better to allow you to explore various implementation options. I don't see any performance difference between any of them really, but you're welcome to look.

Please everybody note that the clog history doesn't even become active until the first checkpoint, so this is dead code until we've hit the first checkpoint cycle and completed a million transactions since startup. So it's designed to tune for real-world situations, and is not easy to benchmark. (Maybe we could start earlier, but having extra code just for the first few minutes seems a waste of energy, especially since we must hit a million xids also.)

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 69b6ef3..8ab1b3c 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,7 @@
 #include "access/transam.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "utils/snapmgr.h"

 /*
  * Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -70,12 +71,19 @@

 /*
  * Link to shared-memory data structures for CLOG control
+ *
+ * As of 9.2, we have 2 structures for commit log data.
+ * ClogCtl manages the main read/write part of the commit log, while
+ * the ClogHistoryCtl manages the now read-only, older part. ClogHistory
+ * removes contention from the path of transaction commits.
+ */
 static SlruCtlData ClogCtlData;
+static SlruCtlData ClogHistoryCtlData;

-#define ClogCtl (&ClogCtlData)
-
+#define ClogCtl (&ClogCtlData)
+#define ClogHistoryCtl (&ClogHistoryCtlData)

+static XidStatus TransactionIdGetStatusHistory(TransactionId xid);
 static int	ZeroCLOGPage(int pageno, bool writeXlog);
 static bool CLOGPagePrecedes(int page1, int page2);
 static void WriteZeroPageXlogRec(int pageno);
@@ -296,6 +304,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids,

 	/* ... then the main transaction */
 	TransactionIdSetStatusBit(xid, status, lsn, slotno);
+
+	/* When we commit advance ClogCtl's shared RecentXminPageno if needed */
+	if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
+		ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);
 }

 	/* Set the subtransactions */
@@ -387,6 +399,7 @@ TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i
 XidStatus
 TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 {
+	bool		useClogHistory = true;
 	int			pageno = TransactionIdToPage(xid);
 	int			byteno = TransactionIdToByte(xid);
 	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
@@ -397,15 +410,64 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)

 	/* lock is acquired by SimpleLruReadPage_ReadOnly */

-	slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
-	byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
+	/*
+	 * Decide whether to use main Clog or read-only ClogHistory.
+	 *
+	 * Our knowledge of the boundary between the two may be a little out
+	 * of date, so if we try Clog and can't find it we need to try again
+	 * against ClogHistory.
+	 */
+	if (pageno >= ClogCtl->recent_oldest_active_page_number)
+	{
+		slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
+		if (slotno >= 0)
+			useClogHistory = false;
+	}
+
+	if (useClogHistory)
+		return TransactionIdGetStatusHistory(xid);
+
+	byteptr = clog->shared->page_buffer[slotno] + byteno;
 	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;

 	lsnindex = GetLSNIndex(slotno, xid);
-	*lsn = ClogCtl->shared->group_lsn[lsnindex];
+	*lsn = clog->shared->group_lsn[lsnindex];

-	LWLockRelease(CLogControlLock);
+	LWLockRelease(clog->shared->ControlLock);
+
+	return status;
+}
+
+/*
+ * Get state of a transaction from the read-only portion of the clog,
+ * which we refer to as the clog history.
+ *
+ * Code isolated here to more easily allow various implementation options.
+ */
+static XidStatus
+TransactionIdGetStatusHistory(TransactionId xid)
+{
+	SlruCtl		clog = ClogHistoryCtl;
+	int			pageno = TransactionIdToPage(xid);
+	int			byteno = TransactionIdToByte(xid);
+	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
+	int			slotno;
+	char	   *byteptr;
+	XidStatus	status;
+
+	slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid);
+
+	byteptr = clog->shared->page_buffer[slotno] + byteno;
+	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
+
+	/*
+
Re: [HACKERS] Vacuum rate limit in KBps
On Sun, Jan 15, 2012 at 12:24 AM, Greg Smith g...@2ndquadrant.com wrote: If you then turn that equation around, making the maximum write rate the input, for any given cost delay and dirty page cost you can solve for the cost limit--the parameter in fictitious units everyone hates. It works like this, with the computation internals logged every time they run for now:

#vacuum_cost_rate_limit = 4000	# maximum write rate in kilobytes/second

LOG: cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20

The computation seems to be suffering from some kind of overflow:

cost limit=50 based on rate limit=1000 KB/s delay=20 dirty cost=20
cost limit=100 based on rate limit=2000 KB/s delay=20 dirty cost=20
cost limit=150 based on rate limit=3000 KB/s delay=20 dirty cost=20
cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
cost limit=250 based on rate limit=5000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=6000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=7000 KB/s delay=20 dirty cost=20
cost limit=1 based on rate limit=8000 KB/s delay=20 dirty cost=20

Cheers, Jeff
Re: [HACKERS] cursors FOR UPDATE don't return most recent row
Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012:

Alvaro Herrera alvhe...@alvh.no-ip.org writes: I expected the FETCH to return one row, with the latest data, i.e. (1, 3), but instead it's returning empty.

This is the same thing I was complaining about in the bug #6123 thread, http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us It looks a bit ticklish to fix.

Hm. Okay, I hadn't read that. In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple function that makes heap_lock_tuple follow the update chain forward when the tuple being locked is being updated by a concurrent transaction. I haven't traced through FETCH to see if it makes sense to apply some of that to it.

-- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] cursors FOR UPDATE don't return most recent row
Alvaro Herrera alvhe...@commandprompt.com writes:

Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012: This is the same thing I was complaining about in the bug #6123 thread, http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us

Hm. Okay, I hadn't read that. In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple that makes heap_lock_tuple follow the update chain forward when the tuple being locked is being updated by a concurrent transaction.

Um, we do that already, no? Certainly in READ COMMITTED queries, we will do so, though it happens at a higher level than heap_lock_tuple.

I haven't traced through FETCH to see if it makes sense to apply some of that to it.

The issue here is what to do when the update came from our *own* transaction. In particular I'm a bit worried about avoiding what the code calls the Halloween problem, namely an infinite loop of re-updating the same tuple if the scan keeps coming across newer versions.

regards, tom lane
Re: [HACKERS] cursors FOR UPDATE don't return most recent row
Excerpts from Tom Lane's message of dom ene 29 22:13:43 -0300 2012:

Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Tom Lane's message of sáb ene 28 01:35:33 -0300 2012: This is the same thing I was complaining about in the bug #6123 thread, http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us

Hm. Okay, I hadn't read that. In my FOR KEY SHARE patch I have added a heap_lock_updated_tuple that makes heap_lock_tuple follow the update chain forward when the tuple being locked is being updated by a concurrent transaction.

Um, we do that already, no? Certainly in READ COMMITTED queries, we will do so, though it happens at a higher level than heap_lock_tuple.

Well, it's not quite the same thing. Consider this isolation spec file:

# When a tuple that has been updated is locked, the locking command
# should traverse the update chain; thus, a DELETE should not be able
# to proceed until the lock has been released.

setup
{
	CREATE TABLE foo (
		key		int PRIMARY KEY,
		value	int
	);

	INSERT INTO foo VALUES (1, 1);
}

teardown
{
	DROP TABLE foo;
}

session "s1"
step "s1b"	{ BEGIN ISOLATION LEVEL REPEATABLE READ; }
step "s1s"	{ SELECT * FROM foo; }	# obtain snapshot
step "s1l"	{ SELECT * FROM foo FOR KEY SHARE; }	# obtain lock
step "s1c"	{ COMMIT; }

session "s2"
step "s2b"	{ BEGIN; }
step "s2u"	{ UPDATE foo SET value = 2 WHERE key = 1; }
step "s2c"	{ COMMIT; }
step "s2d"	{ DELETE FROM foo WHERE key = 1; }

permutation "s1b" "s2b" "s1s" "s2u" "s1l" "s2c" "s2d" "s1c"

Note that session s1 is using repeatable read isolation level, and the snapshot is older than the update in session s2, so the row it sees is correctly the old one; however, in order for the delete to honour the lock (which is necessary for correctness), the lock has to be propagated up to tuples that it doesn't see itself. Only the old row is returned; newer rows are locked too, but not returned, so they don't get back to the executor at all.

I haven't traced through FETCH to see if it makes sense to apply some of that to it.
The issue here is what to do when the update came from our *own* transaction. In particular I'm a bit worried about avoiding what the code calls the Halloween problem, namely an infinite loop of re-updating the same tuple if the scan keeps coming across newer versions.

Hmm. Since locking rows does not create new versions, I don't quite see how we could get into such a problem. A scan should only see each version once, and will discard all but one due to visibility. This new routine of mine only follows the ctids to future versions of updated tuples; there's no new scan. If I'm wrong about this, I'd sure like to be aware :-)

The fact that SELECT FOR UPDATE returns empty means that so far I've been unable to exercise the SelfUpdate case in the new routine. The test case I pasted above started working as I intended as soon as I wrote the new routine; previously, the DELETE would just be allowed to continue immediately without blocking.

-- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
On Mon, Jan 30, 2012 at 1:39 AM, Jeff Davis pg...@j-davis.com wrote:

Thank you for the updates. I have a small patch attached. The only code change I made was very minor: I changed the constants used in the penalty function because your version used INFINITE_BOUND_PENALTY when adding an empty range, and that didn't quite make sense to me. If I'm mistaken you can leave it as-is.

I also attached range-gist-test.sql, which I used for a performance test. I mix various types of ranges together in a larger table of 1.1M tuples, and then create a smaller table that only contains normal ranges and empty ranges. There are two tests:

1. Create an index on the big table.
2. Do a range join (using overlaps rather than equals) where the smaller table is on the outer side of a nested loop join and an index scan over the larger table is on the inner.

The index creation time reduces by a small amount with the patch, from around 16s without the patch to around 13s with it. The query time, however, dropped from around 26s to around 14s -- almost a 2x speedup with the patch! Moreover, looking at the loop timing in the explain analyze output, it goes from about 7..24 ms per loop down to about 1.5..13 ms per loop. That seems to indicate that the index distribution is better, with more queries returning quickly.

So, great work Alexander! Very convincing results.

Great! Thank you for reviewing this patch!

Marking ready for committer, but please apply my comment fixes at your discretion.

Patch with your comment fixes is attached.

- With best regards, Alexander Korotkov.

rangetypegist-0.7.patch.gz
Description: GNU Zip compressed data