Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 17:53:10 -0400, Robert Haas wrote: On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote:

The code in CHashSearch shows the problem there: you need to STORE the hazard pointer before you begin to do the LOAD operations to scan the bucket, and you must finish all of those LOADs before you STORE the NULL hazard pointer. A read or write barrier won't provide store/load or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier should suffice? To avoid falling back to two full barriers, we'd need a separate define pg_read_write_barrier(), but that's not a problem. IIUC that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least one mfence, after setting the hazard pointer and before beginning to read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite expensive... I think I can see some ways around needing that. But that needs some benchmarking first - no need to build an even more complex solution if it's not needed.

The solution I'm thinking of is essentially to move away from hazard pointers and store something like a generation counter per backend, which is updated less often and in combination with other operations. When a backend needs to clean up and sees that there's a straggler with an old generation, it sends that backend a signal to ensure it sets the latest generation.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Better support of exported snapshots with pg_dump
On Wed, Oct 15, 2014 at 2:46 PM, Andres Freund and...@2ndquadrant.com wrote:

This seems more user-friendly. But well, I agree that we could do a larger set of things that could be used for other purposes as well: - Ability to define the snapshot name with pg_dump - Take a system- or database-wide lock - Extra client application running the whole time. Now, is this set of features worth doing, knowing that snapshot export has been designed for multi-threaded closed applications? I'm not so sure. Regards,

What do you mean by "designed for multi-threaded closed applications"?

External snapshot creation and control should be localized within dedicated client applications only. At least that's what I understand from it, as that's how it is used now. -- Michael
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Oct 14, 2014 at 3:32 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:

After that I observed that contention for LW_SHARED has reduced for this load, but it didn't help much in terms of performance, so I again rechecked the profile and this time most of the contention has moved to the spinlock used in dynahash for the buffer mapping table; please refer to the profile (for 128 client count; read-only load) below: bgreclaimer patch + wait-free lw_shared acquisition patches -

This profile is without -O2 again. I really don't think it makes much sense to draw much inference from an unoptimized build.

Profile data with -O2 is below. This shows that the top contributors are calls to BufTableLookup and the spinlock contention caused by BufTableInsert and BufTableDelete. To resolve the spinlock contention, a patch like the above might prove to be useful (although I still have to evaluate it). I would also like to take LWLOCK_STATS data once before proceeding further. Do you have any other ideas?

  11.17%  swapper   [unknown]          [H] 0x011e0328
+  4.62%  postgres  postgres           [.] hash_search_with_hash_value
+  4.35%  pgbench   [kernel.kallsyms]  [k] .find_busiest_group
+  3.71%  postgres  postgres           [.] s_lock
   2.56%  postgres  [unknown]          [H] 0x01500120
+  2.23%  pgbench   [kernel.kallsyms]  [k] .idle_cpu
+  1.97%  postgres  postgres           [.] LWLockAttemptLock
+  1.73%  postgres  postgres           [.] LWLockRelease
+  1.47%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user_power7
+  1.44%  postgres  postgres           [.] GetSnapshotData
+  1.28%  postgres  postgres           [.] _bt_compare
+  1.04%  swapper   [kernel.kallsyms]  [k] .int_sqrt
+  1.04%  postgres  postgres           [.] AllocSetAlloc
+  0.97%  pgbench   [kernel.kallsyms]  [k] .default_wake_function

Detailed Data -

-  4.62%  postgres  postgres  [.] hash_search_with_hash_value
   - hash_search_with_hash_value
      - 2.19% BufTableLookup
         - 2.15% BufTableLookup
              ReadBuffer_common
            - ReadBufferExtended
               - 1.32% _bt_relandgetbuf
      - 0.73% BufTableDelete
         - 0.71% BufTableDelete
              ReadBuffer_common
              ReadBufferExtended
      - 0.69% BufTableInsert
         - 0.68% BufTableInsert
              ReadBuffer_common
              ReadBufferExtended
        0.66% hash_search_with_hash_value
-  4.35%  pgbench  [kernel.kallsyms]  [k] .find_busiest_group
   - .find_busiest_group
      - 4.28% .find_busiest_group
         - 4.26% .load_balance
            - 4.26% .idle_balance
                 .__schedule
            - 4.26% .schedule_hrtimeout_range_clock
                 .do_select
                 .core_sys_select
-  3.71%  postgres  postgres  [.] s_lock
   - s_lock
      - 3.19% hash_search_with_hash_value
         - 3.18% hash_search_with_hash_value
            - 1.60% BufTableInsert
                 ReadBuffer_common
               - ReadBufferExtended
            - 1.57% BufTableDelete
                 ReadBuffer_common
               - ReadBufferExtended
      - 0.93% index_fetch_heap

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Buffer Requests Trace
On 14 October 2014 17:08, Lucas Lersch lucasler...@gmail.com wrote: Unfortunately, in the generated trace with over 2 million buffer requests, only ~14k different pages are being accessed, out of the 800k of the whole database. Am I missing something here?

We can't tell what you're doing just by knowing the number of unique items in your list. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Support UPDATE table SET(*)=...
Hi All, Please find attached a patch which implements support for UPDATE table1 SET(*)=... The patch supports both UPDATE table SET(*)=(a,b,c) and UPDATE table1 SET(*)=(SELECT a,b,c FROM...). It solves the problem of doing UPDATE from a record variable of the same type as the table e.g. update foo set (*) = (select foorec.*) where ...; The design is simple. It basically expands the * in transformation stage, does the necessary type checking and adds it to the parse tree. This allows for normal execution for the rest of the stages. Thoughts/Comments? Regards, Atri updatestar_ver1.patch Description: application/download
Re: [HACKERS] Column Redaction
On 14 October 2014 17:43, Robert Haas robertmh...@gmail.com wrote: On Sat, Oct 11, 2014 at 3:40 AM, Simon Riggs si...@2ndquadrant.com wrote:

As soon as you issue the above query, you have clearly indicated your intention to steal. Receiving information is no longer accidental, it is an explicit act that is logged in the auditing system against your name. This is sufficient to bury you in court and it is now a real deterrent. Redaction has worked.

To me, this feels thin. It's true that this might be good enough for some users, but I wouldn't bet on it being good enough for very many users, and I really hope there's a better option. We have an existing method of doing data redaction via security barrier views.

I agree with "thin". There is a leak in the design, so let me coin the phrase "imprecise security". Of course, the leaks reduce the value of such a feature; they just don't reduce it all the way to zero.

Security barrier views, or views of any kind, don't do the required job. We are not able to easily classify people as Trusted or Untrusted. We're seeking to differentiate between the right to use a column for queries and the right to see the value itself. Or put another way, you can read the book, you just can't photocopy it and take the copy home. Or, you can try on the new clothes to see if they fit, but you can't take them home for free. Both of those examples have imprecise security measures in place to control and reduce negative behaviours, and in every other industry this is known as security. In IT terms, we're looking at controlling and reducing improper access to data by an otherwise Trusted person. The only problem is that some actions on data items are allowed, while others are not.
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 15 October 2014 05:13, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes:

For parallelism, I think we need a concept of group locking. That is, suppose we have a user backend and N worker backends collaborating to execute some query. For the sake of argument, let's say it's a parallel CLUSTER, requiring a full table lock. We need all of the backends to be able to lock the table even though at least one of them holds AccessExclusiveLock. This suggests that the backends should all be members of a locking group, and that locks within the same locking group should be regarded as mutually non-conflicting.

In the background worker case, I imagined that the foreground process would hold a lock and the background processes would just assume they could access the table without holding locks of their own. Aren't you building a mountain where a molehill would do?

Yeh. Locks should be made in the name of the main transaction and released by it. When my family goes to a restaurant, any member of the party may ask for a table and the request is granted for the whole family. But the lock is released only when I pay the bill. Once we have the table, any stragglers know we have locked the table and they just come and sit at the table without needing to make their own lock request to the Maitre D', though they clearly cache the knowledge that we have the table locked.

So all lock requests held until EOX should be made in the name of the top-level process. Any child process wanting a lock should request it, but on discovering it is already held at parent level should just update the local lock table. Transient locks, like catalog locks, can be made and released locally; I think there is more detail there but it shouldn't affect the generalisation.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Support UPDATE table SET(*)=...
Hi On Wed, Oct 15, 2014 at 11:02 AM, Atri Sharma atri.j...@gmail.com wrote: Please find attached a patch which implements support for UPDATE table1 SET(*)=... I presume you haven't read Tom Lane's proposal and discussion about multiple column assignment in UPDATE: http://www.postgresql.org/message-id/1783.1399054...@sss.pgh.pa.us (Assigning all columns was also discussed there) And there's a WIP patch: http://www.postgresql.org/message-id/20930.1402931...@sss.pgh.pa.us Regards, Marti
Re: [HACKERS] Support UPDATE table SET(*)=...
On Wednesday, October 15, 2014, Marti Raudsepp ma...@juffo.org wrote: Hi On Wed, Oct 15, 2014 at 11:02 AM, Atri Sharma atri.j...@gmail.com wrote: Please find attached a patch which implements support for UPDATE table1 SET(*)=... I presume you haven't read Tom Lane's proposal and discussion about multiple column assignment in UPDATE: http://www.postgresql.org/message-id/1783.1399054...@sss.pgh.pa.us (Assigning all columns was also discussed there) And there's a WIP patch: http://www.postgresql.org/message-id/20930.1402931...@sss.pgh.pa.us Thanks for the links, but this patch only targets the SET(*) case, which, if I understand correctly, the patch you mentioned doesn't directly handle (if I understand correctly, the targets of the two patches are different). Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Support UPDATE table SET(*)=...
On Wed, Oct 15, 2014 at 2:18 PM, Atri Sharma atri.j...@gmail.com wrote: On Wednesday, October 15, 2014, Marti Raudsepp ma...@juffo.org wrote: Hi On Wed, Oct 15, 2014 at 11:02 AM, Atri Sharma atri.j...@gmail.com wrote: Please find attached a patch which implements support for UPDATE table1 SET(*)=... I presume you haven't read Tom Lane's proposal and discussion about multiple column assignment in UPDATE: http://www.postgresql.org/message-id/1783.1399054...@sss.pgh.pa.us (Assigning all columns was also discussed there) And there's a WIP patch: http://www.postgresql.org/message-id/20930.1402931...@sss.pgh.pa.us Thanks for the links, but this patch only targets SET(*) case, which, if I understand correctly, the patch you mentioned doesn't directly handle (If I understand correctly, the target of the two patches is different). Digging more, I figured that the patch I posted builds on top of Tom's patch, since it did not add whole row cases. Regards, Atri
Re: [HACKERS] proposal: plpgsql - Assert statement
2014-09-30 10:04 GMT+07:00 Jim Nasby j...@nasby.net: On 9/17/14, 7:40 PM, Jan Wieck wrote:

Exactly. Doing something like ASSERT (select count(*) from foo where fk not in (select pk from bar)) = 0; is a perfectly fine, arbitrary boolean expression. It will probably work well in a development environment too. And I am very sure that it will not scale well once that code gets deployed. And I know how DBAs react to the guaranteed following performance problem. They will disable ALL asserts ... or was there some sort of assert class system proposed that I missed?

Actually, compared with, for example, Java or C, in production systems usually all asserts are disabled for performance (in Java removed by the JIT, in C we define NDEBUG).

We're also putting too much weight on the term "assert" here. C-style asserts are generally not nearly as useful in a database as general sanity-checking or error handling, especially if you're trying to use the database to enforce data sanity.

+1. Without any query capability, assert will become much less useful. If we cannot query in assert, we will code: -- perform some query ASSERT WHEN some_check_on_query_result; .. and disabling the query in a production system will become another problem.

My wish-list for asserts is: - Works at a SQL level - Unique/clear way to identify asserts (so you're not guessing where the assert came from) +1 - Allows for over-riding individual asserts (so if you need to do something you're not supposed to do you still have the protection of all other asserts) - Less verbose than IF THEN RAISE END IF +1 -- Ali Akbar
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Thu, Oct 9, 2014 at 12:40 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-09 00:21:44 +1300, David Rowley wrote:

Ok, so I've been hacking away at this for a couple of evenings and I think I have a working prototype finally!

Cool!

Patch attached. So it seems it's not quite as efficient as join removal at planning time, but still a big win when it's possible to perform the join skipping.

Have you checked where the overhead is? Is it really just the additional node that the tuples are passed through?

I've not checked this yet, but I'd assume that it has to be from the extra node. I'll run some profiles soon.

Have you measured the overhead of the plan/execution time checks over master?

I did a bit of benchmarking last night, but this was mostly to test that I've not added too much overhead to the nested loop code. For the merge and hash join code I've managed to keep the special skipping code in the EXEC_MJ_INITIALIZE_OUTER and HJ_BUILD_HASHTABLE parts of the main switch statement, so the extra checks should only be performed on the first call of the node when skips are not possible. For nested loop I can't see any other way but to pay the small price of setting the skip flags and checking if there are any skip flags on every call to the node. I tested the overhead of this on my laptop by creating 2 tables with 1 million rows each, joining them on an INT column, where each value of the int column was unique. I seem to have added about a 2% overhead to this. :( Which I was quite surprised at, given it's just 101 million extra settings of the skip flags and checks that the skip flags are not empty; perhaps the extra local variable is causing something else to not make it into a register. At the moment I can't quite see another way to do it, but I guess it may not be the end of the world, as the chances of having to perform a nested loop join on 2 tables of that size are probably not that high.
Test case:

create table t3 (id int primary key);
create table t2 (id int primary key, t3_id int not null references t3);
create table t1 (id int primary key, t2_id int not null references t2);
insert into t3 select x.x from generate_series(1,100) x(x);
insert into t2 select x.x,x.x from generate_series(1,100) x(x);
insert into t1 select x.x,x.x from generate_series(1,100) x(x);
vacuum;
set enable_hashjoin = off;
set enable_mergejoin = off;
select count(*) from t1 inner join t2 on t1.id=t2.id;

Unpatched:

duration: 120 s
number of transactions actually processed: 45
latency average: 2666.667 ms
tps = 0.371901 (including connections establishing)
tps = 0.371965 (excluding connections establishing)

Patched:

duration: 120 s
number of transactions actually processed: 44
latency average: 2727.273 ms
tps = 0.363933 (including connections establishing)
tps = 0.363987 (excluding connections establishing)

102.19%

Of course, if we do the join on the column that has the foreign key, then this is much faster:

test=# select count(*) from t1 inner join t2 on t1.t2_id=t2.id;
 count
-------
   100
(1 row)

Time: 105.206 ms

The explain analyze from the above query looks like:

test=# explain (analyze, costs off, timing off) select count(*) from t1 inner join t2 on t1.t2_id=t2.id;
                       QUERY PLAN
--------------------------------------------------------
 Aggregate (actual rows=1 loops=1)
   ->  Nested Loop (actual rows=100 loops=1)
         ->  Seq Scan on t1 (actual rows=100 loops=1)
         ->  Index Only Scan using t2_pkey on t2 (never executed)
               Index Cond: (id = t1.t2_id)
 Heap Fetches: 0
 Execution time: 124.990 ms
(7 rows)

As you can see, the scan on t2 never occurred. I've so far only managed to come up with 1 useful regression test for this new code. It's not possible to tell if the removal has taken place at plan time, as the plan looks the same as if it didn't get removed. The only way to tell is from the explain analyze, but the problem is that the execution time is shown and can't be removed.
Perhaps I should modify the explain output to tag join nodes that the planner has marked to say skipping may be possible? But this is not really ideal, as it's only the join nodes that know about skipping and they just don't bother executing the child nodes; it's really up to the executor to decide which child nodes don't get called, so adding something to explain might require making it smarter than it needs to be. Right now I'm not quite sure if I should modify any costs. Quite possibly hash joins could have their costing reduced a little to try to encourage hash joins over merge joins, as with merge joins we can't skip sort operations, but with hash joins we can skip the hash table build. Regards David Rowley inner_join_removals_2014-10-15_0f3f1ea.patch Description: Binary data
[HACKERS] Locking for Rename To new_name works differently for different objects
I have observed that for renaming some objects AccessExclusiveLock is taken on the object, whereas for other kinds of objects no lock is taken on the object before renaming it. The objects that are renamed via AlterObjectRename_internal() take the lock (during the get_object_address() call), whereas for other objects there is no lock. I think there is an exception to this, i.e. renaming a table also takes the lock. Refer to the function below:

ExecRenameStmt()
{
    ..
    ..
    case OBJECT_AGGREGATE:
    case OBJECT_COLLATION:
    case OBJECT_CONVERSION:
    case OBJECT_EVENT_TRIGGER:
    case OBJECT_FDW:
    ...
        address = get_object_address(stmt->renameType, stmt->object,
                                     stmt->objarg, &relation,
                                     AccessExclusiveLock, false);
        Assert(relation == NULL);
        catalog = heap_open(address.classId, RowExclusiveLock);
        AlterObjectRename_internal(catalog, address.objectId, stmt->newname);
    ..
}

Is there a reason for the different locking strategy? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] jsonb generator functions
2014-10-13 17:22 GMT+02:00 Andrew Dunstan and...@dunslane.net: On 10/13/2014 09:37 AM, Andrew Dunstan wrote: On 09/26/2014 04:54 PM, Andrew Dunstan wrote:

Here is a patch for the generator and aggregate functions for jsonb that we didn't manage to get done in time for 9.4. They are all equivalents of the similarly named json functions. Included are: to_jsonb, jsonb_build_object, jsonb_build_array, jsonb_object, jsonb_agg, jsonb_object_agg. Still to come: documentation. Adding to the next commitfest.

Revised patch to fix compiler warnings. And again, initializing an incompletely initialized variable, as found by Pavel Stehule.

I checked the code, and I have only two small objections: the name jsonb_object_two_arg is not good - maybe json_object_keys_values? Also, there are no tests for the to_jsonb function. Regards Pavel

cheers andrew
Re: [HACKERS] Buffer Requests Trace
Sorry for taking so long to answer. I am sending attached the patch with the changes I did to the pgsql code. I followed the steps for compiling and installing pgsql from: http://www.postgresql.org/docs/current/static/install-short.html

In summary, the page_id of the page being released in ReleaseBuffer() and ReleaseAndReadBuffer() is written to the file /usr/local/pgsql/data/trace. This file is created manually. I have also created a PrivateDirtyFlag for each backend, in analogy to the PrivateRefCount. I use this to keep track of whether the current backend performed an update operation on a page in the buffer pool or simply a read operation (it is not relevant now). The trace file consists of one line for each ReleaseBuffer() or ReleaseAndReadBuffer() call. The line has the format:

operation,tblSpace,dbNode,relNode,blockNumber

Once the trace file is complete after the execution of the tpcc benchmark, I use the following bash script to get only unique pages:

cut -d',' -f2-5 trace | sort -n -t',' -k1 -k2 -k3 -k4 | uniq

Today I realized that I was making a mistake in executing the oltpbenchmark application. Of the 64 warehouses created for tpcc, only 1 was being accessed (the 14k distinct pages that I mentioned). I increased the terminals option of the tpcc benchmark from 1 to 64, resulting in one terminal for each warehouse. This gave me a higher number of distinct pages being accessed. Unfortunately, of the 800k pages in the database (64 warehouses), executing tpcc for 10 min resulted in 400k distinct pages being accessed. This number is much better than the previous results, but I think it is still not realistic.
I would like to thank you guys for all the attention given to my problem :) On Wed, Oct 15, 2014 at 9:49 AM, Simon Riggs si...@2ndquadrant.com wrote: On 14 October 2014 17:08, Lucas Lersch lucasler...@gmail.com wrote: Unfortunately, in the generated trace with over 2 million buffer requests, only ~14k different pages are being accessed, out of the 800k of the whole database. Am I missing something here? We can't tell what you're doing just by knowing the number of unique items in your list. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Lucas Lersch
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: MauMau maumau...@gmail.com

Thank you very much. I didn't anticipate such a difficult, complicated cause. The user agreed to try the patch tonight. I'll report back the result as soon as I get it from him.

The test ran successfully without a hang for 24 hours. It was run with your patch + the following:

BTW, in LWLockWaitForVar(), the first line of the following code fragment is not necessary, because lwWaitLink is set to head immediately. I think it would be good to eliminate as much unnecessary code as possible from the spinlock section.

proc->lwWaitLink = NULL;
/* waiters are added to the front of the queue */
proc->lwWaitLink = lock->head;

Regards MauMau
[HACKERS] WIP: Access method extendability
Hackers, Postgres was initially designed to support access method extendability, and this extendability lives on to the present day. However, it is mostly internal, in-core extendability. One can quite easily add a new access method into the PostgreSQL core. But if one tries to implement an access method as an external module, one is faced with the following difficulties:

1. Need to insert directly into pg_am, because there is no CREATE ACCESS METHOD command, and no support for dependencies between an AM and opclasses etc.
2. A module can't define xlog records, so a new AM would not be WAL-logged.

The first problem is purely mechanical. Nothing prevents us from implementing CREATE ACCESS METHOD and DROP ACCESS METHOD commands and supporting all the required dependencies. The problem of WAL is a bit more complex. According to previous discussions, we don't want to let extensions declare their own xlog records: if we let them, the recovery process would depend on extensions, which badly violates reliability. The solution is to implement a generic xlog record which is able to represent the difference between blocks in a general manner.

3 patches are attached:

1. CREATE/DROP ACCESS METHOD commands, with support for dependencies.
2. A new generic xlog record type.
3. The Bloom contrib module as an example of usage of the previous two features. This module was posted a few years ago by Teodor Sigaev. Now it's wrapped as an extension and WAL-logged.

The patches are in a WIP state: no documentation and very few comments. However, I believe it's enough to do some general concept review.

Some notes about the generic xlog format. A generic xlog record represents the difference between pages as a set of operations of two kinds:

1. Move memory inside the page. An optional flag requests zeroing the gap at the previous memory location.
2. Copy a fragment of memory from the xlog record to the page. As an option, bitwise logical operations are supported as well as plain copy.

Generic xlog can open a page in two modes:

1. Create mode: the page is zeroed independent of its LSN.
2. Update mode: the page is updated only if its LSN is lower than the record LSN.

Usually, an xlog record is filled in a critical section, where memory allocation is prohibited. Thus, the user has to initialize it beforehand with knowledge of the page count, total operation count and total length of data.

-- With best regards, Alexander Korotkov. create-am.1.patch.gz Description: GNU Zip compressed data generic-xlog.1.patch.gz Description: GNU Zip compressed data bloom-contrib.1.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Buffer Requests Trace
On 15 October 2014 12:49, Lucas Lersch lucasler...@gmail.com wrote: Sorry for taking so long to answer. I am sending attached the patch with the changes I did to pgsql code. I followed the steps for compiling and installing pgsql from: http://www.postgresql.org/docs/current/static/install-short.html

Are you recording the bufferid or the blockid? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal for better support of time-varying timezone abbreviations
On 15.10.2014 00:26, Tom Lane wrote:

* I've not touched ecpg except for cosmetic changes to keep the struct definitions in sync, and to fix the previously-mentioned bogus free() attempt. I doubt that it would be worth teaching ecpg how to access the zic timezone database --- the problem of configuring where to find those files seems like more trouble than it's worth given the lack of complaints. I'm not sure what we should do about the obsolete timezone abbreviations in its table.

Maybe we should just remove them for the new release. Yes, that might break some applications, but then the server doesn't know these either, so the applications might break anyway.

Michael -- Michael Meskes Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org Jabber: michael.meskes at gmail dot com VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] Buffer Requests Trace
I am recording the BufferDesc.tag.blockNum for the buffer along with the spcNode, dbNode, relNode, also present in the tag. On Wed, Oct 15, 2014 at 2:27 PM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 12:49, Lucas Lersch lucasler...@gmail.com wrote: Sorry for taking so long to answer. I am sending attached the patch with the changes I did to pgsql code. I followed the steps for compiling and installing pgsql from: http://www.postgresql.org/docs/current/static/install-short.html Are you recording the bufferid or the blockid? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Lucas Lersch
Re: [HACKERS] Buffer Requests Trace
On 15 October 2014 13:44, Lucas Lersch lucasler...@gmail.com wrote: I am recording the BufferDesc.tag.blockNum for the buffer along with the spcNode, dbNode, relNode, also present in the tag. The TPC-C I/O is random, so if you run it for longer you should see a wider set. Caching isn't possible as a way to improve txn rates. Check that you're touching all tables. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Buffer Requests Trace
So is it possibly normal behavior that running tpcc for 10min only accesses 50% of the database? Furthermore, is there a guideline of parameters for tpcc (# of warehouses, execution time, operations weight)? On Wed, Oct 15, 2014 at 3:09 PM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 13:44, Lucas Lersch lucasler...@gmail.com wrote: I am recording the BufferDesc.tag.blockNum for the buffer along with the spcNode, dbNode, relNode, also present in the tag. The TPC-C I/O is random, so if you run it for longer you should see a wider set. Caching isn't possible as a way to improve txn rates. Check that you're touching all tables. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Lucas Lersch
Re: [HACKERS] narwhal and PGDLLIMPORT
On 10/15/2014 01:53 AM, Craig Ringer wrote: On 10/15/2014 12:53 PM, Noah Misch wrote: Windows Server 2003 isn't even EOL yet. I'd welcome a buildfarm member with that OS and a modern toolchain. It's possible to run multiple buildfarm animals on a single Windows instance, each with a different toolchain. Indeed, and even using the same buildroot. That's been built into the buildfarm for a long time. For example, nightjar and friarbird do this. There's the chance of interactions that can't occur if each SDK is used in isolation on its own machine, but it should be pretty minimal. Make sure each has a distinct base_port. cheers andrew
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, Oct 15, 2014 at 4:18 AM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 05:13, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: For parallelism, I think we need a concept of group locking. That is, suppose we have a user backend and N worker backends collaborating to execute some query. For the sake of argument, let's say it's a parallel CLUSTER, requiring a full table lock. We need all of the backends to be able to lock the table even though at least one of them holds AccessExclusiveLock. This suggests that the backends should all be members of a locking group, and that locks within the same locking group should be regarded as mutually non-conflicting. In the background worker case, I imagined that the foreground process would hold a lock and the background processes would just assume they could access the table without holding locks of their own. Aren't you building a mountain where a molehill would do? Yeh. Locks should be made in the name of the main transaction and released by it. When my family goes to a restaurant, any member of the party may ask for a table and the request is granted for the whole family. But the lock is released only when I pay the bill. Once we have the table, any stragglers know we have locked the table and they just come sit at the table without needing to make their own lock request to the Maitre D', though they clearly cache the knowledge that we have the table locked. So all lock requests held until EOX should be made in the name of the top level process. Any child process wanting a lock should request it, but on discovering it is already held at parent level should just update the local lock table. Transient locks, like catalog locks can be made and released locally; I think there is more detail there but it shouldn't affect the generalisation. Hmm, interesting idea. 
Suppose, though, that the child process requests a lock that can't immediately be granted, because the catalog it's trying to access is locked in AccessExclusiveLock mode by an unrelated transaction. The unrelated transaction, in turn, is blocked trying to acquire some resource held by the top level parallelism process. Assuming the top level parallelism process is waiting for the child (or will eventually wait), this is a deadlock, but without some modification to the deadlock detector, it can't see one of the edges. Figuring out what to do about that is really the heart of this project, I think, and there are a variety of designs possible. One of the early ideas that I had was to have the parallel workers directly twaddle the main processes' PGPROC and lock table state. In other words, instead of taking locks using their own PGPROCs, everybody uses a single PGPROC. I made several attempts at getting designs along these lines off the ground, but it got complicated and invasive: (1) The processes need to coordinate to make sure that you don't have two people twaddling the lock state at the same time; (2) The existing data structures won't support more than one process waiting at a time, but there's no reason why one parallel worker couldn't be trying to lock one catalog while another one is trying to lock a different catalog; (3) On a related note, when a lock wait ends, you can't just wake up the process that owns the PGPROC, but rather the one that's actually waiting; (4) the LWLockReleaseAll algorithm just falls apart in this environment, as far as I can see. The alternative design which I've been experimenting with is to have each process use its own PGPROC and PROCLOCK structures, but to tag each PROCLOCK with not only the owning PGPROC but also the group leader's PGPROC. This has not been entirely smooth sailing, but it seems to break much less code than trying to have everybody use one PGPROC. 
Most of the changes that seem to be needed to make things work are pretty well-isolated; rather than totally rearranging the lock manager, you're just adding extra code that runs only in the parallel case. I'm definitely open to the idea that there's a better, simpler design out there, but I haven't been able to think of one that doesn't break deadlock detection. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
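One way to picture the missing edge discussed above: if every member of a locking group is treated as waiting on every other member, an ordinary cycle search over the wait-for graph exposes the deadlock. A toy sketch, not the actual deadlock detector, with hypothetical process names:

```python
# Toy wait-for graph: the deadlock is only visible once parallel-group
# members are treated as (implicitly) waiting on each other.
from collections import defaultdict

def has_cycle(edges, start):
    """Iterative DFS cycle check over wait-for edges reachable from `start`."""
    seen, on_path = set(), {start}
    stack = [(start, iter(edges[start]))]
    while stack:
        node, it = stack[-1]
        nxt = next(it, None)
        if nxt is None:
            stack.pop()
            on_path.discard(node)
            continue
        if nxt in on_path:
            return True       # reached a node already on the current path
        if nxt not in seen:
            seen.add(nxt)
            on_path.add(nxt)
            stack.append((nxt, iter(edges[nxt])))
    return False

edges = defaultdict(set)
# Child worker B waits on a lock held by unrelated transaction X ...
edges["B"].add("X")
# ... and X waits on a lock held by the group leader A.
edges["X"].add("A")
# Without a group edge there is no visible cycle:
assert not has_cycle(edges, "B")
# Add the implicit edge "leader A waits for its worker B":
edges["A"].add("B")
assert has_cycle(edges, "B")   # now the deadlock is detectable
```

Treating all group members as mutually waiting (as Robert proposes for simplicity) can only over-report edges, so it may flag a few false deadlocks but never misses a real one.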
Re: [HACKERS] Buffer Requests Trace
* Lucas Lersch (lucasler...@gmail.com) wrote: So is it a possible normal behavior that running tpcc for 10min only access 50% of the database? Furthermore, is there a guideline of parameters for tpcc (# of warehouses, execution time, operations weight)? Depends- you may be aware that we support index-only scans in certain situations. This means that only the index page for a given relation (and the visibility map) are accessed, and the heap is not. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Proposal for better support of time-varying timezone abbreviations
Michael Meskes mes...@postgresql.org writes: On 15.10.2014 00:26, Tom Lane wrote: * I've not touched ecpg except for cosmetic changes to keep the struct definitions in sync, and to fix the previously-mentioned bogus free() attempt. I doubt that it would be worth teaching ecpg how to access the zic timezone database --- the problem of configuring where to find those files seems like more trouble than it's worth given the lack of complaints. I'm not sure what we should do about the obsolete timezone abbreviations in its table. Maybe we should just remove them for the new release. Yes, that might break some applications, but then the server doesn't know these either, so the applications might break anyway. The same thought had occurred to me. Probably the main use of the datetime parsing code in ecpg is for interpreting outputs from the server, and (at least by default) the server doesn't use timezone abbreviations when printing timestamps. So maybe that's largely dead code anyhow. I would not propose back-patching such a change, but we could try it in 9.5 and see if anyone complains. A less drastic remedy would be to remove just those abbreviations whose meaning has actually changed over time. Eventually that might be all of them ... but in the meantime, we could at least argue that we weren't breaking any case that worked well before. regards, tom lane
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 10/03/2014 07:08 AM, Kouhei Kaigai wrote: Hello, I recently ran into trouble developing my extension, which utilizes shared buffers, when it releases each buffer page. This extension transfers contents of the shared buffers to a GPU device using DMA, then kicks a device kernel. Wow, that sounds crazy. Once a backend/extension calls ReadBuffer(), resowner.c tracks which buffer was referenced by the current resource owner, to ensure these buffers are released at the end of the transaction. However, it seems to me the implementation of resowner.c didn't assume many buffers being referenced by a particular resource owner simultaneously. It manages the buffer index using an expandable array, then looks up the target buffer by a sequential walk from the tail, because recently pinned buffers tend to be released first. This caused trouble in my case. My extension pinned many thousands of buffers, so owner->buffers[] was enlarged and became expensive to walk. In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in total during a hash join of 2M rows, even though the hash join itself takes less than 16 seconds. What is the best way to solve the problem? How about creating a separate ResourceOwner for these buffer pins, and doing a wholesale ResourceOwnerRelease() on it when you're done? Let me clarify your idea. 1. Create a separate ResourceOwner under the CurrentResourceOwner. 2. Switch CurrentResourceOwner to the self-constructed resource owner during ReadBuffer() on thousands of buffers. 3. Switch back to the original CurrentResourceOwner. 4. Kick DMA and run the GPU device kernel (the actual job). 5. Switch CurrentResourceOwner to the self-constructed resource owner again, during ReleaseBuffer() in reverse order. 6. Switch back to the original CurrentResourceOwner, then call ResourceOwnerDelete(). Hmm... at the moment, I cannot see anything hard to implement here. I'd like to try this idea and report back later. Let me share the result. 
The above approach eliminated my problem. An individual resource owner for each set of shared buffers, with ReleaseBuffer() called in reverse order, reduced the time to release pinned buffers from 36 sec to 67.3 msec when I ran a hash join on 20M records; that workload pins up to 59,168 buffers concurrently (32 asynchronous requests, each request packing 1,849 buffers). Thanks for the suggestion!

postgres=# explain analyze select * from t0 natural join t1 natural join t2;
INFO: time to release buffers: 67.289ms
                              QUERY PLAN
---------------------------------------------------------------------------
 Custom (GpuHashJoin) (cost=3468.00..471640.11 rows=19740924 width=139) (actual time=193.086..5913.459 rows=2000 loops=1)
   pseudo scan tlist: 1:(t0.bid), 2:(t0.aid), 3:t0.id, 4:t0.cat, 5:t0.cid, 6:t0.did, 7:t0.x, 8:t0.y, 9:t0.z, 11:t2.btext, 12:[t2.bid], 10:t1.atext, 13:[t1.aid]
   hash clause 1: (t0.bid = t2.bid)
   hash clause 2: (t0.aid = t1.aid)
   -> Custom (GpuScan) on t0 (cost=500.00..467167.24 rows=2024 width=73) (actual time=7.757..1056.426 rows=2000 loops=1)
   -> Custom (MultiHash) (cost=734.00..734.00 rows=4 width=37) (actual time=23.382..23.382 rows=4 loops=1)
        hash keys: bid
        -> Seq Scan on t2 (cost=0.00..734.00 rows=4 width=37) (actual time=0.007..5.124 rows=4 loops=1)
   -> Custom (MultiHash) (cost=734.00..734.00 rows=4 width=37) (actual time=11.919..11.919 rows=4 loops=1)
        hash keys: aid
        -> Seq Scan on t1 (cost=0.00..734.00 rows=4 width=37) (actual time=0.010..5.299 rows=4 loops=1)
 Planning time: 0.904 ms
 Execution time: 6667.986 ms
(13 rows)

-- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
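The speedup reported here follows from how resowner.c scans its buffer array: ResourceOwnerForgetBuffer() searches from the tail, so forgetting buffers in reverse pin order costs O(1) per buffer, while forgetting them in pin order degrades to O(N^2) in total. A toy model of that behavior (not the actual resowner.c code; the class and counter names are hypothetical):

```python
# Toy model of resowner.c's buffer array: forget() scans from the tail,
# so releasing pins in reverse order of acquisition costs O(1) each,
# while releasing in acquisition order costs O(N^2) in total.

class ToyResourceOwner:
    def __init__(self):
        self.buffers = []
        self.scan_steps = 0          # instrumentation, not in the real code

    def remember(self, buf):
        self.buffers.append(buf)

    def forget(self, buf):
        # Walk from the tail, like ResourceOwnerForgetBuffer().
        for i in range(len(self.buffers) - 1, -1, -1):
            self.scan_steps += 1
            if self.buffers[i] == buf:
                del self.buffers[i]
                return
        raise ValueError("buffer not owned")

n = 1000
fifo, lifo = ToyResourceOwner(), ToyResourceOwner()
for owner in (fifo, lifo):
    for b in range(n):
        owner.remember(b)
for b in range(n):                   # forward order: quadratic total work
    fifo.forget(b)
for b in reversed(range(n)):         # reverse order: linear total work
    lifo.forget(b)
assert lifo.scan_steps == n                  # one step per forget
assert fifo.scan_steps == n * (n + 1) // 2   # 500500 steps for n=1000
```

Heikki's suggestion of a dedicated ResourceOwner with a wholesale release avoids the per-buffer search entirely; the reverse-order release shown above is the complementary trick when buffers must be forgotten one by one.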
Re: [HACKERS] Locking for Rename To new_name works differently for different objects
Amit Kapila amit.kapil...@gmail.com writes: I have observed that for renaming some of the objects AccessExclusiveLock is taken on object whereas for other kind of objects no lock is taken on object before renaming the object. The usual theory for DDL updates of all types (not just rename) is that an explicit lock is only needed for objects whose catalog representation comprises more than one row. Otherwise, the implicit locking involved in updating that row is sufficient to serialize different updates. regards, tom lane
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 15 October 2014 14:46, Robert Haas robertmh...@gmail.com wrote: When my family goes to a restaurant, any member of the party may ask for a table and the request is granted for the whole family. But the lock is released only when I pay the bill. Once we have the table, any stragglers know we have locked the table and they just come sit at the table without needing to make their own lock request to the Maitre D', though they clearly cache the knowledge that we have the table locked. Hmm, interesting idea. Suppose, though, that the child process requests a lock that can't immediately be granted, because the catalog it's trying to access is locked in AccessExclusiveLock mode by an unrelated transaction. The unrelated transaction, in turn, is blocked trying to acquire some resource held by the top level parallelism process. Assuming the top level parallelism process is waiting for the child (or will eventually wait), this is a deadlock, but without some modification to the deadlock detector, it can't see one of the edges. Family disputes are fairly easily resolved ;-) The first and basic point is that in most cases the parent should already hold the required locks. This can only happen for briefly held locks and/or more complex stuff. In the first case, getting parallelism to work without that complex stuff would be useful. I'd be happy if the first version simply throws an error if a child can't acquire a lock immediately. Don't overthink the first version. Knowing you'll disagree, let's take a further step... Second point, the relationship between parent and children is clear. If we do a deadlock detection, we should be able to search for that as a special case, since we will know that we are a child and that such a situation might occur. So just add in an edge so the rest of the deadlock code works fine. If that doesn't work, use a heuristic. If the parent is waiting when the child does a deadlock test, assume it's a deadlock and abort the child speculatively just in case. 
You can work out how to do that better in the future, since it won't happen that often. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal : REINDEX SCHEMA
On Mon, Oct 13, 2014 at 11:16 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Oct 12, 2014 at 1:27 PM, Stephen Frost sfr...@snowman.net wrote: * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: Sawada Masahiko wrote: Attached WIP patch adds new syntax REINDEX SCHEMA, which reindexes all tables of a specified schema. There is syntax for reindexing a specified index, per table, and per database, but we cannot reindex per schema for now. It seems doubtful that there really is much use for this feature, but if there is, I think a better syntax precedent is the new ALTER TABLE ALL IN TABLESPACE thingy, rather than your proposed REINDEX SCHEMA. Something like REINDEX TABLE ALL IN SCHEMA perhaps. Yeah, I tend to agree that we should be looking at the 'ALL IN TABLESPACE' and 'ALL IN SCHEMA' type of commands to keep things consistent. This might be an alternative for the vacuum / analyze / reindex database commands also.. Urgh. I don't have a problem with that syntax in general, but it clashes pretty awfully with what we're already doing for REINDEX otherwise. Attached are the latest versions of the patches. I changed the syntax to REINDEX ALL IN SCHEMA, but I still feel a little uncomfortable with it, as Robert mentioned. Anyway, you can apply these patches in numerical order to use the REINDEX ALL IN SCHEMA feature and the -S/--schema option in reindexdb. 000_reindex_all_in_schema_v2.patch : It contains the REINDEX ALL IN SCHEMA feature 001_Add_schema_option_to_reindexdb_v1.patch : It contains reindexdb -S/--schema support Please review and comment. Regards, --- Sawada Masahiko 000_reindex_all_in_schema_v2.patch Description: Binary data 001_Add_schema_option_to_reindexdb_v1.patch Description: Binary data
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
Stephen Frost wrote: Yes, I will update the patch. Still planning to do this..? Marking this back to waiting-for-author. Yes, but probably not for this commitfest unfortunately. Fair enough, I'll mark it 'returned with feedback'. We lost this patch for the October commitfest, didn't we? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: We lost this patch for the October commitfest, didn't we? I'm guessing you missed that a new version just got submitted..? I'd be fine with today's being added to the october commitfest.. Of course, there's a whole independent discussion to be had about how there wasn't any break between last commitfest and this one, but that probably deserves its own thread. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Buffer Requests Trace
I got the following numbers from my tpcc database: Data size: ~6059MB Index size: ~1390MB Total size: ~7400MB Even considering index-only scans, the ratio of around 50% of the database pages being accessed seems unrealistic to me. On Wed, Oct 15, 2014 at 3:50 PM, Stephen Frost sfr...@snowman.net wrote: * Lucas Lersch (lucasler...@gmail.com) wrote: So is it a possible normal behavior that running tpcc for 10min only access 50% of the database? Furthermore, is there a guideline of parameters for tpcc (# of warehouses, execution time, operations weight)? Depends- you may be aware that we support index-only scans in certain situations. This means that only the index page for a given relation (and the visibility map) are accessed, and the heap is not. Thanks, Stephen -- Lucas Lersch
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
Stephen Frost wrote: * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: We lost this patch for the October commitfest, didn't we? I'm guessing you missed that a new version just got submitted..? Which one, reindex schema? Isn't that a completely different patch? I'd be fine with today's being added to the october commitfest.. Of course, there's a whole independent discussion to be had about how there wasn't any break between last commitfest and this one, but that probably deserves its own thread. It's not the first time that has happened, and honestly I don't see all that much cause for concern. Heikki did move pending patches to the current one, and closed a lot of inactive ones as 'returned with feedback'. Attentive patch authors should have submitted new versions ... if they don't, then someone else with an interest in the patch should do so. If no one updates the patches, what do we want them for? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, Oct 15, 2014 at 10:12 AM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 14:46, Robert Haas robertmh...@gmail.com wrote: When my family goes to a restaurant, any member of the party may ask for a table and the request is granted for the whole family. But the lock is released only when I pay the bill. Once we have the table, any stragglers know we have locked the table and they just come sit at the table without needing to make their own lock request to the Maitre D', though they clearly cache the knowledge that we have the table locked. Hmm, interesting idea. Suppose, though, that the child process requests a lock that can't immediately be granted, because the catalog it's trying to access is locked in AccessExclusiveLock mode by an unrelated transaction. The unrelated transaction, in turn, is blocked trying to acquire some resource, which the top level parallelism process. Assuming the top level parallelism process is waiting for the child (or will eventually wait), this is a deadlock, but without some modification to the deadlock detector, it can't see one of the edges. Family disputes are fairly easily resolved ;-) The first and basic point is that in most cases the parent should already hold the required locks. This can only happen for briefly held locks and/or more complex stuff. In the first case, getting parallelism to work without that complex stuff would be useful. I'd be happy if the first version simply throws an error if a child can't acquire a lock immediately. Don't overthink the first version. Knowing you'll disagree, lets take a further step... Well, I'm fervently in agreement with you on one point: the first version of all this needs to be as simple as possible, or the time to get to the first version will be longer than we can afford to wait. I think what we're discussing here is which things are important enough that it makes sense to have them in the first version, and which things can wait. 
I also think we are in agreement that at least SOME thought about lock management is needed; the question we're trying to hash out is whether what I'm proposing to try to do here is significantly more ambitious than what's really necessary for V1. Which is a good question. Second point, the relationship between parent and children is clear. If we do a deadlock detection, we should be able to search for that as a special case, since we will know that we are a child and that such a situation might occur. So just add in an edge so the rest of the deadlock code works fine. If that doesn't work, use a heurisic. If parent is waiting when child does deadlock test, assume its a deadlock and abort the child speculatively just in case. You can work out how to do that better in the future, since it won't happen that often. Well, the deadlock checker needs to work not only when one of the parallel processes invokes it, but also when somebody else invokes it. For example, suppose we have parallel processes A and B. A holds lock 1 and awaits lock 2, while B holds awaits lock 3. No problem! Now process X, which is holding lock 2 tries to grab lock 1. X must be able to detect the deadlock. Now, that's not necessarily a problem with what you said: it just means that the parent-child relationship has to be clear from the contents of shared memory. Which is pretty much the direction I'm aiming with the incomplete patch I posted. For the sake of simplicity, I want to assume that every process in a locking group waits for every other process in the same locking group, even though that might not be technically true in every case. If the parent is waiting for a lock and the guy who has that lock is waiting for a parallel worker in the same group, my plan (which I think matches your suggestion) is to call that a deadlock. 
There are a few sticky points here: sometimes, the deadlock checker doesn't actually abort transactions, but just rearranges the wait queue, so something at least somewhat sensible needs to happen in that case. The thing that I'm aiming to do in the patch which I think you are suggesting might not be necessary is to make it possible for the child go ahead and request AccessShareLock on the scan relation even though the parent might already hold some other lock (perhaps even AccessExclusiveLock). I want to make the lock manager smart enough to recognize that those locks are mutually non-conflicting because of the fact that the two processes are in close cooperation. Clearly, we could get by without that, but it adds complexity in other places: the parent has to never release its lock (even if killed) until the child is done with the relation; and the scan code itself needs to be conditional, passing NoLock from children and some other mode in the parent. That's all manageable, but it looks to me like doing the necessary surgery on the lock manager isn't actually going to be that hard; most of the necessary logic is present in
Re: [HACKERS] Locking for Rename To new_name works differently for different objects
On Wed, Oct 15, 2014 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: I have observed that for renaming some of the objects AccessExclusiveLock is taken on object whereas for other kind of objects no lock is taken on object before renaming the object. The usual theory for DDL updates of all types (not just rename) is that an explicit lock is only needed for objects whose catalog representation comprises more than one row. Otherwise, the implicit locking involved in updating that row is sufficient to serialize different updates. That's an interesting point that I hadn't considered, but I'm willing to believe that at least some of the differences might also be haphazard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Fri, Aug 8, 2014 at 4:08 PM, Guillaume Lelarge guilla...@lelarge.info wrote: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula: greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 ) And we have lots of alerts from the script for customers who set their wal_keep_segments setting higher than 0. So we started to question this sentence of the documentation: There will always be at least one WAL segment file, and will normally not be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files. (http://www.postgresql.org/docs/9.3/static/wal-configuration.html) While doing some tests, it appears it would be more something like: wal_keep_segments + (2 + checkpoint_completion_target) * checkpoint_segments + 1 But after reading the source code (src/backend/access/transam/xlog.c), the right formula seems to be: wal_keep_segments + 2 * checkpoint_segments + 1 Here is how we went to this formula... CreateCheckPoint(..) is responsible, among other things, for deleting and recycling old WAL files. From src/backend/access/transam/xlog.c, master branch, line 8363: /* * Delete old log files (those no longer needed even for previous * checkpoint or the standbys in XLOG streaming). */ if (_logSegNo) { KeepLogSeg(recptr, _logSegNo); _logSegNo--; RemoveOldXlogFiles(_logSegNo, recptr); } KeepLogSeg(...) function takes care of wal_keep_segments. 
From src/backend/access/transam/xlog.c, master branch, line 8792: /* compute limit for wal_keep_segments first */ if (wal_keep_segments > 0) { /* avoid underflow, don't go below 1 */ if (segno <= wal_keep_segments) segno = 1; else segno = segno - wal_keep_segments; } IOW, the segment number (segno) is decremented according to the setting of wal_keep_segments. segno is then sent back to CreateCheckPoint(...) via _logSegNo. RemoveOldXlogFiles() gets this segment number so that it can remove or recycle all files before this segment number. This function gets the number of WAL files to recycle with the XLOGfileslop constant, which is defined as: /* * XLOGfileslop is the maximum number of preallocated future XLOG segments. * When we are done with an old XLOG segment file, we will recycle it as a * future XLOG segment as long as there aren't already XLOGfileslop future * segments; else we'll delete it. This could be made a separate GUC * variable, but at present I think it's sufficient to hardwire it as * 2*CheckPointSegments+1. Under normal conditions, a checkpoint will free * no more than 2*CheckPointSegments log segments, and we want to recycle all * of them; the +1 allows boundary cases to happen without wasting a * delete/create-segment cycle. */ #define XLOGfileslop (2*CheckPointSegments + 1) (in src/backend/access/transam/xlog.c, master branch, line 100) IOW, PostgreSQL will keep wal_keep_segments WAL files before the current WAL file, and then there may be 2*CheckPointSegments + 1 recycled ones. Hence the formula: wal_keep_segments + 2 * checkpoint_segments + 1 And this is what we usually find in our customers' servers. We may find more WAL files, depending on the write activity of the cluster, but on average we get this number of WAL files. AFAICT, the documentation is wrong about the usual number of WAL files in the pg_xlog directory. 
But I may be wrong, in which case, the documentation isn't clear enough for me, and should be fixed so that others can't misinterpret it like I may have done. Any comments? did I miss something, or should we fix the documentation? I think you're right. The correct formula of the number of WAL files in pg_xlog seems to be (3 + checkpoint_completion_target) * checkpoint_segments + 1 or wal_keep_segments + 2 * checkpoint_segments + 1 Why? At the end of checkpoint, the WAL files which were generated since the start of previous checkpoint cannot be removed and must remain in pg_xlog. The number of them is (1 + checkpoint_completion_target) * checkpoint_segments or wal_keep_segments Also, at the end of checkpoint, as you pointed out, if there are *many* enough old WAL files, 2 * checkpoint_segments + 1 WAL files will be recycled. Then checkpoint_segments WAL files will be consumed till the end of next checkpoint. But since there are already 2 * checkpoint_segments + 1 recycled WAL files, no more files are increased. So, WAL files that we cannot remove and can recycle at the
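For reference, the two competing bounds discussed in this thread can be compared numerically. The sketch below simply evaluates the formulas quoted above; it is not derived from the server code, and the function names are hypothetical:

```python
# The two candidate upper bounds for the number of files in pg_xlog
# discussed above, computed from the configuration parameters.

def docs_formula(checkpoint_segments, wal_keep_segments,
                 checkpoint_completion_target):
    """Bound as stated in the (questioned) documentation."""
    return max((2 + checkpoint_completion_target) * checkpoint_segments + 1,
               checkpoint_segments + wal_keep_segments + 1)

def observed_formula(checkpoint_segments, wal_keep_segments):
    """Bound derived in the thread from xlog.c: kept segments before the
    current file, plus up to 2*CheckPointSegments+1 recycled ones."""
    return wal_keep_segments + 2 * checkpoint_segments + 1

# With wal_keep_segments = 0 the two bounds are close, but a nonzero
# wal_keep_segments pushes the observed bound above the documented one:
assert observed_formula(3, 0) == 7
assert observed_formula(3, 32) == 39
assert docs_formula(3, 32, 0.5) == 36
```

This matches the thread's observation: monitoring scripts built on the documented formula start alerting precisely on clusters where wal_keep_segments is set higher than 0.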
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas robertmh...@gmail.com wrote: With regard for using a hash table for the buffer mapping lock I'm doubtful that any form of separate chaining is the right one. We currently have a quite noticeable problem with the number of cache misses in the buffer mapping hash (and also the heavyweight lock mgr) - if we stay with hashes that's only going to be fundamentally lower than dynahash if we change the type of hashing. I've had good, *preliminary*, results using an open addressing + linear probing approach. I'm very skeptical of open addressing + linear probing. Such hash tables tend to degrade over time, because you have to leave tombstones behind to ensure that future scans advance far enough. There's really no way to recover without rebuilding the whole hash table, and eventually it degrades to linear search. If we're spending too much time walking hash chains, I think the solution is to increase the number of buckets so that the chains get shorter. What about cuckoo hashing? There was a recent paper on how to do fine grained locking with cuckoo hashes. [1] I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way associativity). This allows us to fit the bucket onto 2 regular sized cache lines and have 8 bytes left over. Buckets would be protected by seqlocks stored in the extra space. On the read side we would only need 2 read barriers (basically free on x86), and we are guaranteed to have an answer by fetching two pairs of cache lines. We can trade memory bandwidth for latency by issuing prefetches (once we add primitives for this). Alternatively we can trade insert speed for lookup speed by using asymmetrically sized tables. 
[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf

Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
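A rough, single-writer sketch of the lookup side Ants describes — bucketized cuckoo hashing with a per-bucket seqlock — might look like the following. Every name, size, and hash function here is illustrative, not from any PostgreSQL patch; the point is only the shape of the seqlock read protocol and the two-bucket lookup:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 5-way buckets, with a spare word used as a seqlock,
 * roughly as Ants suggests.  Tag 0 means "empty slot", so key 0 is not
 * storable in this toy. */
#define BUCKET_SLOTS 5
#define NBUCKETS 1024                  /* power of two for cheap masking */

typedef struct {
    uint32_t tags[BUCKET_SLOTS];       /* key tags; 0 = empty */
    uint32_t vals[BUCKET_SLOTS];
    uint64_t seq;                      /* even = stable, odd = writer active */
} Bucket;

static Bucket table[NBUCKETS];

/* Two independent hash functions (toy mixers, not production quality). */
static uint32_t hash1(uint32_t k) { k ^= k >> 16; k *= 0x85ebca6b; return k ^ (k >> 13); }
static uint32_t hash2(uint32_t k) { k *= 0xc2b2ae35; k ^= k >> 15; return k * 0x27d4eb2f; }

/* Seqlock-protected scan of one bucket; retries if a writer raced us.
 * The two acquire loads of seq are the "2 read barriers" of the post. */
static int bucket_lookup(Bucket *b, uint32_t key, uint32_t *val)
{
    for (;;) {
        uint64_t s0 = __atomic_load_n(&b->seq, __ATOMIC_ACQUIRE);
        if (s0 & 1)
            continue;                  /* writer in progress, retry */
        int found = 0;
        uint32_t v = 0;
        for (int i = 0; i < BUCKET_SLOTS; i++)
            if (b->tags[i] == key) { found = 1; v = b->vals[i]; }
        uint64_t s1 = __atomic_load_n(&b->seq, __ATOMIC_ACQUIRE);
        if (s0 == s1) { if (found) *val = v; return found; }
    }
}

/* Cuckoo lookup: a key lives in exactly one of its two candidate buckets,
 * so two bucket probes bound the work. */
int cuckoo_lookup(uint32_t key, uint32_t *val)
{
    if (bucket_lookup(&table[hash1(key) & (NBUCKETS - 1)], key, val))
        return 1;
    return bucket_lookup(&table[hash2(key) & (NBUCKETS - 1)], key, val);
}

/* Single-threaded insert sketch: no kick-out relocation, which is the part
 * the cited paper spends most of its effort making concurrent. */
int cuckoo_insert(uint32_t key, uint32_t val)
{
    uint32_t hs[2] = { hash1(key), hash2(key) };
    for (int h = 0; h < 2; h++) {
        Bucket *b = &table[hs[h] & (NBUCKETS - 1)];
        for (int i = 0; i < BUCKET_SLOTS; i++) {
            if (b->tags[i] == 0) {
                __atomic_store_n(&b->seq, b->seq + 1, __ATOMIC_RELEASE); /* odd */
                b->tags[i] = key;
                b->vals[i] = val;
                __atomic_store_n(&b->seq, b->seq + 1, __ATOMIC_RELEASE); /* even */
                return 1;
            }
        }
    }
    return 0;   /* both buckets full: real code would relocate entries */
}
```

The 5-slot bucket mirrors the 5-way associativity in the message; whether the seqlock retry loop is acceptable under heavy write contention is exactly the sort of thing that would need benchmarking.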
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 15/10/2014 10:32 AM, Ants Aasma wrote: On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas robertmh...@gmail.com wrote: With regard for using a hash table for the buffer mapping lock I'm doubtful that any form of separate chaining is the right one. We currently have a quite noticeable problem with the number of cache misses in the buffer mapping hash (and also the heavyweight lock mgr) - if we stay with hashes that's only going to be fundamentally lower than dynahash if we change the type of hashing. I've had good, *preliminary*, results using an open addressing + linear probing approach. I'm very skeptical of open addressing + linear probing. Such hash tables tend to degrade over time, because you have to leave tombstones behind to ensure that future scans advance far enough. There's really no way to recover without rebuilding the whole hash table, and eventually it degrades to linear search. If we're spending too much time walking hash chains, I think the solution is to increase the number of buckets so that the chains get shorter. What about cuckoo hashing? There was a recent paper on how to do fine grained locking with cuckoo hashes. [1] I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way associativity). This allows us to fit the bucket onto 2 regular sized cache lines and have 8 bytes left over. Buckets would be protected by seqlocks stored in the extra space. On the read side we would only need 2 read barriers (basically free on x86), and we are guaranteed to have an answer by fetching two pairs of cache lines. We can trade memory bandwidth for latency by issuing prefetches (once we add primitives for this). Alternatively we can trade insert speed for lookup speed by using asymmetrically sized tables. [1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Actually, I'd go with second-chance hashing [2], same number of hash functions but it's more stable (no infinite loops, for example). 
Most probably the techniques from [1] would apply equally well.

[2] http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf

Ryan
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-14 17:53:10 -0400, Robert Haas wrote: On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund and...@2ndquadrant.com wrote: The code in CHashSearch shows the problem there: you need to STORE the hazard pointer before you begin to do the LOAD operations to scan the bucket, and you must finish all of those LOADs before you STORE the NULL hazard pointer. A read or write barrier won't provide store/load or load/store ordering. I'm not sure I understand the problem here - but a read + write barrier should suffice? To avoid falling back to two full barriers, we'd need a separate define pg_read_write_barrier(), but that's not a problem. IIUC that should allow us to avoid emitting any actual code on x86. Well, thinking about x86 specifically, it definitely needs at least one mfence, after setting the hazard pointer and before beginning to read the bucket chain. Yes, I can see that now. I do wonder if that's not going to prove quite expensive... I think I can see some ways around needing that. But I think that needs some benchmarking first - no need to build an even more complex solution if not needed. The solution I'm thinking of is to essentially move away from hazard pointers and store something like a generation counter per backend. Which is updated less often, and in combination with other operations. When a backend needs to clean up and sees that there's a straggler with an old generation it sends that backend a signal to ensure it sets the latest generation. It's possible that might work ... but on the timescale we're talking about here, that's asking the garbage collecting process to wait for practically geologic time. Back when I first wrote this code, I spent a fair amount of time looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a threaded model - or more precisely, it assumes that shared memory allocation is cheap and easy and you'll have no problem getting as much of it as you need whenever you need it. This assumption often isn't even spelled out explicitly: it's just assumed that that's the programming environment you're working in. Finding highly parallel algorithms that don't rely on memory allocation as a primitive is hard. Hazard pointers are one of the few techniques I found that seems like it can work in our architecture. I'm quite reluctant to leave aside techniques that have been proven to work well in favor of inventing something entirely novel to PostgreSQL. That having been said, there is some literature on generation numbers, and I think something like that could be made to work. It does have some significant disadvantages, though. One is that a single process which fails to update its generation number prevents all reclamation, system-wide. In my algorithm, a process whose execution is suspended while a hazard pointer is set prevents recycling of only one of many garbage lists. A process searching for a reusable element can most likely find some other garbage list to reclaim instead. Also, a generation number implies that we update the value periodically, rather than before and after each critical section. Anything that forces garbage collection to be postponed longer than absolutely necessary seems to me likely to be a loser. It might be a little faster as long as we have free elements to allocate, but as soon as we're out and have to wait longer than we otherwise would for garbage collection, and all system activity halts as a result, even for a few milliseconds, it's going to be a whole lot slower. Or at least, so I think. That having been said, I don't know what to do about the fact that the fence is too expensive.
I don't know that we've really established that that's the true root of the problem rather than some other pedestrian optimization failure. But the existing code is using an atomic operation to acquire a spinlock, then releasing it, walking the bucket chain, and then using another atomic operation to acquire a spinlock and then releasing it again. Surely a pure fence shouldn't cost more than a spinlock cycle? Even with arguably one extra cache line touch, that seems like it ought to be a win. But my intuitions in this area are shaky. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
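For readers following the barrier discussion, the store/load handshake at issue can be sketched as a toy hazard-pointer protocol. This is an illustrative sketch using GCC's __atomic builtins, not the actual chash.c code, and all names are made up; the essential point is the full fence between publishing the hazard pointer and re-reading the bucket link:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BACKENDS 8

/* One hazard-pointer slot per backend (illustrative layout). */
static void * volatile hazard[MAX_BACKENDS];

/* Reader side: publish the node we are about to dereference, then force a
 * StoreLoad ordering point before re-checking that it is still reachable.
 * A plain read or write barrier is not enough here: the hazard-pointer
 * STORE must be globally visible before the bucket-chain LOADs begin,
 * which on x86 is exactly the case needing an mfence. */
void *protect(int backend, void * volatile *link)
{
    void *node;
    for (;;) {
        node = __atomic_load_n(link, __ATOMIC_ACQUIRE);
        __atomic_store_n(&hazard[backend], node, __ATOMIC_SEQ_CST);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);   /* mfence on x86 */
        /* recheck: if the link still points at node, our hazard pointer
         * was visible before any concurrent unlink could have freed it */
        if (__atomic_load_n(link, __ATOMIC_ACQUIRE) == node)
            return node;
    }
}

/* Reader is done; all its LOADs of the node must precede this store. */
void release(int backend)
{
    __atomic_store_n(&hazard[backend], NULL, __ATOMIC_RELEASE);
}

/* Reclaimer side: a node may be recycled only if no backend has
 * published it as hazardous. */
int can_reclaim(void *node)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (__atomic_load_n(&hazard[i], __ATOMIC_ACQUIRE) == node)
            return 0;
    return 1;
}
```

The generation-counter alternative being debated removes the per-lookup fence by publishing a coarser "I might be reading" marker, at the cost of delaying reclamation.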
Re: [HACKERS] Buffer Requests Trace
On Wed, Oct 15, 2014 at 6:22 AM, Lucas Lersch lucasler...@gmail.com wrote: So is it a possible normal behavior that running tpcc for 10min only access 50% of the database? Furthermore, is there a guideline of parameters for tpcc (# of warehouses, execution time, operations weight)? I'm not familiar with your benchmarking tool. With the one I am most familiar with, pgbench, if you run it against a database which is too big to fit in memory, it can take a very long time to touch each page once, because the constant random disk reads makes it run very slowly. Maybe that is something to consider here--how many transactions were actually executed during your 10 min run? Also, the tool might build tables that are only used under certain run options. Perhaps you just aren't choosing the options which invoke usage of those tables. Since you have the trace data, it should be pretty easy to count how many distinct blocks are accessed from each relation, and compare that to the size of the relations to see which relations are unused or lightly used. Cheers, Jeff
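Since the trace data is already on hand, the per-relation distinct-block count Jeff suggests can be sketched in a few lines of C. The trace format below is made up for illustration (a real script would parse whatever the tracing patch actually emits):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy sketch: count distinct block numbers touched for one relation in a
 * trace of (relation oid, block number) pairs.  MAX_BLOCKS and the in-memory
 * "trace" array are illustrative simplifications. */
#define MAX_BLOCKS 4096

typedef struct { uint32_t rel; uint32_t block; } TraceEntry;

uint32_t distinct_blocks(const TraceEntry *trace, int n, uint32_t rel)
{
    static uint8_t seen[MAX_BLOCKS / 8];   /* bitmap of blocks seen */
    memset(seen, 0, sizeof(seen));
    uint32_t count = 0;
    for (int i = 0; i < n; i++) {
        if (trace[i].rel != rel || trace[i].block >= MAX_BLOCKS)
            continue;
        uint8_t mask = (uint8_t) (1u << (trace[i].block % 8));
        if (!(seen[trace[i].block / 8] & mask)) {
            seen[trace[i].block / 8] |= mask;
            count++;
        }
    }
    return count;   /* compare against the relation's size to get coverage */
}
```

Dividing this count by the relation's size in blocks gives the coverage figure being questioned (the 50% of the database touched in 10 minutes).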
Re: [HACKERS] Column Redaction
On Wed, Oct 15, 2014 at 4:04 AM, Simon Riggs si...@2ndquadrant.com wrote: On 14 October 2014 17:43, Robert Haas robertmh...@gmail.com wrote: On Sat, Oct 11, 2014 at 3:40 AM, Simon Riggs si...@2ndquadrant.com wrote: As soon as you issue the above query, you have clearly indicated your intention to steal. Receiving information is no longer accidental, it is an explicit act that is logged in the auditing system against your name. This is sufficient to bury you in court and it is now a real deterrent. Redaction has worked. To me, this feels thin. It's true that this might be good enough for some users, but I wouldn't bet on it being good enough for very many users, and I really hope there's a better option. We have an existing method of doing data redaction via security barrier views. I agree with thin. There is a leak in the design, so let me coin the phrase imprecise security. Of course, the leaks reduce the value of such a feature; they just don't reduce it all the way to zero. Security barrier views or views of any kind don't do the required job. We are not able to easily classify people as Trusted or Untrusted. We're seeking to differentiate between the right to use a column for queries and the right to see the value itself. Or put another way, you can read the book, you just can't photocopy it and take the copy home. Or, you can try on the new clothes to see if they fit, but you can't take them home for free. Both of those examples have imprecise security measures in place to control and reduce negative behaviours and in every other industry this is known as security. In IT terms, we're looking at controlling and reducing improper access to data by an otherwise Trusted person. The only problem is that some actions on data items are allowed, others are not. Sure, I don't disagree with any of that as a general principle. 
I just think we should look for some ways of shoring up your proposal against some of the more obvious attacks, so as to have more good and less bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 15 October 2014 17:03, Robert Haas robertmh...@gmail.com wrote: Well, I'm fervently in agreement with you on one point: the first version of all this needs to be as simple as possible, or the time to get to the first version will be longer than we can afford to wait. I think what we're discussing here is which things are important enough that it makes sense to have them in the first version, and which things can wait. I'm guessing we might differ slightly on what constitutes "as simple as possible". Something usable, with severe restrictions, is actually better than we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task. Might I gently enquire what the "something usable" is that we are going to see in this release? I'm not up on current plans. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Column Redaction
On 15 October 2014 19:46, Robert Haas robertmh...@gmail.com wrote: In IT terms, we're looking at controlling and reducing improper access to data by an otherwise Trusted person. The only problem is that some actions on data items are allowed, others are not. Sure, I don't disagree with any of that as a general principle. I just think we should look for some ways of shoring up your proposal against some of the more obvious attacks, so as to have more good and less bad. Suggestions welcome. I'm not in a rush to implement this, so we have time to mull it over. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Column Redaction
On Sat, Oct 11, 2014 at 4:40 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 October 2014 16:45, Rod Taylor rod.tay...@gmail.com wrote: Redaction prevents accidental information loss only, forcing any loss that occurs to be explicit. It ensures that loss of information can be tied clearly back to an individual, like an ink packet that stains the fingers of a thief. That is not true. It can only be tied to a session. That's very far from an individual in court terms, if you ask a lawyer. You need a helluva lot more to tie that to an individual. Redaction clearly relies completely on auditing before it can have any additional effect. And the effectiveness of redaction needs to be understood next to Rod's example. It forces you to audit all of the queries issued by the otherwise trusted user. That is, I believe, a far from optimal design. When you have to audit everything, you end up auditing nothing, a haystack of false positives can easily hide the needle that is the true positive. What you want, is something that allows selective auditing of leak-prone queries. But we've seen that joining is already a leak-prone query, so clearly you cannot allow simple joining if you want the above. What I propose, needs a schema change and some preparedness from the DBA. But, how can you assume that to be asking too much and not say the same from thorough auditing? So, what I propose, is to require explicit separation of concepts at the schema level. On Sat, Oct 11, 2014 at 10:43 AM, Bruce Momjian br...@momjian.us wrote: For example, for a credit card type, you would output the last four digits, but is there any value to storing the non-visible digits? You can check the checksum of the digits, but that can be done on input and doesn't require the storage of the digits. Is there some function we could provide that would make that data type useful? Could we provide comparison functions with delays or increasing delays? 
Basically, as said above, the point is to provide a data type that is nigh-useless. Imagine a redacted card number as a tuple (full_value_id, suffix). Suffix is in cleartext, and full_value_id is just an id pointing to a lookup table for the type. Regular users can read any redacted_number column, but will only get the id (useless unless they already know what that prefix is) and the suffix. Format for that type would be suffix, and would serve the purpose of the OP: it can be joined (equal value = equal id). Moreover, the type can be designed in one of two ways: equal values contain equal ids, or salted values, where even equal values generated from different computations (ie: not copied) have different ids. This second mode would be the most secure, albeit a tad hard to use perhaps. But it would allow joining and everything. Only users that have access to the lookup table would be allowed to resolve the full value, with a non-security-defining function like: extract_full_value(redacted_number) Then you can audit all queries against the lookup table, and you have rather strong security IMHO. This can all be done without any new features to postgres. Maybe you can add syntactic sugar, but you don't really need anything in core to accomplish the above. The syntactic sugar can take the form of a new data type family (like enum?) where you specify the redaction function, redacted data type, output format, and from there everything else works automagically, with an extract_full(any) -> any function that somehow knows what to do. On Wed, Oct 15, 2014 at 3:57 PM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 19:46, Robert Haas robertmh...@gmail.com wrote: In IT terms, we're looking at controlling and reducing improper access to data by an otherwise Trusted person. The only problem is that some actions on data items are allowed, others are not. Sure, I don't disagree with any of that as a general principle.
I just think we should look for some ways of shoring up your proposal against some of the more obvious attacks, so as to have more good and less bad. Suggestions welcome. I'm not in a rush to implement this, so we have time to mull it over. Does the above work for your intended purposes? Hard to know from what you've posted until now, but I believe it does.
Re: [HACKERS] Proposal for better support of time-varying timezone abbreviations
... and here is a draft patch for the timezone abbreviation data files. I changed all the abbreviations for which the parent zone had used more than one GMT offset since 1970. That seemed like a good cutoff to avoid wasting cycles on ancient history, especially since the IANA people themselves don't make any large promises about the quality of their data before 1970. Although zones in the Russian Federation are the majority of zones that had abbreviation changes, a quick look at this patch shows that they're hardly the only ones. We've been sticking our heads in the sand about this problem for quite a while :-( regards, tom lane diff --git a/src/timezone/tznames/America.txt b/src/timezone/tznames/America.txt index 54b51fe..9e62732 100644 *** a/src/timezone/tznames/America.txt --- b/src/timezone/tznames/America.txt *** AMT-14400# Amazon Time *** 47,53 # (America/Cuiaba) # (America/Manaus) # (America/Porto_Velho) ! ART-10800# Argentina Time # (America/Argentina/Buenos_Aires) # (America/Argentina/Cordoba) # (America/Argentina/Tucuman) --- 47,53 # (America/Cuiaba) # (America/Manaus) # (America/Porto_Velho) ! ARTAmerica/Argentina/Buenos_Aires # Argentina Time # (America/Argentina/Buenos_Aires) # (America/Argentina/Cordoba) # (America/Argentina/Tucuman) *** ART-10800# Argentina Time *** 58,64 # (America/Argentina/Mendoza) # (America/Argentina/Rio_Gallegos) # (America/Argentina/Ushuaia) ! ARST-7200 D # Argentina Summer Time # CONFLICT! AST is not unique # Other timezones: # - AST: Arabic Standard Time (Asia) --- 58,64 # (America/Argentina/Mendoza) # (America/Argentina/Rio_Gallegos) # (America/Argentina/Ushuaia) ! ARSTAmerica/Argentina/Buenos_Aires # Argentina Summer Time # CONFLICT! AST is not unique # Other timezones: # - AST: Arabic Standard Time (Asia) *** GMT 0# Greenwich Mean Time *** 228,234 # (Etc/GMT) # (Europe/Dublin) # (Europe/London) ! 
GYT-14400# Guyana Time # (America/Guyana) HADT -32400 D # Hawaii-Aleutian Daylight Time # (America/Adak) --- 228,234 # (Etc/GMT) # (Europe/Dublin) # (Europe/London) ! GYTAmerica/Guyana # Guyana Time # (America/Guyana) HADT -32400 D # Hawaii-Aleutian Daylight Time # (America/Adak) *** PST-28800# Pacific Standard Time *** 285,299 # (Pacific/Pitcairn) PYST -10800 D # Paraguay Summer Time # (America/Asuncion) ! PYT-14400# Paraguay Time # (America/Asuncion) ! SRT-10800# Suriname Time # (America/Paramaribo) UYST-7200 D # Uruguay Summer Time # (America/Montevideo) UYT-10800# Uruguay Time # (America/Montevideo) ! VET-16200# Venezuela Time (caution: this used to mean -14400) # (America/Caracas) WGST-7200 D # Western Greenland Summer Time # (America/Godthab) --- 285,299 # (Pacific/Pitcairn) PYST -10800 D # Paraguay Summer Time # (America/Asuncion) ! PYTAmerica/Asuncion # Paraguay Time # (America/Asuncion) ! SRTAmerica/Paramaribo # Suriname Time # (America/Paramaribo) UYST-7200 D # Uruguay Summer Time # (America/Montevideo) UYT-10800# Uruguay Time # (America/Montevideo) ! VETAmerica/Caracas # Venezuela Time # (America/Caracas) WGST-7200 D # Western Greenland Summer Time # (America/Godthab) diff --git a/src/timezone/tznames/Antarctica.txt b/src/timezone/tznames/Antarctica.txt index 5a03250..2359020 100644 *** a/src/timezone/tznames/Antarctica.txt --- b/src/timezone/tznames/Antarctica.txt *** CLST -10800 D # Chile Summer Time *** 16,26 CLT-14400# Chile Time # (America/Santiago) # (Antarctica/Palmer) ! DAVT25200# Davis Time (Antarctica) # (Antarctica/Davis) DDUT36000# Dumont-d`Urville Time (Antarctica) # (Antarctica/DumontDUrville) ! MAWT18000
Re: [HACKERS] replicating DROP commands across servers
On 2014-10-04 21:12:24 -0400, Robert Haas wrote: On Fri, Oct 3, 2014 at 4:58 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Robert Haas wrote: I'm not really very convinced that it's a good idea to expose this instead of just figuring out a way to parse the object identity. That's the first thing I tried. But it's not pretty: you have to extract schema names by splitting at a period (and what if a schema name contains a period?), Please tell me that the literals are escaped if necessary. If so, this is pretty easy. quote_literal() is not a hard transformation to reverse, and splitting on an unquoted period is not hard... split out on ON for certain object types, ...nor is splitting on any other fixed text string, such as ON . I'm not following here. Maybe just because I'm misunderstanding your position. The patch imo consists of the following parts: 1) Addition of dependency information to the dropped object list 2) Actual get_object_address() handling for default values - the current behaviour looks pretty borked to me. 3) The reverse of getObjectTypeDescription() 4) getObjectIdentityParts() - a slightly more detailed version of getObjectIdentity() that requires less string parsing 5) drop event trigger support for a few types. Are you saying you want to add a function to do 4) via parsing inside postgres or are you suggesting to do that in every user of this facility? If the former, why would it be preferable to do string parsing if we have access to the source data? That part of the patch looks trivial to me? If the latter, I don't see the advantage either - this is complicated enough, why should different users repeat the work? Am I misunderstanding you here? Having reread the patch just now I basically see two things to criticize: a) why isn't this accessible at SQL level? That seems easy to address. b) Arguably some of this could well be done in separate commits. It's just not sane to try to parse such text strings.
But this is a pretty ridiculous argument. We have an existing parser that does it just fine Which happens to be the part of postgres that's copied most often. So it's certainly not something appearing to be trivial. , and a special-purpose parser that does just that (and not all of the other stuff that the main parser does) would be a great deal simpler. That parser also happens to be far from trivial (if we're talking about parseNameAndArgTypes() - which solves just one of the cases). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] replicating DROP commands across servers
Andres Freund wrote: Having reread the patch just now I basically see two things to criticize: a) why isn't this accessible at SQL level? That seems easy to address. b) Arguably some of this could well be done in separate commits. Fair comments. I will split it up. FWIW, I spent some time today fighting with this stuff but the OBJECT_POLICY stuff has been causing me some trouble. I'm not sure that what's there currently in get_object_address is completely sane -- for one thing I'm unsure that tables are the only object kind in the system that is subject to policies. In that case, we have a shortage of abstraction here, it seems, which will need some additional fixes. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] jsonb generator functions
On 10/15/2014 07:38 AM, Pavel Stehule wrote: 2014-10-13 17:22 GMT+02:00 Andrew Dunstan and...@dunslane.net: On 10/13/2014 09:37 AM, Andrew Dunstan wrote: On 09/26/2014 04:54 PM, Andrew Dunstan wrote: Here is a patch for the generator and aggregate functions for jsonb that we didn't manage to get done in time for 9.4. They are all equivalents of the similarly named json functions. Included are to_jsonb, jsonb_build_object, jsonb_build_array, jsonb_object, jsonb_agg, jsonb_object_agg. Still to come: documentation. Adding to the next commitfest. Revised patch to fix compiler warnings. And again, initializing an incompletely initialized variable, as found by Pavel Stehule. I checked the code, and I have only two small objections - the name jsonb_object_two_arg is not good - maybe json_object_keys_values? It's consistent with the existing json_object_two_arg. In all cases I think I kept the names the same except for changing json to jsonb. Note that these _two_arg functions are not visible at the SQL level - they are only visible in the C code. I'm happy to be guided by others in changing or keeping these names. Next: there are no tests for the to_jsonb function. Oh, my bad. I'll add some. Thanks for the review. cheers andrew
Re: [HACKERS] Column Redaction
On 15 October 2014 20:41, Claudio Freire klaussfre...@gmail.com wrote: On Sat, Oct 11, 2014 at 4:40 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 October 2014 16:45, Rod Taylor rod.tay...@gmail.com wrote: Redaction prevents accidental information loss only, forcing any loss that occurs to be explicit. It ensures that loss of information can be tied clearly back to an individual, like an ink packet that stains the fingers of a thief. That is not true. It can only be tied to a session. That's very far from an individual in court terms, if you ask a lawyer. You need a helluva lot more to tie that to an individual. So you're familiar then with this process? So you know that an auditor would trigger an investigation, resulting in deeper surveillance and gathering of evidence that ends with various remedial actions, such as court. How would that process start then, if not this way? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Column Redaction
On Wed, Oct 15, 2014 at 4:59 PM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 20:41, Claudio Freire klaussfre...@gmail.com wrote: On Sat, Oct 11, 2014 at 4:40 AM, Simon Riggs si...@2ndquadrant.com wrote: On 10 October 2014 16:45, Rod Taylor rod.tay...@gmail.com wrote: Redaction prevents accidental information loss only, forcing any loss that occurs to be explicit. It ensures that loss of information can be tied clearly back to an individual, like an ink packet that stains the fingers of a thief. That is not true. It can only be tied to a session. That's very far from an individual in court terms, if you ask a lawyer. You need a helluva lot more to tie that to an individual. So you're familiar then with this process? So you know that an auditor would trigger an investigation, resulting in deeper surveillance and gathering of evidence that ends with various remedial actions, such as court. How would that process start then, if not this way? I've seen lots of such investigations fail because the evidence wasn't strong enough to link to a particular person, but rather a computer terminal or something like that. Unless you also physically restrict access to such terminal to a single person through other means (which is quite uncommon practice except perhaps in banks), that evidence is barely circumstantial. But you'd have to ask a lawyer in your country to be sure. I can only speak for my own experiences in my own country which is probably not yours nor has the same laws. Law is a complex beast. So, you really want actual information security in addition to that deterrent you speak of. I don't say the deterrent is bad, I only say it's not good enough on its own.
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge guilla...@lelarge.info wrote: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers who have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula: greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 ) I think the first bug is even having this formula in the documentation to start with, and in trying to use it. and will normally not be more than... This may be normal for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past. So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct. I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer. Cheers, Jeff
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-15 13:41:25 -0400, Robert Haas wrote: On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund and...@2ndquadrant.com wrote: The solution I'm thinking of is to essentially move away from hazard pointers and store something like a generation counter per backend. Which is updated less often, and in combination with other operations. When a backend needs to clean up and sees that there's a straggler with an old generation, it sends that backend a signal to ensure it sets the latest generation. It's possible that might work ... but on the timescale we're talking about here, that's asking the garbage collecting process to wait for practically geologic time. I think it depends on what we're tying the generation increase to. We could very well add an implicit generation increment to, say, lwlock acquisition - which is a full barrier anyway. And I don't think we'll normally have that high a rate of garbage collection anyway - there should be plenty of headroom. Back when I first wrote this code, I spent a fair amount of time looking at existing work in the area of lock-free hash tables. Essentially all of the work I was able to find on this topic assumes a threaded model - or, more precisely, it assumes that shared memory allocation is cheap and easy and you'll have no problem getting as much of it as you need whenever you need it. FWIW, I think many of the solutions that are actually used in practice don't rely on that heavily. I know at least some that require memory to be reserved beforehand, in special contexts. This assumption often isn't even spelled out explicitly: it's just assumed that that's the programming environment you're working in. Finding highly parallel algorithms that don't rely on memory allocation as a primitive is hard. Hazard pointers are one of the few techniques I found that seem like they can work in our architecture. 
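For readers following along, the hazard-pointer discipline under discussion can be sketched with C11 atomics roughly as follows. Names and structure are hypothetical, not the actual CHash code; the point is the seq_cst fence between publishing the pointer and re-reading the bucket:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct HashNode
{
	int			key;
	_Atomic(struct HashNode *) next;
} HashNode;

static _Atomic(HashNode *) bucket_head;	/* a single bucket, for brevity */
static _Atomic(HashNode *) my_hazard;	/* this backend's hazard slot */

HashNode *
hazard_search(int key)
{
	HashNode   *node = atomic_load(&bucket_head);

	/* Publish the hazard pointer ... */
	atomic_store(&my_hazard, node);

	/*
	 * ... and force store/load ordering: the store above must become
	 * globally visible before the re-read below.  This fence is where the
	 * mfence comes from on x86; read or write barriers alone do not give
	 * store/load ordering.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	/* If the head moved while we were publishing, a real implementation
	 * would retry; the sketch just gives up. */
	if (node != atomic_load(&bucket_head))
		return NULL;

	while (node != NULL && node->key != key)
		node = atomic_load(&node->next);

	/* Clear the slot only after the last load from the chain. */
	atomic_store_explicit(&my_hazard, NULL, memory_order_release);
	return node;
}
```

A reclaimer has to check every backend's hazard slot before recycling a node; the fence is what guarantees it cannot miss a concurrently published pointer.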
I'm quite reluctant to leave aside techniques that have been proven to work well in favor of inventing something entirely novel to PostgreSQL. I don't think something like generation numbers is really that new and unproven - it's essentially a more trivial version of RCU, which is used quite heavily, and under different names. That said, I'm far from convinced that they are the solution - they were just the first thing that came to my mind thinking about the problem. That having been said, there is some literature on generation numbers, and I think something like that could be made to work. It does have some significant disadvantages, though. One is that a single process which fails to update its generation number prevents all reclamation, system-wide. In my algorithm, a process whose execution is suspended while a hazard pointer is set prevents recycling of only one of many garbage lists. A process searching for a reusable element can most likely find some other garbage list to reclaim instead. Hm. Couldn't a similar scheme with several lists be used with generation numbers? Also, a generation number implies that we update the value periodically, rather than before and after each critical section. Hm. I don't think it necessarily has to do that. Anything that forces garbage collection to be postponed longer than absolutely necessary seems to me likely to be a loser. It might be a little faster as long as we have free elements to allocate, but as soon as we're out and have to wait longer than we otherwise would for garbage collection, and all system activity halts as a result, even for a few milliseconds, it's going to be a whole lot slower. Or at least, so I think. I think it really depends on the user of the facility. If the generation were e.g. also tied to lwlocks, I couldn't see bufmgr.c outrunning that realistically. That having been said, I don't know what to do about the fact that the fence is too expensive. 
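A generation-number scheme of the kind Andres sketches might look, very roughly, like the following epoch-based-reclamation sketch. All names are hypothetical; real code would tie note_quiescent() to existing full-barrier points such as lwlock acquisition, and would signal stragglers rather than merely refusing to reclaim:

```c
#include <stdatomic.h>

#define NBACKENDS 8

static _Atomic unsigned long global_gen = 1;
static _Atomic unsigned long backend_gen[NBACKENDS];	/* 0 = quiescent/idle */

/* Called at natural serialization points (e.g. around lwlock acquisition,
 * which is a full barrier anyway, as suggested above). */
void
note_quiescent(int backend)
{
	atomic_store(&backend_gen[backend], atomic_load(&global_gen));
}

/* Unlinking an element records the generation at which it became
 * unreachable; readers that started before the bump may still see it. */
unsigned long
unlink_gen(void)
{
	return atomic_fetch_add(&global_gen, 1);
}

/* Garbage unlinked at generation g may be reclaimed once every non-idle
 * backend has advanced past g. */
int
can_reclaim(unsigned long g)
{
	for (int i = 0; i < NBACKENDS; i++)
	{
		unsigned long seen = atomic_load(&backend_gen[i]);

		if (seen != 0 && seen <= g)
			return 0;			/* straggler: real code would signal it */
	}
	return 1;
}
```

The appeal is that readers pay only a plain store at points that are already barriers; the cost, as Robert notes, is that one stalled backend can hold back reclamation until it is prodded.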
I'm far from sure that it's the problem at hand here - the reason I'm wondering about the barriers is primarily that the buffer mapping hash table is one of the top profile entries in all concurrent workloads. Often even the top one, after removing some locking bottleneck. Nearly all of the time is spent there due to cache misses. While I think we can get the table smaller and more efficient for the same NBuffers value, realistically we need to cope with much bigger NBuffers values. Since cache misses are a problem that we're going to have to deal with, restricting the processor's tools for efficiently dealing with them (out-of-order execution, SMT) doesn't seem like a wise choice. I don't know that we've really established that that's the true root of the problem rather than some other pedestrian optimization failure. Absolutely. But the existing code is using an atomic operation to acquire a spinlock, then releasing it, walking the bucket chain, and then using another atomic operation to acquire a spinlock and then releasing it again. Well, we don't do that for lookups right now.
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
2014-10-15 22:11 GMT+02:00 Jeff Janes jeff.ja...@gmail.com: On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge guilla...@lelarge.info wrote: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers that have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the setting of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula: greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 ) I think the first bug is even having this formula in the documentation to start with, and in trying to use it. I agree. But we have customers asking how to compute the right size for their WAL file system partitions. "Right size" is usually a euphemism for "smallest size", and they usually tend to get it wrong, leading to huge issues. And I'm not even speaking of monitoring and alerting. A way to avoid this issue is probably to erase the formula from the documentation, and find a new way to explain to them how to size their partitions for WALs. Monitoring is another matter, and I don't really think a monitoring solution should count the WAL files. What actually matters is the database availability, and that is covered by having enough disk space in the WALs partition. and will normally not be more than... This may be normal for a toy system. I think that the normal state for any system worth monitoring is that it has had load spikes at some point in the past. Agreed. So it is the next part of the doc, which describes how many segments it climbs back down to upon recovering from a spike, which is the important one. And that doesn't mention wal_keep_segments at all, which surely cannot be correct. Agreed too. 
I will try to independently derive the correct formula from the code, as you did, without looking too much at your derivation first, and see if we get the same answer. Thanks. I look forward to reading what you find. What seems clear to me right now is that no one has a sane explanation of the formula. Though yours definitely made sense, it didn't seem to match what the code does. -- Guillaume. http://blog.guillaume.lelarge.info http://www.dalibo.com
Re: [HACKERS] jsonb generator functions
On 10/15/2014 03:54 PM, I wrote: On 10/15/2014 07:38 AM, Pavel Stehule wrote: I checked the code, and I have only two small objections - the name jsonb_object_two_arg is not good - maybe json_object_keys_values? It's consistent with the existing json_object_two_arg. In all cases I think I kept the names the same except for changing json to jsonb. Note that these _two_arg functions are not visible at the SQL level - they are only visible in the C code. I'm happy to be guided by others in changing or keeping these names. If we really want to change the name of json_object_two_arg, it would probably be best to change it NOW in 9.4, before it gets out into a production release at all. Thoughts? cheers andrew
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On 10/15/2014 01:25 PM, Guillaume Lelarge wrote: Monitoring is another matter, and I don't really think a monitoring solution should count the WAL files. What actually matters is the database availability, and that is covered by having enough disk space in the WALs partition. If we don't count the WAL files, though, that eliminates the best way to detect when archiving is failing. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] jsonb generator functions
Andrew Dunstan wrote: If we really want to change the name of json_object_two_arg, it would probably be best to change it NOW in 9.4 before it gets out into a production release at all. Doesn't it require initdb? If so, I think it's too late now. -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] jsonb generator functions
On 10/15/2014 05:47 PM, Alvaro Herrera wrote: Andrew Dunstan wrote: If we really want to change the name of json_object_two_arg, it would probably be best to change it NOW in 9.4 before it gets out into a production release at all. Doesn't it require initdb? If so, I think it's too late now. Yeah, you're right, it would. OK, forget that. cheers andrew
Re: [HACKERS] replicating DROP commands across servers
Alvaro Herrera wrote: Andres Freund wrote: Having reread the patch just now I basically see two things to criticize: a) why isn't this accessible at SQL level? That seems easy to address. b) Arguably some of this could well be done in separate commits. Fair comments. I will split it up. Here's a split version. The last part is still missing some polish -- in particular handling for OBJECT_POLICY, and the SQL interface which would also allow us to get something in the regression tests. Note: in this patch series you can find the ObjectTypeMap thing that you thought was obsolete in the DDL deparse patch ... -- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

From 92816868c1c717519a76a83e65cd0b48e3726fbf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Fri, 4 Apr 2014 16:05:48 -0300
Subject: [PATCH 1/3] add normal/original flags to pg_event_trigger_dropped_objects
---
 doc/src/sgml/func.sgml                      | 13 ++
 src/backend/catalog/dependency.c            | 21 ++-
 src/backend/commands/event_trigger.c        | 16 +---
 src/include/catalog/pg_proc.h               |  2 +-
 src/include/commands/event_trigger.h        |  3 ++-
 src/test/regress/expected/event_trigger.out | 40 +
 src/test/regress/sql/event_trigger.sql      | 30 ++
 7 files changed, 114 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..45f3efa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17583,6 +17583,19 @@ FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
         <entry>Object sub-id (e.g.
         attribute number for columns)</entry>
       </row>
       <row>
+       <entry><literal>original</literal></entry>
+       <entry><type>bool</type></entry>
+       <entry>Flag used to identify the root object of the deletion</entry>
+      </row>
+      <row>
+       <entry><literal>normal</literal></entry>
+       <entry><type>bool</type></entry>
+       <entry>
+        Flag indicating that there's a normal dependency relationship
+        in the dependency graph leading to this object
+       </entry>
+      </row>
+      <row>
        <entry><literal>object_type</literal></entry>
        <entry><type>text</type></entry>
        <entry>Type of the object</entry>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 256486c..6485e3d 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -205,16 +205,25 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
 	/*
 	 * Keep track of objects for event triggers, if necessary.
 	 */
-	if (trackDroppedObjectsNeeded())
+	if (trackDroppedObjectsNeeded() && !(flags & PERFORM_DELETION_INTERNAL))
 	{
 		for (i = 0; i < targetObjects->numrefs; i++)
 		{
-			ObjectAddress *thisobj = targetObjects->refs + i;
-
-			if ((!(flags & PERFORM_DELETION_INTERNAL)) &&
-				EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
+			const ObjectAddress *thisobj = &targetObjects->refs[i];
+			const ObjectAddressExtra *extra = &targetObjects->extras[i];
+			bool		original = false;
+			bool		normal = false;
+
+			if (extra->flags & DEPFLAG_ORIGINAL)
+				original = true;
+			if (extra->flags & DEPFLAG_NORMAL)
+				normal = true;
+			if (extra->flags & DEPFLAG_REVERSE)
+				normal = true;
+
+			if (EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
 			{
-				EventTriggerSQLDropAddObject(thisobj);
+				EventTriggerSQLDropAddObject(thisobj, original, normal);
 			}
 		}
 	}
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 1b8c94b..0bab971 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -112,6 +112,8 @@ typedef struct SQLDropObject
 	const char *objname;
 	const char *objidentity;
 	const char *objecttype;
+	bool		original;
+	bool		normal;
 	slist_node	next;
 }
 SQLDropObject;
@@ -1105,7 +1107,7 @@ trackDroppedObjectsNeeded(void)
  * Register one object as being dropped by the current command.
  */
 void
-EventTriggerSQLDropAddObject(ObjectAddress *object)
+EventTriggerSQLDropAddObject(const ObjectAddress *object, bool original, bool normal)
 {
 	SQLDropObject *obj;
 	MemoryContext oldcxt;
@@ -1124,6 +1126,8 @@ EventTriggerSQLDropAddObject(ObjectAddress *object)
 	obj = palloc0(sizeof(SQLDropObject));
 	obj->address = *object;
+	obj->original = original;
+	obj->normal = normal;

 	/*
 	 * Obtain schema names from the object's catalog tuple, if one exists;
@@ -1251,8 +1255,8 @@ pg_event_trigger_dropped_objects(PG_FUNCTION_ARGS)
 	{
 		SQLDropObject *obj;
 		int			i = 0;
-		Datum		values[7];
-		bool		nulls[7];
+		Datum		values[9];
+		bool		nulls[9];

 		obj = slist_container(SQLDropObject, next, iter.cur);
@@ -1268,6 +1272,12 @@ pg_event_trigger_dropped_objects(PG_FUNCTION_ARGS)
 		/* objsubid */
 		values[i++] =
Re: [HACKERS] Column Redaction
On 15 October 2014 21:03, Claudio Freire klaussfre...@gmail.com wrote: So you're familiar then with this process? So you know that an auditor would trigger an investigation, resulting in deeper surveillance and gathering of evidence that ends with various remedial actions, such as court. How would that process start then, if not this way? I've seen lots of such investigations fail because the evidence wasn't strong enough to link to a particular person, but rather to a computer terminal or something like that. So your solution to the evidence problem is to do nothing? Or do you have a better suggestion? Nothing is certain, apart from doing nothing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: Stephen Frost wrote: * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: We lost this patch for the October commitfest, didn't we? I'm guessing you missed that a new version just got submitted..? Which one, reindex schema? Isn't that a completely different patch? Err, sorry, wasn't looking closely enough, evidently. :/ I'd be fine with today's being added to the October commitfest.. Of course, there's a whole independent discussion to be had about how there wasn't any break between the last commitfest and this one, but that probably deserves its own thread. It's not the first time that has happened, and honestly I don't see all that much cause for concern. Heikki did move pending patches to the current one, and closed a lot of inactive ones as 'returned with feedback'. Inactive due to lack of review is the concern, but there is also a concern that it's intended as a way to ensure committers have time to work on their own patches instead of just working on patches submitted through the commitfest process. Now, I think we all end up trying to balance making progress on our own patches while also providing help to the commitfest, but that's the situation we were in constantly before the commitfest process was put in place, because it didn't scale very well. If we're always in 'commitfest' mode then we might as well eliminate the notion of timing them. Attentive patch authors should have submitted new versions ... if they don't, then someone else with an interest in the patch should do so. If no one updates the patches, what do we want them for? As for this, sure, if there's a review and no response then it's fair to mark the patch as returned with feedback. The issue is both when no patch gets a review and when the commitfest never ends. Thanks, Stephen
Re: [HACKERS] Column Redaction
On Wed, Oct 15, 2014 at 8:59 PM, Simon Riggs si...@2ndquadrant.com wrote: On 15 October 2014 21:03, Claudio Freire klaussfre...@gmail.com wrote: So you're familiar then with this process? So you know that an auditor would trigger an investigation, resulting in deeper surveillance and gathering of evidence that ends with various remedial actions, such as court. How would that process start then, if not this way? I've seen lots of such investigations fail because the evidence wasn't strong enough to link to a particular person, but rather a computer terminal or something like that. So your solution to the evidence problem is to do nothing? Or you have a better suggestion? Nothing is certain, apart from doing nothing. Is solving the evidence problem in scope of the postgresql project? The solution is to not require evidence in order to be protected from data theft. Having evidence is nice, you can punish effective attacks, which is a deterrent to any attacker as you pointed out, and may even include financial compensation. It requires physical security as well as software security, and I'm not qualified to solve that problem without help from a lawyer (but I do know you need help from a lawyer to make sure the evidence you gather is usable). Not having usable evidence, however, could fail to deter knowledgeable attackers (remember, in this setting, it would be an inside job, so it would be a very knowledgeable attacker). But in any case, if the deterrence isn't enough, and you get attacked, anything involving redaction as fleshed out in the OP is good for nothing. The damage has been done already. The feature doesn't meaningfully slow down extraction of data, so anything you do can only punish the attacker, not prevent further data theft or damaged reputation/business. Something that requires superuser privilege (or specially granted privilege) in order to gain access to the unredacted value, on the other hand, would considerably slow down the attacker. 
From my proposal, only the second form (unnormalized redacted tuples) would provide any meaningful data security in this sense, but even in the other, less limiting form, it would still prevent unauthorized users from extracting the value: you can no longer do binary search with unredacted data, only a full brute-force search would work. That's because the full value id (that I called prefix id, sorry, leftover from an earlier draft) doesn't relate to the unredacted value, so sorting comparisons (<, <=, >=) don't provide usable information about the value space. So, if there is a chance to implement redaction in a way that truly protects redacted data... even if it costs a bit of performance sometimes. Is avoiding the performance hit worth the risk? I guess the potential users of such a feature are the only ones qualified to answer, and the answer has great weight on how the feature could be implemented. Well, and of course, the quality of the implementation. If my proposal has weaknesses I have not realized yet, it may be worthless. But that's true of all proposals that aim for any meaningful level of security: it's worth a lengthy look.
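The binary-search point is worth making concrete. If ordering predicates still evaluate against the true value of a redacted column, an attacker with only a comparison oracle recovers the value in O(log N) probes. The sketch below is hypothetical; the oracle stands in for repeated queries of the form WHERE card <= $1 against the redacted column:

```c
/* The hidden value an attacker wants; say, the last six digits of a
 * redacted card number.  Purely illustrative. */
static const long secret = 111111;

/* Models a query predicate that the redacted column still answers
 * truthfully, e.g. SELECT count(*) FROM t WHERE card <= $1. */
static int
oracle_leq(long probe)
{
	return secret <= probe;
}

/* Classic binary search: find the smallest x with secret <= x, which
 * is the secret itself. */
long
extract_secret(long lo, long hi)
{
	while (lo < hi)
	{
		long		mid = lo + (hi - lo) / 2;

		if (oracle_leq(mid))
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo;
}
```

About twenty comparisons suffice for a six-digit space, which is why redaction that preserves comparison semantics deters only casual disclosure, not a determined insider.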
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote: What about removing the callback per se and just keeping the argument, as it were. That is, a MemoryContext has a field of type size_t * that is normally NULL, but if set to non-NULL, then we increment the pointed-to value for pallocs and decrement for pfrees. I like that idea; patch attached. Two differences: * At block level, not chunk level. * I use a uint64, because a size_t is only guaranteed to hold one allocation, and we need to sum up many allocations. When unused, it does still appear to have a little overhead in Robert's test on his POWER machine. It seems to be between a 0.5% and 1.0% regression. There's not much extra code on that path, just a branch, pointer dereference, and an addition; so I don't think it will get much cheaper than it is. This regression seems harder to reproduce on my workstation, or perhaps it's just noisier. One thought in either case is that we don't have to touch the API for MemoryContextCreate: rather, the feature can be enabled in a separate call after making the context. That seems fairly awkward to me because the pointer needs to be inherited from the parent context when not specified. I left the extra API call in. The inheritance is awkward anyway, though. If you create a tracked context as a child of an already-tracked context, allocations in the newer one won't count against the original. I don't see a way around that without introducing even more performance problems. If this patch is unacceptable, my only remaining idea is to add the memory only for the current context with no inheritance (thereby eliminating the branch, also). That will require HashAgg to traverse all the child contexts to check whether the memory limit has been exceeded. As Tomas pointed out, that could be a lot of work in the case of array_agg with many groups. 
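The scheme being described (a caller-supplied counter, inherited from the parent context, bumped at block level) can be sketched as follows. The structure and names are illustrative, not the actual aset.c patch:

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal stand-in for a memory context carrying an optional counter. */
typedef struct SketchContext
{
	struct SketchContext *parent;
	uint64_t   *track_mem;		/* NULL means "not tracking" */
} SketchContext;

SketchContext *
context_create(SketchContext *parent, uint64_t *track_mem)
{
	SketchContext *cxt = malloc(sizeof(SketchContext));

	cxt->parent = parent;
	/* inherit the counter from the parent unless one was supplied */
	cxt->track_mem = track_mem ? track_mem
		: (parent ? parent->track_mem : NULL);
	return cxt;
}

void *
context_alloc_block(SketchContext *cxt, size_t size)
{
	if (cxt->track_mem)			/* the branch + pointer deref + add
								 * discussed above */
		*cxt->track_mem += size;
	return malloc(size);
}

void
context_free_block(SketchContext *cxt, void *block, size_t size)
{
	if (cxt->track_mem)
		*cxt->track_mem -= size;
	free(block);
}
```

Note how the sketch reproduces the caveat above: a child created with its own counter stops feeding the ancestor's counter entirely, which is the inheritance awkwardness Jeff mentions.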
Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,482 Size initBlockSize, Size maxBlockSize) { ! /* inherit 'track_mem' from parent context, if available */ ! uint64 *track_mem = (parent == NULL) ? NULL : parent-track_mem; ! ! return AllocSetContextCreateTracked(parent, name, minContextSize, ! initBlockSize, maxBlockSize, ! track_mem); ! } ! ! /* ! * AllocSetContextCreateTracked ! * Create a new AllocSet context, tracking total memory usage. ! * ! * parent: parent context, or NULL if top-level context ! * name: name of context (for debugging --- string will be copied) ! * minContextSize: minimum context size ! * initBlockSize: initial allocation block size ! * maxBlockSize: maximum allocation block size ! * track_mem: caller-supplied variable to use for memory tracking ! */ ! MemoryContext ! AllocSetContextCreateTracked(MemoryContext parent, ! const char *name, ! Size minContextSize, ! Size initBlockSize, ! Size maxBlockSize, ! uint64 *track_mem) ! { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name, ! track_mem); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. *** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context-initBlockSize = initBlockSize; ! context-maxBlockSize = maxBlockSize; ! 
context-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 490,498 if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set-initBlockSize = initBlockSize; ! set-maxBlockSize = maxBlockSize; ! set-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ)
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On 27 September 2014 03:55, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Sep 26, 2014 at 11:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Gavin Flower wrote: Curious: would it be both feasible and useful to have multiple workers process a 'large' table, without complicating things too much? They could each start at a different position in the file. Feasible: no. Useful: maybe, we don't really know. (You could just as well have a worker at double the speed, i.e. double vacuum_cost_limit). vacuum_cost_delay is already 0 by default. So unless you changed that, vacuum_cost_limit does not take effect under vacuumdb. It is pretty easy for vacuum to be CPU limited, and even easier for analyze to be CPU limited (it does a lot of sorting). I think analyzing is the main use case for this patch, to shorten the pg_upgrade window. At least, that is how I anticipate using it. I've been trying to review this thread with the thought "what does this give me?". I am keen to encourage contributions and also keen to extend our feature set, but I do not wish to complicate our code base. Dilip's developments do seem to be good quality; what I question is whether we want this feature. This patch seems to allow me to run multiple VACUUMs at once. But I can already do this, with autovacuum. Is there anything this patch can do that cannot already be done with autovacuum? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Improve automatic analyze messages for inheritance trees
On 6 October 2014 11:07, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I noticed that analyze messages shown by autovacuum don't discriminate between non-inherited cases and inherited cases, as shown in the example below: LOG: automatic analyze of table "postgres.public.pt" system usage: CPU 0.00s/0.01u sec elapsed 0.06 sec LOG: automatic analyze of table "postgres.public.pt" system usage: CPU 0.00s/0.02u sec elapsed 0.11 sec (The first one is for the table "postgres.public.pt" and the second one is for the inheritance tree of "postgres.public.pt".) So, I'd like to propose improving the messages for inherited cases, in order to easily distinguish such cases from non-inherited cases. Please find attached a patch. I'll add this to the upcoming CF. Thanks for the suggestion. It seems like a useful addition. Existing log analysis tools may wish to see "automatic analyze of table" on each row. So it would be good to keep automatic analyze of table "%s.%s.%s" Can we add some words after this to indicate inheritance? (I have no suggestions at present) e.g. automatic analyze of table "%s.%s.%s" (new words go here) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
All, The attached patch for review implements a directory permission system that allows non-superusers to be granted read/write access to specific directories for COPY TO/FROM and the Generic File Access Functions. This is not a complete solution, as it does not currently contain documentation or regression tests. Though it is my hope to get some feedback, as I am sure there are some aspects of it that need to be reworked. So I thought I'd put it out there for comments/review. The approach taken is to create directory aliases that have a unique name and path, as well as an associated ACL list. A superuser can create a new alias to any directory on the system and then provide READ or WRITE permissions to any non-superuser. When a non-superuser then attempts to execute a COPY TO/FROM or any one of the generic file access functions, a permission check is performed against the aliases for the user and target directory. Superusers are allowed to bypass all of these checks. All alias paths must be absolute in order to avoid potential risks. However, in the generic file access functions superusers are still allowed to execute the functions with a relative path, whereas non-superusers are required to provide an absolute path.

- Implementation Details -

System Catalog: pg_diralias
- dirname - the name of the directory alias
- dirpath - the directory path - must be absolute
- diracl - the ACL for the directory

Syntax:
CREATE DIRALIAS name AS 'path'
ALTER DIRALIAS name AS 'path'
ALTER DIRALIAS name RENAME TO new_name
DROP DIRALIAS name

This is probably the area where I would really appreciate your thoughts and recommendations. To GRANT permissions to a directory alias, I had to create a special variant of GRANT, since READ and WRITE are not reserved keywords and cause grammar issues. Therefore, I chose to start with the following syntax: GRANT ON DIRALIAS name permissions TO roles where permissions is either READ, WRITE or ALL. 
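In outline, the permission check described above might look like the following. The types and names here are hypothetical stand-ins; the real patch stores the ACL in pg_diralias and goes through the regular ACL machinery:

```c
#include <string.h>

/* Hypothetical permission bits and alias record, for illustration. */
#define ACL_DIR_READ	(1 << 0)
#define ACL_DIR_WRITE	(1 << 1)

typedef struct DirAlias
{
	const char *name;			/* alias name */
	const char *path;			/* absolute path, no trailing slash */
	int			acl;			/* granted permission bits */
} DirAlias;

/* True if target lies inside dir (or equals it).  The trailing-character
 * check matters: a bare prefix test would wrongly match sibling paths. */
static int
path_is_under(const char *target, const char *dir)
{
	size_t		len = strlen(dir);

	return strncmp(target, dir, len) == 0 &&
		(target[len] == '/' || target[len] == '\0');
}

/* Does any alias grant all of the requested mode bits on target? */
int
diralias_allows(const DirAlias *aliases, int n,
				const char *target, int mode)
{
	for (int i = 0; i < n; i++)
	{
		if (path_is_under(target, aliases[i].path) &&
			(aliases[i].acl & mode) == mode)
			return 1;
	}
	return 0;
}
```

One subtlety such a check has to get right is path containment: without the trailing '/' test, a target like /srv/exports-secret would incorrectly match an alias for /srv/exports.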
Any comments, suggestions or feedback would be greatly appreciated. Thanks, Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile index b257b02..8cdc5cb 100644 --- a/src/backend/catalog/Makefile +++ b/src/backend/catalog/Makefile @@ -41,7 +41,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\ pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \ pg_foreign_table.h pg_rowsecurity.h \ pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \ - toasting.h indexing.h \ + pg_diralias.h toasting.h indexing.h \ ) # location of Catalog.pm diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c index d30612c..3717bf5 100644 --- a/src/backend/catalog/aclchk.c +++ b/src/backend/catalog/aclchk.c @@ -30,6 +30,7 @@ #include catalog/pg_collation.h #include catalog/pg_conversion.h #include catalog/pg_database.h +#include catalog/pg_diralias.h #include catalog/pg_default_acl.h #include catalog/pg_event_trigger.h #include catalog/pg_extension.h @@ -48,6 +49,7 @@ #include catalog/pg_ts_config.h #include catalog/pg_ts_dict.h #include commands/dbcommands.h +#include commands/diralias.h #include commands/proclang.h #include commands/tablespace.h #include foreign/foreign.h @@ -3183,6 +3185,190 @@ ExecGrant_Type(InternalGrant *istmt) heap_close(relation, RowExclusiveLock); } +/* + * ExecuteGrantDirAliasStmt + * handles the execution of the GRANT/REVOKE ON DIRALIAS command. + * + * stmt - the GrantDirAliasStmt that describes the directory aliases and + *permissions to be granted/revoked. 
+ */
+void
+ExecuteGrantDirAliasStmt(GrantDirAliasStmt *stmt)
+{
+	Relation	pg_diralias_rel;
+	Oid			grantor;
+	List	   *grantee_ids = NIL;
+	AclMode		permissions;
+	ListCell   *item;
+
+	/* Must be superuser to grant directory alias permissions */
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser to grant directory alias permissions")));
+
+	/*
+	 * Grantor is optional.  If it is not provided then set it to the current
+	 * user.
+	 */
+	if (stmt->grantor)
+		grantor = get_role_oid(stmt->grantor, false);
+	else
+		grantor = GetUserId();
+
+	/* Convert grantee names to oids */
+	foreach(item, stmt->grantees)
+	{
+		PrivGrantee *grantee = (PrivGrantee *) lfirst(item);
+
+		if (grantee->rolname == NULL)
+			grantee_ids = lappend_oid(grantee_ids, ACL_ID_PUBLIC);
+		else
+		{
+			Oid			roleid = get_role_oid(grantee->rolname, false);
+
+			grantee_ids = lappend_oid(grantee_ids, roleid);
+		}
+	}
+
+	permissions = ACL_NO_RIGHTS;
+
+	/* If ALL was provided then set permissions to ACL_ALL_RIGHTS_DIRALIAS */
+	if (stmt->permissions == NIL)
+		permissions = ACL_ALL_RIGHTS_DIRALIAS;
+	else
+	{
+		/* Condense all permissions */
+		foreach(item, stmt->permissions)
+		{
+			AccessPriv *priv = (AccessPriv *)
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
Brightwell, Adam adam.brightw...@crunchydatasolutions.com writes: The attached patch for review implements a directory permission system that allows for providing a directory read/write capability to directories for COPY TO/FROM and Generic File Access Functions to non-superusers. TBH, this sounds like it's adding a lot of mechanism and *significant* risk of unforeseen security issues in order to solve a problem that we do not need to solve. The field demand for such a feature is just about indistinguishable from zero. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CREATE POLICY and RETURNING
Hi,

While I was checking the behavior of RLS, I found that the policy for SELECT doesn't seem to be applied to RETURNING. Is this intentional? Please see the following example.

CREATE ROLE foo LOGIN NOSUPERUSER;
CREATE TABLE hoge AS SELECT col FROM generate_series(1,10) col;
ALTER TABLE hoge ENABLE ROW LEVEL SECURITY;
GRANT SELECT, DELETE ON hoge TO foo;
CREATE POLICY hoge_select_policy ON hoge FOR SELECT TO foo USING (col < 4);
CREATE POLICY hoge_delete_policy ON hoge FOR DELETE TO foo USING (col < 8);
\c - foo
DELETE FROM hoge WHERE col = 6 RETURNING *;

The policy hoge_select_policy should disallow the user foo from seeing the row with col = 6. But the last DELETE RETURNING returns that row.

One minor suggestion: what about changing the message as follows? There are two more similar messages in policy.c, and they use the word "row-policy" instead of "policy". For consistency, I think that the following also should use the word "row-policy".

diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 3627539..df45385 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -551,7 +551,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
 	if (HeapTupleIsValid(rsec_tuple))
 		ereport(ERROR,
 				(errcode(ERRCODE_DUPLICATE_OBJECT),
-				 errmsg("policy \"%s\" for relation \"%s\" already exists",
+				 errmsg("row-policy \"%s\" for relation \"%s\" already exists",

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Locking for Rename To new_name works differently for different objects
On Wed, Oct 15, 2014 at 9:34 PM, Robert Haas robertmh...@gmail.com wrote:

On Wed, Oct 15, 2014 at 10:04 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: I have observed that for renaming some objects AccessExclusiveLock is taken on the object, whereas for other kinds of objects no lock is taken on the object before renaming it.

The usual theory for DDL updates of all types (not just rename) is that an explicit lock is only needed for objects whose catalog representation comprises more than one row. Otherwise, the implicit locking involved in updating that row is sufficient to serialize different updates.

I am not sure if this rule is followed for all DDLs; as an example, I tried to compare FUNCTION and SCHEMA. A function has entries in pg_proc and pg_depend (schema, language, return type, etc.), so by the above theory any update should take an explicit lock, which I think holds good, whereas a schema has entries in pg_namespace and pg_depend (owner), but it doesn't take an explicit lock during RENAME. Yet another anomaly is COMMENT ON an object: there is only one row in pg_description, yet it still takes an explicit lock to avoid concurrent activity, which looks right to me.

That's an interesting point that I hadn't considered, but I'm willing to believe that at least some of the differences might also be haphazard.

Yeah, I also think so. Is it important enough that we spend the energy to find all such differences and fix them?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Thu, Oct 16, 2014 at 8:08 AM, Simon Riggs si...@2ndquadrant.com wrote:

I've been trying to review this thread with the thought "what does this give me?". I am keen to encourage contributions and also keen to extend our feature set, but I do not wish to complicate our code base. Dilip's developments do seem to be good quality; what I question is whether we want this feature. This patch seems to allow me to run multiple VACUUMs at once. But I can already do this, with autovacuum. Is there anything this patch can do that cannot already be done with autovacuum?

The difference lies in the fact that vacuumdb (or VACUUM) gives the user the option to control vacuum activity in cases where autovacuum doesn't suffice; one example is performing a vacuum via vacuumdb after pg_upgrade or some other maintenance activity (as mentioned by Jeff upthread). So I think in all such cases having a parallel option can give a benefit in terms of performance, which has already been shown by Dilip upthread by running some tests (with and without the patch).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
[HACKERS] [Segmentation fault] pg_dump binary-upgrade fail for type without element
Hi All,

pg_dump --binary-upgrade fails with a segmentation fault for a type without elements. Consider the following testcase:

rushabh@postgresql$ ./db/bin/psql postgres
psql (9.5devel)
Type "help" for help.

postgres=# drop type typ;
DROP TYPE
postgres=# create type typ as ();
CREATE TYPE
postgres=# \q

rushabh@postgresql$ ./db/bin/pg_dump postgres --binary-upgrade
pg_dump: row number 0 is out of range 0..-1
Segmentation fault (core dumped)

Stack trace:

(gdb) bt
#0  0x003a2cc375f2 in strtoull_l_internal () from /lib64/libc.so.6
#1  0x00417a08 in dumpCompositeType (fout=0x1365200, tyinfo=0x13b1340) at pg_dump.c:9356
#2  0x004156a2 in dumpType (fout=0x1365200, tyinfo=0x13b1340) at pg_dump.c:8449
#3  0x00414b08 in dumpDumpableObject (fout=0x1365200, dobj=0x13b1340) at pg_dump.c:8135
#4  0x004041f8 in main (argc=3, argv=0x7fff838ff6e8) at pg_dump.c:812

In dumpCompositeType(), the query fetches the elements of the composite type, but when there are no elements for the type it returns zero rows. In the following code block:

if (binary_upgrade)
{
	Oid		typrelid = atooid(PQgetvalue(res, 0, i_typrelid));

	binary_upgrade_set_type_oids_by_type_oid(fout, q, tyinfo->dobj.catId.oid);
	binary_upgrade_set_pg_class_oids(fout, q, typrelid, false);
}

it fetches the typrelid, which is required for binary upgrade. But when the query returns zero rows (for a composite type without elements) this fails. Looking further into the code, I found that rather than fetching typrelid from the result, we can use the typrelid already stored in the TypeInfo structure.

The following commit added the code related to binary_upgrade:

commit 28f6cab61ab8958b1a7dfb019724687d92722538
Author: Bruce Momjian br...@momjian.us
Date:   Wed Jan 6 05:18:18 2010 +0000

    binary upgrade: Preserve relfilenodes for views and composite types ---
    even though we don't store data in them, they do consume relfilenodes.
    Bump catalog version.

PFA a patch to fix the issue. I think this needs to be fixed in the back branches as well, because it's affecting pg_upgrade.
The fix should be backported to 9.1; on 9.0 it was not allowed to create a type without elements.

postgres=# select version();
                                                  version
---
 PostgreSQL 9.0.18 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)

postgres=# create type typ as ();
ERROR:  syntax error at or near ")"
LINE 1: create type typ as ();
                            ^

Regards,
Rushabh Lathia
www.EnterpriseDB.com

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 2915329..64d3856 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -9290,7 +9290,6 @@ dumpCompositeType(Archive *fout, TypeInfo *tyinfo)
 	int			i_attalign;
 	int			i_attisdropped;
 	int			i_attcollation;
-	int			i_typrelid;
 	int			i;
 	int			actual_atts;
@@ -9311,8 +9310,7 @@ dumpCompositeType(Archive *fout, TypeInfo *tyinfo)
 					  "pg_catalog.format_type(a.atttypid, a.atttypmod) AS atttypdefn, "
 					  "a.attlen, a.attalign, a.attisdropped, "
 					  "CASE WHEN a.attcollation <> at.typcollation "
-					  "THEN a.attcollation ELSE 0 END AS attcollation, "
-					  "ct.typrelid "
+					  "THEN a.attcollation ELSE 0 END AS attcollation "
 					  "FROM pg_catalog.pg_type ct "
 					  "JOIN pg_catalog.pg_attribute a ON a.attrelid = ct.typrelid "
 					  "LEFT JOIN pg_catalog.pg_type at ON at.oid = a.atttypid "
@@ -9330,8 +9328,7 @@ dumpCompositeType(Archive *fout, TypeInfo *tyinfo)
 	appendPQExpBuffer(query, "SELECT a.attname, "
 					  "pg_catalog.format_type(a.atttypid, a.atttypmod) AS atttypdefn, "
 					  "a.attlen, a.attalign, a.attisdropped, "
-					  "0 AS attcollation, "
-					  "ct.typrelid "
+					  "0 AS attcollation "
 					  "FROM pg_catalog.pg_type ct, pg_catalog.pg_attribute a "
 					  "WHERE ct.oid = '%u'::pg_catalog.oid "
 					  "AND a.attrelid = ct.typrelid "
@@ -9349,15 +9346,12 @@ dumpCompositeType(Archive *fout, TypeInfo *tyinfo)
 	i_attalign = PQfnumber(res, "attalign");
 	i_attisdropped = PQfnumber(res, "attisdropped");
 	i_attcollation = PQfnumber(res, "attcollation");
-	i_typrelid = PQfnumber(res, "typrelid");
 
 	if (binary_upgrade)
 	{
-		Oid			typrelid = atooid(PQgetvalue(res, 0, i_typrelid));
-
 		binary_upgrade_set_type_oids_by_type_oid(fout, q,
 												 tyinfo->dobj.catId.oid);
-		binary_upgrade_set_pg_class_oids(fout, q, typrelid, false);
+		binary_upgrade_set_pg_class_oids(fout, q, tyinfo->typrelid, false);
 	}
 
 	qtypname = pg_strdup(fmtId(tyinfo->dobj.name));

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE POLICY and RETURNING
On 10/16/2014 12:25 PM, Fujii Masao wrote:

Hi, While I was checking the behavior of RLS, I found that the policy for SELECT doesn't seem to be applied to RETURNING. Is this intentional?

This is why I was opposed to having a SELECT policy at all. It should be VISIBLE, INSERT, UPDATE, DELETE. I say VISIBLE instead of READ because I don't think the rows affected by an UPDATE or DELETE should be affected by whether or not they have a RETURNING clause. That's IMO nonsensical, and violates the usual expectations about which clauses can have filtering effects.

So the read-filtering policy should apply to all statements, not just SELECT.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
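To make the trade-off concrete, here is how Fujii's example upthread would behave under each reading. This is a sketch of the two semantics under discussion, not output from any existing build; neither behavior is implemented as described here:

```sql
-- Policies as in the example upthread: rows with col < 4 are visible to
-- foo, rows with col < 8 are deletable by foo.

-- Option A (read filtering applies to every statement): the DELETE
-- simply cannot see the row with col = 6, with or without RETURNING.
DELETE FROM hoge WHERE col = 6 RETURNING *;  -- deletes nothing, returns nothing
DELETE FROM hoge WHERE col = 6;              -- also deletes nothing

-- Option B (read filtering applied only when RETURNING is present): an
-- unrelated clause changes which rows the statement affects.
DELETE FROM hoge WHERE col = 6;              -- deletes the row
DELETE FROM hoge WHERE col = 6 RETURNING *;  -- filtered; deletes nothing
```

Option B is exactly the "effects of your DELETE changed based on an unrelated clause" surprise discussed below.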
Re: [HACKERS] [Segmentation fault] pg_dump binary-upgrade fail for type without element
PFA the patch for the master branch.

On Thu, Oct 16, 2014 at 11:09 AM, Rushabh Lathia rushabh.lat...@gmail.com wrote:

Hi All,

pg_dump --binary-upgrade fails with a segmentation fault for a type without elements. Consider the following testcase:

rushabh@postgresql$ ./db/bin/psql postgres
psql (9.5devel)
Type "help" for help.

postgres=# drop type typ;
DROP TYPE
postgres=# create type typ as ();
CREATE TYPE
postgres=# \q

rushabh@postgresql$ ./db/bin/pg_dump postgres --binary-upgrade
pg_dump: row number 0 is out of range 0..-1
Segmentation fault (core dumped)

Stack trace:

(gdb) bt
#0  0x003a2cc375f2 in strtoull_l_internal () from /lib64/libc.so.6
#1  0x00417a08 in dumpCompositeType (fout=0x1365200, tyinfo=0x13b1340) at pg_dump.c:9356
#2  0x004156a2 in dumpType (fout=0x1365200, tyinfo=0x13b1340) at pg_dump.c:8449
#3  0x00414b08 in dumpDumpableObject (fout=0x1365200, dobj=0x13b1340) at pg_dump.c:8135
#4  0x004041f8 in main (argc=3, argv=0x7fff838ff6e8) at pg_dump.c:812

In dumpCompositeType(), the query fetches the elements of the composite type, but when there are no elements for the type it returns zero rows. In the following code block:

if (binary_upgrade)
{
	Oid		typrelid = atooid(PQgetvalue(res, 0, i_typrelid));

	binary_upgrade_set_type_oids_by_type_oid(fout, q, tyinfo->dobj.catId.oid);
	binary_upgrade_set_pg_class_oids(fout, q, typrelid, false);
}

it fetches the typrelid, which is required for binary upgrade. But when the query returns zero rows (for a composite type without elements) this fails. Looking further into the code, I found that rather than fetching typrelid from the result, we can use the typrelid already stored in the TypeInfo structure.

The following commit added the code related to binary_upgrade:

commit 28f6cab61ab8958b1a7dfb019724687d92722538
Author: Bruce Momjian br...@momjian.us
Date:   Wed Jan 6 05:18:18 2010 +0000

    binary upgrade: Preserve relfilenodes for views and composite types ---
    even though we don't store data in them, they do consume relfilenodes.
    Bump catalog version.

PFA a patch to fix the issue. I think this needs to be fixed in the back branches as well, because it's affecting pg_upgrade. The fix should be backported to 9.1; on 9.0 it was not allowed to create a type without elements.

postgres=# select version();
                                                  version
---
 PostgreSQL 9.0.18 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
(1 row)

postgres=# create type typ as ();
ERROR:  syntax error at or near ")"
LINE 1: create type typ as ();
                            ^

Regards,
Rushabh Lathia
www.EnterpriseDB.com

--
Rushabh Lathia

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index c56a4cb..c528577 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -9269,7 +9269,6 @@ dumpCompositeType(Archive *fout, DumpOptions *dopt, TypeInfo *tyinfo)
 	int			i_attalign;
 	int			i_attisdropped;
 	int			i_attcollation;
-	int			i_typrelid;
 	int			i;
 	int			actual_atts;
@@ -9290,8 +9289,7 @@ dumpCompositeType(Archive *fout, DumpOptions *dopt, TypeInfo *tyinfo)
 					  "pg_catalog.format_type(a.atttypid, a.atttypmod) AS atttypdefn, "
 					  "a.attlen, a.attalign, a.attisdropped, "
 					  "CASE WHEN a.attcollation <> at.typcollation "
-					  "THEN a.attcollation ELSE 0 END AS attcollation, "
-					  "ct.typrelid "
+					  "THEN a.attcollation ELSE 0 END AS attcollation "
 					  "FROM pg_catalog.pg_type ct "
 					  "JOIN pg_catalog.pg_attribute a ON a.attrelid = ct.typrelid "
 					  "LEFT JOIN pg_catalog.pg_type at ON at.oid = a.atttypid "
@@ -9309,8 +9307,7 @@ dumpCompositeType(Archive *fout, DumpOptions *dopt, TypeInfo *tyinfo)
 	appendPQExpBuffer(query, "SELECT a.attname, "
 					  "pg_catalog.format_type(a.atttypid, a.atttypmod) AS atttypdefn, "
 					  "a.attlen, a.attalign, a.attisdropped, "
-					  "0 AS attcollation, "
-					  "ct.typrelid "
+					  "0 AS attcollation "
 					  "FROM pg_catalog.pg_type ct, pg_catalog.pg_attribute a "
 					  "WHERE ct.oid = '%u'::pg_catalog.oid "
 					  "AND a.attrelid = ct.typrelid "
@@ -9328,15 +9325,12 @@ dumpCompositeType(Archive *fout, DumpOptions *dopt, TypeInfo *tyinfo)
 	i_attalign = PQfnumber(res, "attalign");
 	i_attisdropped = PQfnumber(res, "attisdropped");
 	i_attcollation = PQfnumber(res, "attcollation");
-	i_typrelid = PQfnumber(res, "typrelid");
 
 	if (dopt->binary_upgrade)
 	{
-		Oid			typrelid = atooid(PQgetvalue(res, 0, i_typrelid));
-
 		binary_upgrade_set_type_oids_by_type_oid(fout, q,
 												 tyinfo->dobj.catId.oid);
-		binary_upgrade_set_pg_class_oids(fout, q, typrelid, false);
+		binary_upgrade_set_pg_class_oids(fout, q, tyinfo->typrelid, false);
 	}
 
 	qtypname = pg_strdup(fmtId(tyinfo->dobj.name));

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve automatic analyze messages for inheritance trees
(2014/10/16 11:45), Simon Riggs wrote:

On 6 October 2014 11:07, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: I noticed that analyze messages shown by autovacuum don't discriminate between non-inherited cases and inherited cases, as shown in the below example:

LOG:  automatic analyze of table "postgres.public.pt" system usage: CPU 0.00s/0.01u sec elapsed 0.06 sec
LOG:  automatic analyze of table "postgres.public.pt" system usage: CPU 0.00s/0.02u sec elapsed 0.11 sec

(The first one is for the table "postgres.public.pt" and the second one is for the inheritance tree "postgres.public.pt".) So, I'd like to propose improving the messages for inherited cases, in order to easily distinguish such cases from non-inherited cases. Please find attached a patch. I'll add this to the upcoming CF.

Thanks for the suggestion. It seems like a useful addition.

Existing log analysis may wish to see "automatic analyze of table" on each row. So it would be good to keep

  automatic analyze of table \"%s.%s.%s\"

Agreed.

Can we add some words after this to indicate inheritance? (I have no suggestions at present) e.g.

  automatic analyze of table \"%s.%s.%s\" (new words go here)

How about this?

  automatic analyze of table \"%s.%s.%s\" as inheritance tree

Thank you for the comment.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE POLICY and RETURNING
On 10/16/2014 01:44 PM, Craig Ringer wrote:

So the read-filtering policy should apply to all statements. Not just SELECT.

Oh, IIRC one wrinkle in the prior discussion about this was that doing so would prevent the implementation of policies that permit users to update/delete rows they cannot otherwise see. That's an argument in favour of only applying a read-filtering policy where a RETURNING clause is present, but that introduces the "surprise! the effects of your DELETE changed based on an unrelated clause!" issue.

Keep in mind, when considering RETURNING, that users don't always add this clause directly. PgJDBC will tack a RETURNING clause onto the end of a statement if the user requests generated keys, for example. They will be very surprised if the behaviour of their DML changes based on whether or not they asked for generated keys.

To my mind, having behaviour change based on RETURNING is actively wrong, whereas policies that permit rows to be updated/deleted but not selected are a nice-to-have at most.

I'd really like to see some more coverage of the details of how these policies apply to inheritance, both the read and write sides of DML with RETURNING clauses, etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers