Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns
Hello, At Tue, 14 Apr 2015 12:10:35 +0900, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote in 552c852b.2050...@lab.ntt.co.jp

On 2015/04/13 23:25, Jim Nasby wrote:
On 4/13/15 4:58 AM, Etsuro Fujita wrote:
On 2015/04/10 21:40, Etsuro Fujita wrote:
On 2015/04/09 12:07, Etsuro Fujita wrote:
I'll update the patch, which will contain only an infrastructure for this in the PG core, and will not contain any postgres_fdw change.

I updated the patch based on your comments. Updated patch attached. In the patch the following FDW APIs have been proposed:

+ RowMarkType
+ GetForeignRowMarkType (LockClauseStrength strength);
+ bool
+ LockForeignRow (EState *estate,
+                 ExecRowMark *erm,
+                 ItemPointer tupleid);
+ HeapTuple
+ FetchForeignRow (EState *estate,
+                  ExecRowMark *erm,
+                  ItemPointer tupleid);

I think that these APIs allow an FDW that has TIDs to use rowmark options such as ROW_MARK_REFERENCE, ROW_MARK_SHARE and ROW_MARK_EXCLUSIVE for its foreign tables so as to match the local semantics exactly, for example.

As you mentioned, it would be better to add helper functions to see whether the foreign table is referenced by any ExecRowMarks. ISTM that an easy way to do that is to modify ExecFindRowMark() so that it allows for the missing case. I didn't include such functions in the patch, though.

I added that function and modified the docs a bit. Please find attached an updated version of the patch.

Why aren't we allowing SELECT FOR KEY SHARE?

I omitted that mode (and the FOR NO KEY UPDATE mode) for simplicity, but both modes have been allowed. However, I'm not sure if those modes are useful because we don't have information about keys for a remote table.

If I understand this correctly, the two lock modes have no relation to key column definitions, and are provided as a weaker lock in exchange for some risks, like advisory locks: we can use FOR NO KEY UPDATE in transactions that eventually update key columns, but we should also accept the unexpected result.
In other words, the key in the context of FOR NO KEY UPDATE and FOR KEY SHARE is only in the programmers' minds. Apart from feasibility, I see no reason to omit them if this is correct. As an example, the following operations cause an unexpected result.

Ex. 1
Session A=# create table t (a int primary key, b int);
Session A=# insert into t (select a, 1 from generate_series(0, 99) a);
Session A=# begin;
Session A=# select * from t where a = 1 for no key update;
Session B=# begin;
Session B=# select * from t where a = 1 for key share;
Session B=# update t set a = -a where a = 1;
-- session B is blocked

Ex. 2
Session A=# create table t (a int, b int); -- a is the key in mind.
Session A=# insert into t (select a, 1 from generate_series(0, 99) a);
Session A=# begin;
Session A=# select * from t where a = 1 for no key update;
Session B=# begin;
Session B=# select * from t where a = 1 for key share;
Session A=# update t set a = -a where a = 1;
UPDATE 1
Session A=# commit;
Session B=# select * from t where a = 1;
(0 rows) -- Woops.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
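The blocking in Ex. 1 and the non-blocking in Ex. 2 both follow from the documented row-level lock conflict table: an UPDATE that changes a (declared) key column acquires FOR UPDATE strength, while one that changes only non-key columns acquires FOR NO KEY UPDATE strength. A small Python sketch of that conflict matrix, for illustration only (the compact `a + b >= 3` rule is just one encoding of the table in the docs, not PostgreSQL's actual code):

```python
# Row-level lock modes, weakest to strongest, as in the PostgreSQL docs.
KEY_SHARE, SHARE, NO_KEY_UPDATE, UPDATE = range(4)

def conflicts(a: int, b: int) -> bool:
    """True when two row-lock modes block each other.

    The documented matrix reduces to this rule: FOR KEY SHARE conflicts
    only with FOR UPDATE; FOR SHARE additionally conflicts with
    FOR NO KEY UPDATE; the two strongest modes conflict with themselves.
    """
    return a + b >= 3

# Ex. 1: A holds FOR NO KEY UPDATE; B's UPDATE of a declared key column
# needs FOR UPDATE strength, so B blocks.
print(conflicts(NO_KEY_UPDATE, UPDATE))      # True
# Ex. 2: no key columns are declared, so A's UPDATE takes only
# FOR NO KEY UPDATE strength, which B's FOR KEY SHARE does not block --
# hence the surprising "(0 rows)" seen by B afterwards.
print(conflicts(KEY_SHARE, NO_KEY_UPDATE))   # False
```

This is why the "key" here exists only in the programmer's mind: with no primary key declared, the system cannot upgrade the update to FOR UPDATE strength.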
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
KaiGai-san,

On 2015/04/14 14:04, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
* Fix typos
Please review the v11 patch, and mark it as "ready for committer" if it's ok.
It's OK for me, and it should be reviewed by other people to get it committed.

Thanks!

In addition to essential features, I tried to implement relation listing in EXPLAIN output. The attached explain_forein_join.patch adds the capability to show the join combination of a ForeignScan in EXPLAIN output as an additional item, "Relations". I thought that using an array to list relations was a good way too, but I chose one string value because users would like to know the order and type of joins too.

A bit different from my expectation... I expected it to display the names of the local foreign tables (and their aliases), not the remote ones, because all the local join logic displays local foreign table names. It is easy to adjust, isn't it? Probably, all you need to do is put a local relation name in the text buffer (at deparseSelectSql) instead of the deparsed remote relation.

Oops, that's right. Attached is the revised version. I chose the fully qualified name, schema.relname [alias], for the output. It would waste some cycles during planning if that were not for EXPLAIN, but it seems difficult to get a list of relation names in the ExplainForeignScan() phase, because the planning information has gone away at that time.

-- 
Shigeru HANADA
shigeru.han...@gmail.com

explain_foreign_join_v2.patch Description: Binary data
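For illustration only, the chosen "schema.relname [alias]" format described above can be sketched like this (the function name and the alias-suppression behaviour are assumptions, not the patch's actual code):

```python
def relation_label(schema: str, relname: str, alias: str = "") -> str:
    """Format one local relation for the EXPLAIN "Relations" item as
    "schema.relname [alias]", showing the alias only when it differs
    from the relation name (illustrative convention)."""
    label = f"{schema}.{relname}"
    if alias and alias != relname:
        label += f" [{alias}]"
    return label

print(relation_label("public", "ft1", "t"))   # public.ft1 [t]
print(relation_label("public", "ft2"))        # public.ft2
```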
[HACKERS] Make more portable TAP tests of initdb
Hi all,

I noticed that src/bin/initdb/t/001_initdb.pl uses rm directly via a system() call like this:

system_or_bail "rm -rf '$tempdir'/*";

This way of doing things is not portable, particularly on platforms that do not have rm, like... Windows, where the equivalent is del. And we could actually use remove_tree with its option keep_root to get the same effect in pure perl, as mentioned here: http://perldoc.perl.org/File/Path.html

With this formulation:

remove_tree($tempdir, {keep_root => 1});

Attached is a patch doing that.
Regards,
-- 
Michael

diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index d12be84..85db9ff 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -1,6 +1,7 @@
 use strict;
 use warnings;
 use TestLib;
+use File::Path qw(remove_tree);
 use Test::More tests => 19;
 
 my $tempdir = TestLib::tempdir;
@@ -18,27 +19,27 @@ command_fails([ 'initdb', '-S', "$tempdir/data3" ],
 mkdir "$tempdir/data4" or BAIL_OUT($!);
 command_ok([ 'initdb', "$tempdir/data4" ], 'existing empty data directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+remove_tree($tempdir, {keep_root => 1});
 
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'separate xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+remove_tree($tempdir, {keep_root => 1});
 
 command_fails(
 	[ 'initdb', "$tempdir/data", '-X', 'pgxlog' ],
 	'relative xlog directory not allowed');
 
-system_or_bail "rm -rf '$tempdir'/*";
+remove_tree($tempdir, {keep_root => 1});
 
 mkdir "$tempdir/pgxlog";
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing empty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+remove_tree($tempdir, {keep_root => 1});
 
 mkdir "$tempdir/pgxlog";
 mkdir "$tempdir/pgxlog/lost+found";
 command_fails([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing nonempty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+remove_tree($tempdir, {keep_root => 1});
 
 command_ok([ 'initdb', '-T', 'german', "$tempdir/data" ],
 	'select default dictionary');
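For comparison, the same delete-the-contents-but-keep-the-directory behaviour that remove_tree's keep_root option provides can be sketched in Python (illustration only; Python's shutil has no direct keep_root flag, so it is emulated here):

```python
import os
import shutil
import tempfile

def remove_tree_keep_root(path: str) -> None:
    """Delete everything under *path* but keep *path* itself,
    mirroring Perl's remove_tree($dir, {keep_root => 1})."""
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)
        else:
            os.unlink(full)

# Build a throwaway tree, then empty it while keeping the root.
tempdir = tempfile.mkdtemp()
os.makedirs(os.path.join(tempdir, "data", "pg_xlog"))
open(os.path.join(tempdir, "postgresql.conf"), "w").close()
remove_tree_keep_root(tempdir)
print(os.path.isdir(tempdir), os.listdir(tempdir))  # True []
```

Keeping the root matters here because the tests reuse $tempdir for the next initdb invocation; deleting and recreating it would also work but would lose any permissions set on it.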
Re: [HACKERS] inherit support for foreign tables
On 2015/03/23 2:57, Tom Lane wrote:
Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: [ fdw-inh-8.patch ]
I've committed this with some substantial rearrangements, notably:
* I thought that if we were doing this at all, we should go all the way and allow foreign tables to be both inheritance parents and children.

I found that when setting a foreign table to be the parent of an inheritance set that only contains foreign tables, SELECT FOR UPDATE on the inheritance parent fails with a can't-happen error condition. Here is an example:

$ createdb mydb
$ psql mydb
psql (9.5devel)
Type "help" for help.
mydb=# create table t1 (c1 int);
CREATE TABLE
mydb=# create table t2 (c1 int);
CREATE TABLE
$ psql postgres
psql (9.5devel)
Type "help" for help.
postgres=# create extension postgres_fdw;
CREATE EXTENSION
postgres=# create server myserver foreign data wrapper postgres_fdw options (dbname 'mydb');
CREATE SERVER
postgres=# create user mapping for current_user server myserver;
CREATE USER MAPPING
postgres=# create foreign table ft1 (c1 int) server myserver options (table_name 't1');
CREATE FOREIGN TABLE
postgres=# create foreign table ft2 (c1 int) server myserver options (table_name 't2');
CREATE FOREIGN TABLE
postgres=# alter foreign table ft2 inherit ft1;
ALTER FOREIGN TABLE
postgres=# select * from ft1 for update;
ERROR: could not find junk tableoid1 column

I think this is a bug. Attached is a patch fixing this issue.

Best regards,
Etsuro Fujita

*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***************
*** 3193,3218 ****
  select * from bar where f1 in (select f1 from foo) for update;
                                  QUERY PLAN
  --------------------------------------------------------------
   LockRows
!    Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
     ->  Hash Join
!          Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
           Hash Cond: (bar.f1 = foo.f1)
           ->  Append
                 ->  Seq Scan on public.bar
!                      Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*
                 ->  Foreign Scan on public.bar2
!                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.tableoid, bar2.*
                       Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
           ->  Hash
!                Output: foo.ctid, foo.tableoid, foo.*, foo.f1
                 ->  HashAggregate
!                      Output: foo.ctid, foo.tableoid, foo.*, foo.f1
                       Group Key: foo.f1
                       ->  Append
                             ->  Seq Scan on public.foo
!                                  Output: foo.ctid, foo.tableoid, foo.*, foo.f1
                             ->  Foreign Scan on public.foo2
!                                  Output: foo2.ctid, foo2.tableoid, foo2.*, foo2.f1
                                   Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
  (22 rows)
--- 3193,3218 ----
                                  QUERY PLAN
  --------------------------------------------------------------
   LockRows
!    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
     ->  Hash Join
!          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
           Hash Cond: (bar.f1 = foo.f1)
           ->  Append
                 ->  Seq Scan on public.bar
!                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                 ->  Foreign Scan on public.bar2
!                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                       Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
           ->  Hash
!                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                 ->  HashAggregate
!                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                       Group Key: foo.f1
                       ->  Append
                             ->  Seq Scan on public.foo
!                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                             ->  Foreign Scan on public.foo2
!                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                   Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
  (22 rows)
***************
*** 3230,3255 ****
  select * from bar where f1 in (select f1 from foo) for share;
                                  QUERY PLAN
  --------------------------------------------------------------
   LockRows
!    Output: bar.f1, bar.f2, bar.ctid, bar.tableoid, bar.*, foo.ctid, foo.tableoid, foo.*
Re: [HACKERS] FPW compression leaks information
On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 04/14/2015 05:42 AM, Robert Haas wrote: On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about. I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position is security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. I think it's a fairly narrow attack vector. As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. 
I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] FPW compression leaks information
* Magnus Hagander (mag...@hagander.net) wrote: On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote: I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position is security-sensitive information, we're doing something wrong. I really don't see the downside to limiting who can see that information. I meant to mention it in my email last night with the patch allowing the GRANT system to work for other functions, but I added the functions suggested by Michael to the list of ones which don't have the default ACL of 'execute to PUBLIC' to that set. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. I think it's a fairly narrow attack vector. As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. I agree that it's something to document, but I don't think it's sensible to have the default be at risk. That said, as long as there's a way to restrict access to these functions (without having to fight with the system on upgrades, etc, to have that be preserved..) then I'm not going to push too hard about the default. We need a proper hardening guide anyway, this would just need to be included in that documentation. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better. I *really* don't like the idea that the fix for such an information disclosure risk is to completely disable an extremely useful feature. 
In fact, I find the suggestion of it rather ridiculous. I do like the idea of being able to control this on a per-table basis as that's a useful feature, but that's completely independent from this concern. Thanks! Stephen
Re: [HACKERS] [COMMITTERS] pgsql: Use Intel SSE 4.2 CRC instructions where available.
On 14 April 2015 at 11:34, Heikki Linnakangas hlinn...@iki.fi wrote: Or we can punt and always build the version with the runtime check, unless overridden manually at configure command line. That seems safer as the default, which is what the original patch did. We can optimise that for known good builds later if desired. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training Services
Re: [HACKERS] Make more portable TAP tests of initdb
On Apr 14, 2015, at 9:05 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: http://perldoc.perl.org/File/Path.html With this formulation: remove_tree($tempdir, {keep_root => 1}); Does Perl 5.8 have this? Yes, it does. http://cpansearch.perl.org/src/NWCLARK/perl-5.8.9/lib/File/Path.pm Best, David
Re: [HACKERS] deparsing utility commands
On 4/9/15 12:14 PM, Alvaro Herrera wrote: Alvaro Herrera wrote: Executive summary: There is now a CommandDeparse_hook; deparse_utility_command is provided as an extension, intended for 9.6; the rest of the patch would be pushed to 9.5. Actually here's another approach I like better: use a new pseudotype, pg_ddl_command, which internally is just a pointer to a CollectedCommand struct. We cannot give access to the pointer directly in SQL, so much like type internal or pg_node_tree the in/out functions should just error out. But the type can be returned as a column in pg_event_trigger_ddl_command. An extension can create a function that receives that type as argument and returns a different representation of the command; the third patch creates a function ddl_deparse_to_json() which does that. I've tested this approach and it returns the same results as the previous approach for all calls to pg_event_trigger_ddl_command() made by the pg_audit regression tests. For my testing I was generally only applying patches 0001 and 0002 (although 0001 on its own also worked for my case). I applied patch 0003 only for the purpose of being sure it did not break anything, and did not specifically test the functionality of 0002 or 0003. However, I've tested 0001 over a very wide range of commands and have not found any anomalous behaviors. I've reviewed the 0001 and 0002 patches and don't have anything to add beyond what has been mentioned in previous reviews. You've applied your pattern consistently across a wide variety of utility commands, which is something of a feat. Since 0003 will not be committed to core, I was more interested in the mechanism used to connect 0003 to 0001 and 0002. I like the new approach better than the old one in that there is no connecting hook now. As you say, it provides more flexibility in terms of using the command data in as many ways as you like. Even better, the user has access to three different representations and can choose which one is best for them.
And, of course, they can write code for their own representation using pg_ddl_command, JSON, or SQL. I also like the way the patch has been broken up. The core functionality will finally make event triggers truly useful in a high-level language like pl/pgsql. I can see many uses for them now that pg_event_trigger_ddl_command() is present. From my review and testing, I see no barriers to committing patches 0001 and 0002. I'd love to see this functionality in 9.5. Regards, -- - David Steele da...@pgmasters.net
Re: [HACKERS] BRIN range operator class
Judging from a quick look, I think patches 1 and 5 can be committed quickly; they imply no changes to other parts of BRIN. (Not sure why 1 and 5 are separate. Any reason for this?) Also patch 2. Not much reason, except that 1 includes only functions, but 5 includes operators. Patch 4 looks like a simple bugfix (or maybe a generalization) of BRIN framework code; it should also be committable right away. Needs a closer look of course. Patch 3 is a problem. That code is there because the union proc is only used in a corner case in Minmax, so if we remove it, user-written Union procs are very likely to remain buggy for long. If you have a better idea to test Union in Minmax, or some other way to turn that stuff off for the range stuff, I'm all ears. Just let's make sure the support procs are tested to avoid stupid bugs. Before I introduced that, my Minmax Union proc was all wrong. I removed this test because I don't see a way to support it. I believe any other implementation that is more complicated than minmax will fail in there. It is better to catch such bugs with the regression tests, so I tried to improve them. GiST, SP-GiST and GIN don't have similar checks, but they have more complicated user-defined functions. Patch 7 I don't understand. Will have to look closer. Are you saying Minmax will depend on Btree opclasses? I remember thinking of doing it that way at some point, but wasn't convinced for some reason. No, there isn't any additional dependency. It makes minmax operator classes use the procedures from pg_amop instead of adding them to pg_amproc. It also makes the operator class safer for cross-data-type usage. Actually, I just checked and found out that we get wrong answers from the index on the current master without this patch. You can reproduce it with this query on the regression database: select * from brintest where timestampcol = '1979-01-29 11:05:09'::timestamptz; The inclusion-opclasses patch makes it possible to add cross-type brin regression tests.
I will add more of them in the next version.

Patch 6 seems the real meat of your own stuff. I think there should be a patch 8 also but it's not attached ... ??

I had another commit not intended to be sent. Sorry about that.
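For context on what the union support procedure under discussion must guarantee: given two summaries of block ranges, it has to produce a summary covering every value covered by either input. A minimal Python sketch of that contract for Minmax-style (min, max) summaries, for illustration only (not BRIN's actual C code):

```python
def minmax_union(a, b):
    """Merge two (min, max) block-range summaries; the result must
    cover every value covered by either input -- the invariant the
    regression-test check discussed above tries to enforce for
    user-written Union procs."""
    if a is None:          # an empty/unsummarized range contributes nothing
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

print(minmax_union((1, 5), (3, 9)))   # (1, 9)
print(minmax_union(None, (2, 4)))     # (2, 4)
```

A buggy union proc breaks the index silently: the summary simply fails to cover some heap values, so queries skip ranges they should scan, which is why exercising it in the regression tests matters.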
[HACKERS] Re: [COMMITTERS] pgsql: Use Intel SSE 4.2 CRC instructions where available.
On 04/14/2015 07:01 PM, Simon Riggs wrote: On 14 April 2015 at 11:34, Heikki Linnakangas hlinn...@iki.fi wrote: Or we can punt and always build the version with the runtime check, unless overridden manually at configure command line. That seems safer as the default, which is what the original patch did. We can optimise that for known good builds later if desired. I committed the __SSE4_2__ test for now, so we still build the SSE-only version if the target allows that. We'll see how that works.. - Heikki
Re: [HACKERS] inherit support for foreign tables
On 4/14/15 5:49 AM, Etsuro Fujita wrote:
postgres=# create foreign table ft1 (c1 int) server myserver options (table_name 't1');
CREATE FOREIGN TABLE
postgres=# create foreign table ft2 (c1 int) server myserver options (table_name 't2');
CREATE FOREIGN TABLE
postgres=# alter foreign table ft2 inherit ft1;
ALTER FOREIGN TABLE
postgres=# select * from ft1 for update;
ERROR: could not find junk tableoid1 column
I think this is a bug. Attached is a patch fixing this issue.

What happens when the foreign side breaks the inheritance? Does the FDW somehow know to check that fact for each query? What do you gain from having the local table have inheritance? Basically, I think we have to be very careful about implementing features that imply the local side knows something about the persisted state of the remote side (in this case, whether there's inheritance). Anything like that sets us up for synchronization problems.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns
On 4/14/15 1:05 AM, Kyotaro HORIGUCHI wrote:
As an example, the following operations cause an unexpected result.
Ex. 1
Session A=# create table t (a int primary key, b int);
Session A=# insert into t (select a, 1 from generate_series(0, 99) a);
Session A=# begin;
Session A=# select * from t where a = 1 for no key update;
Session B=# begin;
Session B=# select * from t where a = 1 for key share;
Session B=# update t set a = -a where a = 1;
-- session B is blocked
Ex. 2
Session A=# create table t (a int, b int); -- a is the key in mind.
Session A=# insert into t (select a, 1 from generate_series(0, 99) a);
Session A=# begin;
Session A=# select * from t where a = 1 for no key update;
Session B=# begin;
Session B=# select * from t where a = 1 for key share;
Session A=# update t set a = -a where a = 1;
UPDATE 1
Session A=# commit;
Session B=# select * from t where a = 1;
(0 rows) -- Woops.

Those results are indeed surprising, but since we allow it in a direct connection I don't see why we wouldn't allow it in the Postgres FDW... As for the FDW not knowing about keys, why would it need to? If you try to do something illegal it's the remote side that should throw the error, not the FDW. Of course, if you try to do a locking operation on an FDW that doesn't support it, the FDW should throw an error... but that's different.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Make more portable TAP tests of initdb
Michael Paquier wrote: Hi all, I noticed that src/bin/initdb/t/001_initdb.pl uses rm directly via a system() call like this: system_or_bail "rm -rf '$tempdir'/*"; This way of doing things is not portable, particularly on platforms that do not have rm, like... Windows, where the equivalent is del. And we could actually use remove_tree with its option keep_root to get the same effect in pure perl as mentioned here: http://perldoc.perl.org/File/Path.html With this formulation: remove_tree($tempdir, {keep_root => 1}); Does Perl 5.8 have this? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] FPW compression leaks information
On 04/14/2015 05:42 AM, Robert Haas wrote: On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: Care to name some? This is certainly quite cumbersome to exploit, but it's doable. We've talked a lot about covert channels and timing attacks on RLS, but this makes me more worried because you can attack passwords stored in pg_authid. Well, one thing to think about is that, if we're going to take this kind of attack seriously, it could be used on user data, too. I mean, you've just got to be able to get a row into the same block as the row you want to find out something about, and there's no reason that need be true only for pg_authid. Sure. And, even if you ban access to information on the pg_xlog insert location, what's to prevent someone from gleaning similar information via a timing attack on the compression itself? Yep, that's another vector. Even more fiddly than the record size, but nevertheless. You can even get some information from figuring out, by trial and error, how much you need to expand a row to get it to spill over into another block. If there happens to be enough free-space on the page where the row is located to allow a HOT update, you can repeatedly update it until it gets moved to some unrelated page. Granted, knowing the length of an unseen row isn't as good as knowing the contents, but it's still potentially useful information. True as well. That's not an issue for the MD5 hashes stored in pg_authid, they all have the same length, but it can be for application data. As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. 
Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about. I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position as security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. - Heikki
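The underlying attack vector can be demonstrated outside PostgreSQL entirely: whenever attacker-controlled bytes are compressed in the same block as a secret, the compressed length leaks whether a guess matches (the CRIME-style oracle). A hedged Python sketch with zlib, using a made-up hash value — the analogue here of observing compressed full-page-write record sizes:

```python
import zlib

# Hypothetical secret (e.g. a pg_authid md5 hash) that shares a page,
# and therefore a compressed FPW image, with attacker-controlled data.
secret = "md5a3556571e93b0d20722ba62be61e8c2d"

def record_size(attacker_bytes: str) -> int:
    """Compressed size when the attacker's row lands next to the secret --
    standing in for the observable WAL record length."""
    return len(zlib.compress((secret + attacker_bytes).encode()))

# A correct guess duplicates the secret, so LZ matching makes the whole
# record compress noticeably better than an incorrect guess of the same
# length does:
correct = record_size("md5a3556571e93b0d20722ba62be61e8c2d")
wrong   = record_size("md5b7c12f0e9a8d45e6f1a2b3c4d5e6f7a8")
print(correct < wrong)  # True
```

This is why the leak exists whether or not the attacker can read the WAL position directly: any observable correlate of the compressed size (length, timing) carries the same bit, which supports the point that per-table control plus documentation beats locking down individual introspection functions.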
[HACKERS] Re: [COMMITTERS] pgsql: Use Intel SSE 4.2 CRC instructions where available.
On 04/14/2015 06:28 PM, Simon Riggs wrote: On 14 April 2015 at 10:09, Heikki Linnakangas heikki.linnakan...@iki.fi wrote: Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund. Did the heavy modifications have any effect on the patch behaviour, or was this just related to where you would like to put the code? Didn't affect behaviour. Hmm, the buildfarm animals using the Intel C compiler didn't like this patch. The problem seems to be that unlike on gcc and clang, icc always has the SSE 4.2 intrinsics (_mm_crc32_u64, _mm_crc32_u8 etc.), even when the target CPU architecture is not SSE 4.2. On gcc/clang, those intrinsics are not defined unless you build with -msse4.2. I'll try to find a fix. I think we could use the __SSE4_2__ define to check whether SSE4.2 is targeted. Or we can punt and always build the version with the runtime check, unless overridden manually at configure command line. - Heikki
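One practical note for anyone comparing the SSE and fallback builds: the SSE 4.2 crc32 instruction family computes CRC-32C (the Castagnoli polynomial), not the zlib/Ethernet CRC-32, so every code path must agree on the same check value. A slow bit-at-a-time Python reference, useful only for cross-checking an optimized implementation:

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bit-by-bit CRC-32C (reflected polynomial 0x82F63B78) -- the same
    CRC computed by the SSE 4.2 crc32 instructions; for verification
    only, far too slow for real use."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C over the nine bytes "123456789":
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

If a compile-time-selected SSE build and the runtime-check build ever disagree on this check value, one of them is picking the wrong CRC variant.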
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On 4/14/15 7:13 PM, Tatsuo Ishii wrote: This patch does not apply cleanly due to the moving of pgbench (patch to filelist.sgml failed). Thank you for pointing that out! Ironic that it was the commit directly after the one I was testing with that broke the patch. It appears the end of the last CF is a very bad time to be behind HEAD. Fixed in attached v8 patch. -- - David Steele da...@pgmasters.net diff --git a/contrib/Makefile b/contrib/Makefile index d63e441..ed9cf6a 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -28,6 +28,7 @@ SUBDIRS = \ oid2name\ pageinspect \ passwordcheck \ + pg_audit\ pg_buffercache \ pg_freespacemap \ pg_prewarm \ diff --git a/contrib/pg_audit/.gitignore b/contrib/pg_audit/.gitignore new file mode 100644 index 000..a5267cf --- /dev/null +++ b/contrib/pg_audit/.gitignore @@ -0,0 +1,5 @@ +log/ +results/ +tmp_check/ +regression.diffs +regression.out diff --git a/contrib/pg_audit/Makefile b/contrib/pg_audit/Makefile new file mode 100644 index 000..7b36011 --- /dev/null +++ b/contrib/pg_audit/Makefile @@ -0,0 +1,21 @@ +# pg_audit/Makefile + +MODULE = pg_audit +MODULE_big = pg_audit +OBJS = pg_audit.o + +EXTENSION = pg_audit +REGRESS = pg_audit +REGRESS_OPTS = --temp-config=$(top_srcdir)/contrib/pg_audit/pg_audit.conf +DATA = pg_audit--1.0.0.sql + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_audit +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_audit/expected/pg_audit-deparse.out b/contrib/pg_audit/expected/pg_audit-deparse.out new file mode 100644 index 000..66fd20d --- /dev/null +++ b/contrib/pg_audit/expected/pg_audit-deparse.out @@ -0,0 +1,897 @@ +-- Load pg_audit module +create extension pg_audit; +-- +-- Create a superuser role that we know the name of for testing +CREATE USER super SUPERUSER; +\connect contrib_regression super; +-- +-- Create auditor role +CREATE ROLE auditor; +-- +-- Create first test user +CREATE USER user1; +ALTER ROLE user1 SET pg_audit.log = 'ddl, ROLE'; +ALTER ROLE user1 SET pg_audit.log_notice = on; +-- +-- Create, select, drop (select will not be audited) +\connect contrib_regression user1 +CREATE TABLE public.test (id INT); +NOTICE: AUDIT: SESSION,1,1,DDL,CREATE TABLE,TABLE,public.test,CREATE TABLE public.test (id INT); +SELECT * FROM test; + id + +(0 rows) + +DROP TABLE test; +NOTICE: AUDIT: SESSION,2,1,DDL,DROP TABLE,TABLE,public.test,DROP TABLE test; +-- +-- Create second test user +\connect contrib_regression super +CREATE USER user2; +ALTER ROLE user2 SET pg_audit.log = 'Read, writE'; +ALTER ROLE user2 SET pg_audit.log_notice = on; +ALTER ROLE user2 SET pg_audit.role = auditor; +\connect contrib_regression user2 +CREATE TABLE test2 (id INT); +GRANT SELECT ON TABLE public.test2 TO auditor; +-- +-- Role-based tests +CREATE TABLE test3 +( + id INT +); +SELECT count(*) + FROM +( + SELECT relname + FROM pg_class + LIMIT 1 +) SUBQUERY; + count +--- + 1 +(1 row) + +SELECT * + FROM test3, test2; +NOTICE: AUDIT: SESSION,1,1,READ,SELECT,,,SELECT * + FROM test3, test2; +NOTICE: AUDIT: OBJECT,1,1,READ,SELECT,TABLE,public.test2,SELECT * + FROM test3, test2; + id | id ++ +(0 rows) + +GRANT INSERT + ON TABLE public.test3 + TO auditor; +-- +-- Object logged because of: +-- insert on test3 +-- select on test2 +WITH CTE AS +( + SELECT id + FROM test2 +) +INSERT 
INTO test3 +SELECT id + FROM cte; +NOTICE: AUDIT: SESSION,2,1,WRITE,INSERT,,,WITH CTE AS +( + SELECT id + FROM test2 +) +INSERT INTO test3 +SELECT id + FROM cte; +NOTICE: AUDIT: OBJECT,2,1,WRITE,INSERT,TABLE,public.test3,WITH CTE AS +( + SELECT id + FROM test2 +) +INSERT INTO test3 +SELECT id + FROM cte; +NOTICE: AUDIT: OBJECT,2,1,READ,SELECT,TABLE,public.test2,WITH CTE AS +( + SELECT id + FROM test2 +) +INSERT INTO test3 +SELECT id + FROM cte; +-- +-- Object logged because of: +-- insert on test3 +WITH CTE AS +( + INSERT INTO test3 VALUES (1) + RETURNING id +) +INSERT INTO test2 +SELECT id + FROM cte; +NOTICE: AUDIT: SESSION,3,1,WRITE,INSERT,,,WITH CTE AS +( + INSERT INTO test3 VALUES (1) + RETURNING id +) +INSERT INTO test2 +SELECT id + FROM cte; +NOTICE: AUDIT: OBJECT,3,1,WRITE,INSERT,TABLE,public.test3,WITH CTE AS +( + INSERT INTO test3 VALUES (1) + RETURNING id +) +INSERT INTO test2 +SELECT id + FROM cte; +GRANT UPDATE ON TABLE public.test2 TO auditor; +-- +-- Object logged because of: +-- insert on test3 +-- update on test2 +WITH CTE AS +( + UPDATE test2 + SET id = 1 + RETURNING id +) +INSERT INTO test3 +SELECT id + FROM cte; +NOTICE:
Re: [HACKERS] Make more portable TAP tests of initdb
On Wed, Apr 15, 2015 at 5:29 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: David E. Wheeler wrote: On Apr 14, 2015, at 1:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Castoroides has 5.8.4. Oops. WUT. Yeah, eh? Anyway I don't think it matters much: just don't enable TAP tests on machines with obsolete Perl. I think this is fine since 5.8's latest release is supported. And castoroides is not using --enable-tap-tests in the buildfarm btw. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
Hi, Before suppressing the symptom, I doubt the necessity and/or validity of giving foreign tables the ability to be a parent. Is there any reasonable use for that ability? I think we should inhibit foreign tables from becoming a parent rather than leave it allowed and then take measures against the consequent symptoms. regards, At Tue, 14 Apr 2015 15:52:18 -0300, Alvaro Herrera alvhe...@2ndquadrant.com wrote in 20150414185218.gx4...@alvh.no-ip.org Jim Nasby wrote: On 4/14/15 5:49 AM, Etsuro Fujita wrote: postgres=# create foreign table ft1 (c1 int) server myserver options (table_name 't1'); CREATE FOREIGN TABLE postgres=# create foreign table ft2 (c1 int) server myserver options (table_name 't2'); CREATE FOREIGN TABLE postgres=# alter foreign table ft2 inherit ft1; ALTER FOREIGN TABLE postgres=# select * from ft1 for update; ERROR: could not find junk tableoid1 column I think this is a bug. Attached is a patch fixing this issue. What happens when the foreign side breaks the inheritance? Does the FDW somehow know to check that fact for each query? This is a meaningless question. The remote tables don't have to have an inheritance relationship already; only the local side sees them as connected. I think the real question is whether we're now (I mean after this patch) emitting useless tableoid columns that we didn't previously have. I think the answer is yes, and if so I think we need a smaller comb to fix the problem. This one seems rather large. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
Thank you for pointing that out! Ironic that it was the commit directly after the one I was testing with that broke the patch. It appears the end of the last CF is a very bad time to be behind HEAD. Fixed in attached v8 patch. Thank you for your quick response. BTW, in my understanding pg_audit allows tracking access to a table even when it is accessed through a view. I think this is a nice feature; it would be better to state it explicitly in the documentation and to include a test case for it in the regression tests. Here is a sample session: CREATE TABLE test2 (id INT); CREATE VIEW vtest2 AS SELECT * FROM test2; GRANT SELECT ON TABLE public.test2 TO auditor; GRANT SELECT ON TABLE public.vtest2 TO auditor; SELECT * FROM vtest2; NOTICE: AUDIT: SESSION,1,1,READ,SELECT,,,SELECT * FROM vtest2; NOTICE: AUDIT: OBJECT,1,1,READ,SELECT,VIEW,public.vtest2,SELECT * FROM vtest2; NOTICE: AUDIT: OBJECT,1,1,READ,SELECT,TABLE,public.test2,SELECT * FROM vtest2; Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 11:10 AM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote: On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 04/14/2015 05:42 AM, Robert Haas wrote: On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about. I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position is security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. I think it's a fairly narrow attack vector. 
As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better. +1 OK. I am fine with implementing anything required here if needed, meaning the following: 1) A doc patch to mention that compression can give hints to attackers when working on sensitive fields that have a non-fixed size. 2) A switch at relation level to control wal_compression. This needs to change XLogRecordAssemble by adding some new logic to check whether a relation is enforcing WAL compression or not. As a reloption, there are three possible values: true, false, and fallback to the system default. Also, I think that we should simply extend XLogRegisterBuffer() and pass to it the reloption flag that is then registered in registered_buffer, and XLogRecordAssemble() decides with this flag whether the block is compressed or not. Do we want to add this reloption switch to indexes as well? Or only tables? For indexes things will get heavier as we would need to add a parameter for all the index types. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 12:15 AM, Stephen Frost wrote: We need a proper hardening guide anyway, this would just need to be included in that documentation. +1. I am sure that many users would like a hardening manual in the official documentation. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote: On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 04/14/2015 05:42 AM, Robert Haas wrote: On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about. I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position is security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. I think it's a fairly narrow attack vector. As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. 
Most people will not care at all, I'm sure - but we need to cater to those that do. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better. +1 Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier wrote: OK. I am fine with implementing anything required here if needed, meaning the following: 1) A doc patch to mention that compression can give hints to attackers when working on sensitive fields that have a non-fixed size. 2) A switch at relation level to control wal_compression. This needs to change XLogRecordAssemble by adding some new logic to check whether a relation is enforcing WAL compression or not. As a reloption, there are three possible values: true, false, and fallback to the system default. Also, I think that we should simply extend XLogRegisterBuffer() and pass to it the reloption flag that is then registered in registered_buffer, and XLogRecordAssemble() decides with this flag whether the block is compressed or not. Do we want to add this reloption switch to indexes as well? Or only tables? For indexes things will get heavier as we would need to add a parameter for all the index types. After looking at reloptions.c, we will also need a new relopt_type, like RELOPT_TYPE_ENUM, to be able to define the 'default' value. We could also use RELOPT_TYPE_STRING. Any input on this matter is welcome. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
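The three-valued reloption logic proposed in this subthread could look, schematically, like the following. This is a hypothetical Python sketch with invented names (the real implementation would live in C in reloptions.c and the xlog insertion code), shown only to make the true/false/fall-back-to-GUC semantics concrete:

```python
# Hypothetical sketch, not the actual PostgreSQL API: an explicit
# relation-level on/off setting wins, and the third value (here modeled
# as None) falls back to the system-wide wal_compression setting.

def resolve_wal_compression(reloption, system_default):
    """reloption is True, False, or None ("fall back to system default")."""
    return system_default if reloption is None else reloption

# e.g. a sensitive catalog could be forced off while the GUC stays on:
print(resolve_wal_compression(False, True))  # False
print(resolve_wal_compression(None, True))   # True
```

The point of the third value is that flipping the system-wide GUC still takes effect for every relation that has not been explicitly pinned one way or the other.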
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On Wed, Apr 15, 2015 at 8:57 AM, David Steele da...@pgmasters.net wrote: On 4/14/15 7:13 PM, Tatsuo Ishii wrote: This patch does not apply cleanly due to the moving of pgbench (patch to filelist.sgml failed). Thank you for pointing that out! Ironic that it was the commit directly after the one I was testing with that broke the patch. It appears the end of the last CF is a very bad time to be behind HEAD. Fixed in attached v8 patch. Thank you for updating the patch! I applied the patch and compiled it successfully with no warnings. I tested the v8 patch with the CURSOR case I mentioned before, and got a segmentation fault again. Here are the log messages from my environment: =# select test(); LOG: server process (PID 29730) was terminated by signal 11: Segmentation fault DETAIL: Failed process was running: select test(); LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. FATAL: the database system is in recovery mode I hope that these messages help you to address this problem. I will also try to address it. Regards, --- Sawada Masahiko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Tue, Apr 14, 2015 at 6:07 PM, Simon Riggs si...@2ndquadrant.com wrote: On 11 March 2015 at 20:55, Peter Eisentraut pete...@gmx.net wrote: I don't know how to move forward. We could give users a knob: This might make your queries faster or not -- good luck. But of course nobody will like that either. What is clear is that large SELECT queries are doing the work VACUUM should do. We should not be doing large background tasks (block cleanup) during long running foreground tasks. But there is no need for changing behaviour during small SELECTs. So the setting of 4 gives current behaviour for small SELECTs and new behaviour for larger SELECTs. Peter commented previously that README.HOT should get an update. The relevant section seems to be When can/should we prune or defragment?. I wonder if it would be a useful heuristic to still prune pages if those pages are already dirty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote: Why is that good? We did discuss this before. I've recapped some of what I believe to be the most salient points below. I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect. Yep, Tom gave that reason here: http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us But there was also this point from Andres - gettimeofday is not free: http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de And this point from me - this can degrade to random eviction under high pressure: http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com You'll notice that my proposal avoids all three of those objections. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 2:55 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: Anyways, I'm still curious if you can post similar numbers basing the throttling on gross allocation counts instead of time. Meaning: some number of buffer allocations has to have occurred before you consider eviction. Besides being faster I think it's a better implementation: an intermittently loaded server will give more consistent behavior. Yeah --- I think wall-clock-based throttling is fundamentally the wrong thing anyway. Are we going to start needing a CPU speed measurement to tune the algorithm with? Not the place to be going. But driving it off the number of allocations that've been done could be sensible. (OTOH, that means you need a central counter, which itself would be a bottleneck.) So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. 
IIUC, this will allow the usage count to increase only after the clock sweep has touched the buffer to decrement it. I think such a solution will be good for cases where many evictions need to be performed to satisfy the workload; OTOH, when not many evictions need to be done, some of the buffers that are accessed much more will have an equal probability of getting evicted compared to buffers that are accessed less. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
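The flag-bit scheme discussed in this subthread can be sketched as a small simulation. This is illustrative Python with invented names, not PostgreSQL code: the sweep decrements the usage count and grants two "touch credits" (the two flag bits), and each later access spends one credit to bump the count, so the count can rise at most twice per sweep cycle:

```python
# Illustrative simulation of the proposed flag-bit clock sweep variant
# (names invented; not the actual buffer manager implementation).

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.credits = 0   # models the "two flag bits"

def sweep_pass(buf):
    # Clock hand passes over the buffer: demote it, re-arm the flags.
    buf.usage_count = max(0, buf.usage_count - 1)
    buf.credits = 2

def touch(buf):
    # An access promotes the buffer only if a credit remains.
    if buf.credits > 0:
        buf.credits -= 1
        buf.usage_count += 1

buf = Buffer()
sweep_pass(buf)
for _ in range(10):       # ten rapid accesses between sweeps...
    touch(buf)
print(buf.usage_count)    # -> 2: the burst only bumps the count twice
```

Sustained access across many sweep cycles is thus needed to reach a high usage count, which is exactly the "harder to max out" property described above.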
Re: [HACKERS] Make more portable TAP tests of initdb
On Tue, Apr 14, 2015 at 05:29:36PM -0300, Alvaro Herrera wrote: David E. Wheeler wrote: On Apr 14, 2015, at 1:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Castoroides has 5.8.4. Oops. WUT. Yeah, eh? Anyway I don't think it matters much: just don't enable TAP tests on machines with obsolete Perl. I think this is fine since 5.8's latest release is supported. Solaris 10 ships Perl 5.8.4, and RHEL 5.11 ships Perl 5.8.8. Therefore, Perl installations lacking this File::Path feature will receive vendor support for years to come. Replacing the use of keep_root with rmtree+mkdir will add 2-10 lines of code, right? Let's do that and not raise the system requirements. Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 7:10 PM, Greg Stark st...@mit.edu wrote: The way the clock sweep algorithm is meant to be thought about is that it's an approximate lru. Each usage count corresponds to an ntile of the lru. So we don't know which buffer is least recently used but it must be in the set of buffers with usage count 0 and that should be 1/nth of all the buffers. Agreed. In order for that property to be maintained though the usage count for some buffer should be getting decremented every time we touch a buffer. That is, every time we promote one buffer to the most recently used ntile we should be demoting some other buffer. Agreed. It's not easy to get this behavior exactly, though, because the buffer you kick out necessarily has a usage count of 0 and the one you bring in probably shouldn't. And we don't wanna have to run the clock sweep every time somebody touches a non-maximal usage count. But I think it is still true that this is, to some degree, what our algorithm is trying to approximate, and I also think it's pretty clear that our current approximation isn't that great. The way our cache works we promote when a buffer is accessed but we only demote when a buffer is flushed. We flush a lot less often than we touch buffers so it's not surprising that the cache ends up full of buffers that are all in the most recently used section. This isn't really correct. We promote when it's accessed, but we demote it when the clock sweep hand passes over it, which happens each time we consider it for eviction. It does not have to do with flushing dirty data to disk, and it does not happen only when the buffer is actually evicted. Now it's complicated by the fact that we aren't promoting buffers directly to the most recently used ntile. We're incrementing the usage count by one. That makes it more of a least frequently used list rather than an lru. I think that's a mistake but I recall some debate about that when it first went in.
Note that the discussion of 2Q, LRU(k), and perhaps others ask not only how recently the page was used, but how frequently it was used. The recency of the next-to-last access is often used as a proxy for frequency. Consider two buffers. One gets touched 5 times in a row, once a day. The other gets touched 5 times per day, at equal intervals. In general, the second buffer is a better choice to retain than the first, even if it has been touched less recently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
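The two-buffer example above can be made concrete with a rough sketch of the LRU-2 idea: rank buffers by the age of their next-to-last access, after collapsing accesses that fall within a correlated reference period (as in the LRU-K literature). Illustrative Python with invented timestamps and an invented 0.5-hour period, not anything from PostgreSQL:

```python
# Sketch of ranking by penultimate access age with a correlated
# reference period (invented numbers; hours into a 24-hour day).

CORRELATED_PERIOD = 0.5  # hours; closer accesses count as one reference

def collapse(history):
    # Merge accesses that fall within the correlated reference period.
    out = []
    for t in history:
        if not out or t - out[-1] >= CORRELATED_PERIOD:
            out.append(t)
    return out

def penultimate_age(history, now):
    h = collapse(history)
    if len(h) < 2:
        return float("inf")  # fewer than 2 distinct uses: best victim
    return now - h[-2]

bursty = [23.0, 23.1, 23.2, 23.3, 23.4]  # 5 touches in one recent burst
steady = [0.0, 6.0, 12.0, 18.0, 23.0]    # 5 touches spread over the day
now = 24.0

# Plain recency would keep "bursty" (touched most recently); this metric
# prefers to keep "steady" and evict the burst.
print(penultimate_age(bursty, now))  # inf
print(penultimate_age(steady, now))  # 6.0
```

The burst collapses to a single reference, so the bursty buffer ranks as the better eviction candidate even though it was touched more recently, which matches the retention preference argued for above.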
Re: [HACKERS] FPW compression leaks information
On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Apr 15, 2015 at 11:10 AM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Apr 15, 2015 at 12:00 AM, Magnus Hagander mag...@hagander.net wrote: On Tue, Apr 14, 2015 at 4:50 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 04/14/2015 05:42 AM, Robert Haas wrote: On Sun, Apr 12, 2015 at 8:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: As to RLS - yeah, that's where I think a lot of the possible covert channel attacks are. But it doesn't have to be RLS per se. It can be, for example, a table some of whose contents you expose via a security barrier view. It can be a security-definer function that you call and it returns some data that you don't have rights to view directly. Basically, it can be any table to which you have some access, but not complete access. Then you can use timing attacks, scrutinize EXPLAIN plans for clues, look at CTIDs, and so on. Basically, I'm worried that it's going to be completely impractical to prevent a certain amount of information leakage when you have partial access to a table, and that in a largely-futile effort to try to prevent it, we'll end up making a whole bunch of things (like the WAL insert position) super-user only, and that this will in fact be a net reduction in security because it'll encourage people to use the superuser account more. Perhaps that concern is ill-founded, but that's what I'm worried about. I'm not a big fan of locking down WAL position information either. If we have to treat the current WAL position is security-sensitive information, we're doing something wrong. But I don't want to just give up either. One option is to make this controllable on a per-table basis, as Amit suggested. Then we could always disable it for pg_authid, add a suitable warning to the docs, and let the DBA enable/disable it for other tables. It's a bit of a cop-out, but would fix the immediate problem. I think it's a fairly narrow attack vector. 
As long as we document it clearly, and make it easy enough to turn it off, I think that's definitely enough. Most people will not care at all, I'm sure - but we need to cater to those that do. I think we could probably even get away with just documenting the risk and having people turn off the compression *completely* if they care about it, but if we can do it at a table level, that's obviously a lot better. +1 OK. I am fine with implementing anything required here if needed, meaning the following: 1) A doc patch to mention that compression can give hints to attackers when working on sensitive fields that have a non-fixed size. I think that this patch is enough as the first step. 2) A switch at relation level to control wal_compression. ALTER TABLE SET is not allowed on a system catalog like pg_authid. So should we change it so that a user can change the flag even on system catalogs? I'm afraid that the change might cause another problem, though. Probably we can disable the compression on every system catalog by default. But I can imagine that someone wants to enable the compression even on a system catalog. For example, pg_largeobject may cause lots of FPWs. This needs to change XLogRecordAssemble by adding some new logic to check whether a relation is enforcing WAL compression or not. As a reloption, there are three possible values: true, false, and fallback to the system default. Also, I think that we should simply extend XLogRegisterBuffer() and pass to it the reloption flag that is then registered in registered_buffer, and XLogRecordAssemble() decides with this flag whether the block is compressed or not. Do we want to add this reloption switch to indexes as well? Maybe. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com wrote: IIUC, this will allow the usage count to increase only after the clock sweep has touched the buffer to decrement it. Yes. I think such a solution will be good for cases where many evictions need to be performed to satisfy the workload; OTOH, when not many evictions need to be done, some of the buffers that are accessed much more will have an equal probability of getting evicted compared to buffers that are accessed less. Possibly, but I think it's even worse under the current algorithm. Under this proposal, if we go for a long time without any buffer evictions, every buffer's usage count will top out at 2 more than wherever it was after the last clock sweep. In the worst case, every buffer (or most of them) could end up with the same usage count. But under the status quo, they'll all go to 5, which is an even bigger loss of information, and which will make the first eviction much more expensive than if they are all pegged at 2 or 3. There could be ways to improve things further, such as by slowly advancing the clock sweep even when no eviction is required. But that's a bit tricky to do, too. You'd like (perhaps) to advance it one step for every buffer allocation, but you don't want a centralized counter, or unnecessary contention on the clock sweep when no eviction is necessary in the first place. There's probably some way to make it work, though, if we put our mind to it. I'm inclined to try the simpler approach first and see what it buys. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
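The contrast drawn above (status-quo counts saturating at the cap of 5 versus the flag-bit rule topping out at the last-sweep value plus 2) can be shown with a toy comparison. Illustrative Python with invented names, not the buffer manager code:

```python
# Toy comparison of usage-count growth between sweeps with no evictions.
# BM_MAX_USAGE_COUNT = 5 matches the cap mentioned in the thread; the
# rest is an invented model, not PostgreSQL code.

BM_MAX_USAGE_COUNT = 5

def status_quo(touches, start=0):
    # Every access increments, capped at 5.
    count = start
    for _ in range(touches):
        count = min(count + 1, BM_MAX_USAGE_COUNT)
    return count

def flag_bit(touches, start=0, credits=2):
    # Only two increments are allowed until the next sweep pass.
    count = start
    for _ in range(touches):
        if credits:
            credits -= 1
            count += 1
    return count

print(status_quo(10))  # 5: heavily and lightly used buffers look alike
print(flag_bit(10))    # 2: counts stay spread out between sweeps
```

Under the status quo a quiet period pushes every touched buffer to the same maximal count, losing the ordering information the sweep needs for its first eviction; the flag-bit rule preserves more of the spread.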
Re: [HACKERS] What exactly is our CRC algorithm?
On 04/03/2015 05:28 AM, Abhijit Menon-Sen wrote: At 2015-04-03 00:33:10 +0300, hlinn...@iki.fi wrote: I came up with the attached. I like it very much. src/port/Makefile has (note src/srv): +# pg_crc32c_sse42.o and its _src.o version need CFLAGS_SSE42 +pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_SSE42) +pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_SSE42) Other than that, this looks great. Thank you. BTW, we might want to move the traditional and legacy crc32 implementations out of src/common. They are only used in backend code now. I agree. Committed this now, after some further cleanup and reorganizing the autoconf stuff. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
On Tue, Apr 14, 2015 at 2:38 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: As I said, I'm still writing the first pieces of this so I'm not sure what other ramifications it will have. If there are any thoughts, I would appreciate them. (Particularly useful input on whether it is acceptable to omit lognums/physnums from _outRangeTblEntry.) I think the general rule is that an RTE should not contain any structure information about the underlying relation that can potentially change: the OID is OK because it's immutable for any given relation. The relkind is not quite immutable because you can create a _SELECT rule on a table and turn it into a view; I'm not sure how we handle that, but it's a fairly minor exception anyhow. Everything else in the RTE, with the new and perhaps-unfortunate exception of security quals, is stuff derived from what's in the query, not the table. I think it would be good for this to work the same way: the structural information about the table should be found in the relcache, not the RTE. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
I've been looking at this again. It has become apparent to me that what we're doing in parse analysis is wrong, and the underlying node representation is wrong too. Here's a different approach, which I hope will give better fruits. I'm still working on implementing the ideas here (and figuring out what the fallout is). Currently, the patch stores RangeTblEntry->eref->colnames in logical order; and it also adds a map from logical colnums to attnum (called lognums). Now this is problematic for two reasons: 1. the lognums map becomes part of the stored representation of a rule; any time you modified the logical ordering of a table underlying some view, the view's _RETURN rule would have to be modified as well. Not good. 2. RTE->eref->colnames is in attlognum order and thus can only be sanely interpreted if RTE->lognums is available, so not only lognums would have to be modified, but colnames as well. I think the solution to both these issues is to store colnames in attnum ordering not logical, and *not* output RTE->lognums as part of _outRangeTblEntry. This means that every time we read the RTE for the table we need to obtain lognums from its tupledesc. RTE->eref->colnames can then be sorted appropriately at plan time. At RTE creation time (addRangeTableEntry and siblings) we can obtain lognums and physnums. Both these arrays are available for later application in setrefs.c, avoiding the need of the obviously misplaced relation_open() call we currently have there. There is one gotcha, which is that expandTupleDesc (and, really, everything from expandRTE downwards) will need to be split in two somehow: one part needs to fill in the colnames array in attnum order, and the other part needs to expand the attribute array into Var nodes in logical order. (If you recall, we need attphysnums at setrefs.c time so that we can fix-up any TupleDesc created from a targetlist so that it contains the proper attphysnum values.
The attphysnum values for each attribute do not propagate properly there, and I believe this is the mechanism to do so.) As I said, I'm still writing the first pieces of this so I'm not sure what other ramifications it will have. If there are any thoughts, I would appreciate them. (Particularly useful input on whether it is acceptable to omit lognums/physnums from _outRangeTblEntry.) An alternative idea would be to add lognums and physnums to RelOptInfo instead of RangeTblEntry (we would do so during get_relation_info). I'm not sure how this works for setrefs.c though, if at all; the advantage is that RelOptInfo is not part of stored rules so we don't have to worry about not saving them there. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
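To illustrate the core idea of the proposal, deriving the lognums map from the table's tupledesc each time the RTE is read rather than persisting it in the stored rule, here is a minimal, hypothetical C sketch. The struct and function names are simplified stand-ins for illustration only, not the actual PostgreSQL catalog structures:

```c
#include <assert.h>

/* Simplified stand-in for a pg_attribute entry: attnum is the immutable
 * storage identity; attlognum is the mutable logical display position. */
typedef struct
{
    int attnum;     /* 1-based physical attribute number */
    int attlognum;  /* 1-based logical column position */
} AttrStub;

/*
 * Build a lognums map (logical position -> attnum) from the tuple
 * descriptor, the way the proposal would do at RTE-read time, instead
 * of storing the map in the RTE and hence in the stored rule.
 */
static void
build_lognums(const AttrStub *atts, int natts, int *lognums)
{
    for (int i = 0; i < natts; i++)
        lognums[atts[i].attlognum - 1] = atts[i].attnum;
}
```

With something like this available at plan time, colnames can stay in attnum order in the stored rule, and reordering a table's logical columns changes only the tupledesc, never the rule.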
Re: [HACKERS] inherit support for foreign tables
Jim Nasby wrote: On 4/14/15 5:49 AM, Etsuro Fujita wrote: postgres=# create foreign table ft1 (c1 int) server myserver options (table_name 't1'); CREATE FOREIGN TABLE postgres=# create foreign table ft2 (c1 int) server myserver options (table_name 't2'); CREATE FOREIGN TABLE postgres=# alter foreign table ft2 inherit ft1; ALTER FOREIGN TABLE postgres=# select * from ft1 for update; ERROR: could not find junk tableoid1 column I think this is a bug. Attached is a patch fixing this issue. What happens when the foreign side breaks the inheritance? Does the FDW somehow know to check that fact for each query? This is a meaningless question. The remote tables don't have to have an inheritance relationship already; only the local side sees them as connected. I think the real question is whether we're now (I mean after this patch) emitting useless tableoid columns that we didn't previously have. I think the answer is yes, and if so I think we need a smaller comb to fix the problem. This one seems rather large. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Make more portable TAP tests of initdb
On Tue, Apr 14, 2015 at 09:25:33AM -0700, David E. Wheeler wrote: On Apr 14, 2015, at 9:05 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: http://perldoc.perl.org/File/Path.html With this formulation: remove_tree($tempdir, {keep_root => 1}); Does Perl 5.8 have this? Yes, it does. http://cpansearch.perl.org/src/NWCLARK/perl-5.8.9/lib/File/Path.pm One way to avoid a lot of these questions would be to use Test::MinimumVersion or similar in a git hook. If we're promising 5.8.mumble, we should be able to keep that promise always rather than search case by case, at least up to bugs in the version-checking software. http://search.cpan.org/~rjbs/Test-MinimumVersion-0.101081/lib/Test/MinimumVersion.pm What say? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Make more portable TAP tests of initdb
David E. Wheeler wrote: On Apr 14, 2015, at 9:05 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: http://perldoc.perl.org/File/Path.html With this formulation: remove_tree($tempdir, {keep_root => 1}); Does Perl 5.8 have this? Yes, it does. http://cpansearch.perl.org/src/NWCLARK/perl-5.8.9/lib/File/Path.pm Uh. 5.8.9 does, but 5.8.8 does not. Not sure if this is a problem. Ancient buildfarm member pademelon (Tom's HPUX stuff) has 5.8.9 -- but that was probably installed by hand. Brolga (really ancient Cygwin stuff) has 5.10.1. As far as I can tell (perldoc perldelta5101) it was upgraded to File::Path 2.07_03, which should be fine. Spoonbill has 5.12.2; 5.12.0 already included the fixes in 5.10.1 so it should be fine also. Friarbird has 5.12.4. Castoroides has 5.8.4. Oops. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Make more portable TAP tests of initdb
David E. Wheeler wrote: On Apr 14, 2015, at 1:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Castoroides has 5.8.4. Oops. WUT. Yeah, eh? Anyway I don't think it matters much: just don't enable TAP tests on machines with obsolete Perl. I think this is fine since 5.8's latest release is supported. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Make more portable TAP tests of initdb
On Apr 14, 2015, at 1:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Castoroides has 5.8.4. Oops. WUT.
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/14/15 5:22 PM, Peter Geoghegan wrote: As long as we're doing random brainstorming, I'd suggest looking at making clocksweep actually approximate LRU-K/LRU-2 (which, again, to be clear, my prototype did not do). The clocksweep could maintain statistics about the recency of the second-to-last access across all buffers, and discriminate against buffers according to what bucket of the population they fit in to. Not sure how aggressively we'd penalize those buffers that had very old penultimate references (or credit those that had very recent penultimate references), or what the bucket partitioning scheme is, but that's probably where I'd take it next. For example, buffers with a penultimate reference that is more than a standard deviation below the mean would be double penalized (and maybe the opposite, for those buffers with penultimate accesses a stddev above the mean). If that didn't work so well, then I'd look into an ARC style recency and frequency list (while remembering things about already evicted blocks, which LRU-K does not do, although that paper is from the early 1990s). Along the lines of brainstorming... why do we even allow usage_count > 1? Clocksweep was used pretty successfully by at least FreeBSD, but they simply used a bit to indicate recently used. Anything that wasn't recently used moved from the active pool to the inactive pool (which tended to be far larger than the active pool with decent amounts of memory), and a small number of buffers were kept on the 'free' list by pulling them out of the inactive pool and writing them if they were dirty. All of this was done on an LRU basis. Given how common it is for the vast bulk of shared_buffers in an install to be stuck at 5, I'd think the first thing we should try is a combination of greatly reducing the max for usage_count (maybe to 2 instead of 1, to simulate 2 pools), and running the clock sweep a lot more aggressively in a background process. 
-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
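A toy model of the FreeBSD-style scheme described above may help: a single recently-used bit per page, plus active and inactive pools, with a scanner that gives referenced pages a second chance and demotes the rest. The names and structure are illustrative sketches, not FreeBSD's actual VM code:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { POOL_ACTIVE, POOL_INACTIVE } Pool;

typedef struct
{
    bool referenced;   /* the single "recently used" bit */
    Pool pool;
} Page;

/* An access marks the page referenced and (re)activates it. */
static void
touch_page(Page *p)
{
    p->referenced = true;
    p->pool = POOL_ACTIVE;
}

/*
 * One scanner pass over a page: a referenced page merely loses its bit
 * (second chance); an unreferenced page drops to the inactive pool,
 * from which free-list candidates are drawn (and written if dirty).
 */
static void
scan_page(Page *p)
{
    if (p->referenced)
        p->referenced = false;
    else
        p->pool = POOL_INACTIVE;
}
```

The point of the comparison: with only one bit, two consecutive scans without an intervening access are enough to make a page evictable, instead of having to count a usage_count of 5 back down to 0.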
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 11 March 2015 at 20:55, Peter Eisentraut pete...@gmx.net wrote: I don't know how to move forward. We could give users a knob: This might make your queries faster or not -- good luck. But of course nobody will like that either. What is clear is that large SELECT queries are doing the work VACUUM should do. We should not be doing large background tasks (block cleanup) during long running foreground tasks. But there is no need for changing behaviour during small SELECTs. So the setting of 4 gives current behaviour for small SELECTs and new behaviour for larger SELECTs. The OP said this: "We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. Effects of that are that long running statements often run much longer than we want, increasing bloat as a result. It also produces wildly varying response times, depending upon extent of cleanup required." This is not a performance patch. This is about one user doing the cleanup work for another. People running large SELECTs should not be penalised. The patch has been shown to avoid that and no further discussion should be required. I don't really care whether we have a parameter for this or not. As long as we have the main feature. It's trivial to add/remove a parameter to control this. Currently there isn't one. I'd like to commit this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training Services hot_disable.v9.patch Description: Binary data
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. Why is that good? My little prototype basically implemented the LRU-K approach to preventing correlated references (which tpc-b has some of, in a really obvious way - not sure how significant that is). This would have accidentally made it weight frequency better, but the prototype did not pretend to actually implement something like LRU-K (which is not the state of the art in any case), because it did not consider the recency of the second-to-last access of the buffer at all. The prototype's performance started off well, but regressed in later pgbench-collector iterations (due to the influence of VACUUM, I think). If I wanted to cheat, I could have only done one large 45 minute run, which would have made the prototype look much better still. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. 
I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect. However, the same can be said of the five-minute rule, which is a highly influential rule of thumb that Jim Gray came up with. The five-minute rule is now obsolete, but that took a long time, and the fundamental observation still applies (Wikipedia says it depends on what type of disks you have these days, but the fact remains that rules of thumb like this can be more or less robust). I think that correlated reference type delays *are* used effectively in real world systems without it being fragile/overly workload dependent. I can't convince myself that this fully solves the (currently existing) problem of usage counts increasing too fast. In theory, every time the clock sweep hand advances, it decrements the usage count of one buffer by one, but also makes it possible for the usage count of that buffer to increase by up to two (or by only one if the buffer previously had a usage count of five). So with the right access pattern, it seems like you could still get to a place where a typical buffer eviction has to do a lot of scanning before it finds a victim buffer. As with the present algorithm, we'd be most likely to have that problem when the buffer hit ratio is very high, so that there are many opportunities for usage counts to go up between each opportunity to push usage counts back down. But it seems like it would at least limit the damage. I haven't tried this or anything, so this is just random brainstorming. That's what it'll take, I think -- random brainstorming, and crude prototypes to test theories. 
As long as we're doing random brainstorming, I'd suggest looking at making clocksweep actually approximate LRU-K/LRU-2 (which, again, to be clear, my prototype did not do). The clocksweep could maintain statistics about the recency of the second-to-last access across all buffers, and discriminate against buffers according to what bucket of the population they fit in to. Not sure how aggressively we'd penalize those buffers that had very old penultimate references (or credit those that had very recent penultimate references), or what the bucket partitioning scheme is, but that's probably where I'd take it next. For example, buffers with a penultimate reference that is more than a standard deviation below the mean would be double penalized (and maybe the opposite, for those buffers with penultimate accesses a stddev above the mean). If that didn't work so well, then I'd look into an ARC style recency and frequency list (while remembering things about already evicted blocks, which LRU-K does not do, although that paper is from the early 1990s). There are approaches to relieving lock
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 16, 2014 at 2:44 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: Anyways, I'm still curious if you can post similar numbers basing the throttling on gross allocation counts instead of time. Meaning: some number of buffer allocations has to have occurred before you consider eviction. Besides being faster I think it's a better implementation: an intermittently loaded server will give more consistent behavior. Yeah --- I think wall-clock-based throttling is fundamentally the wrong thing anyway. Are we going to start needing a CPU speed measurement to tune the algorithm with? Not the place to be going. But driving it off the number of allocations that've been done could be sensible. (OTOH, that means you need a central counter, which itself would be a bottleneck.) So, I was thinking about this a little bit more today, prodded by my coworker John Gorman. I'm wondering if we could drive this off of the clock sweep; that is, every time the clock sweep touches a buffer, its usage count goes down by one, but we also set two flag bits. Every time the buffer gets touched thereafter, we check whether any flag bits are set; if so, we clear one and increase the usage count, else we do nothing. So the usage count can increase at most twice per clock sweep. The advantage of that is that, as with Peter's approach, it is harder for the usage count of a buffer to max out - to get there, you need sustained access over a longer period of time. But it's not time-dependent, so replaying the same workload at twice the speed or half the speed on faster or slower hardware doesn't change the choice of which buffer to evict, which seems good. And it will cause us to prefer buffers which are accessed repeatedly over a period of time rather than buffers that are accessed a bunch of times in quick succession and then not touched again for a while, which seems like a good bet. 
I can't convince myself that this fully solves the (currently existing) problem of usage counts increasing too fast. In theory, every time the clock sweep hand advances, it decrements the usage count of one buffer by one, but also makes it possible for the usage count of that buffer to increase by up to two (or by only one if the buffer previously had a usage count of five). So with the right access pattern, it seems like you could still get to a place where a typical buffer eviction has to do a lot of scanning before it finds a victim buffer. As with the present algorithm, we'd be most likely to have that problem when the buffer hit ratio is very high, so that there are many opportunities for usage counts to go up between each opportunity to push usage counts back down. But it seems like it would at least limit the damage. I haven't tried this or anything, so this is just random brainstorming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
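The counting rule proposed above can be sketched concretely. This is a hypothetical, simplified model (not actual bufmgr.c code): a sweep decrements the usage count and grants the buffer two "touch credits" (the two flag bits); each subsequent access spends one credit to bump the count, so the count can rise by at most two per sweep cycle:

```c
#include <assert.h>

#define NBUFFERS  8
#define MAX_USAGE 5

typedef struct
{
    int usage_count;
    int touch_credits;  /* models the "two flag bits" from the proposal */
} Buffer;

static Buffer buffers[NBUFFERS];
static int sweep_hand;

/* A backend access: only effective while credits remain from the last sweep. */
static void
touch_buffer(int i)
{
    if (buffers[i].touch_credits > 0)
    {
        buffers[i].touch_credits--;
        if (buffers[i].usage_count < MAX_USAGE)
            buffers[i].usage_count++;
    }
}

/* One clock-sweep step: decrement usage by one, re-grant two credits. */
static void
sweep_step(void)
{
    Buffer *b = &buffers[sweep_hand];

    sweep_hand = (sweep_hand + 1) % NBUFFERS;
    if (b->usage_count > 0)
        b->usage_count--;
    b->touch_credits = 2;
}
```

This models only the counting rule, not eviction, pinning, or the initial usage count at allocation; the point is that no access pattern, however hot, can raise a buffer's count faster than two per sweep pass.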
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
I've been meaning to write this since PGConf and now isn't a great time since I'm on my phone but I think it's time. The way the clock sweep algorithm is meant to be thought about is that it's an approximate LRU. Each usage count corresponds to an ntile of the LRU. So we don't know which buffer is least recently used, but it must be in the set of buffers with usage count 0, and that should be 1/nth of all the buffers. In order for that property to be maintained, though, the usage count for some buffer should be getting decremented every time we touch a buffer. That is, every time we promote one buffer to the most recently used ntile we should be demoting some other buffer. The way our cache works, we promote when a buffer is accessed but we only demote when a buffer is flushed. We flush a lot less often than we touch buffers, so it's not surprising that the cache ends up full of buffers that are all in the most recently used section. Now it's complicated by the fact that we aren't promoting buffers directly to the most recently used ntile. We're incrementing the usage count by one. That makes it more of a least frequently used list rather than an LRU. I think that's a mistake but I recall some debate about that when it first went in.
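One way to restore the invariant described above, every promotion paying for a demotion, can be sketched as follows. This is a hypothetical illustration of the idea, not a proposed bufmgr implementation (in particular it ignores locking, which is exactly where such a scheme would hurt):

```c
#include <assert.h>

#define NBUFFERS  8
#define MAX_USAGE 5

typedef struct
{
    int usage_count;
} Buffer;

static Buffer buffers[NBUFFERS];
static int demote_hand;

/*
 * Promote the touched buffer, and pay for the promotion by demoting one
 * other buffer, so the total amount of "recency" in the pool stays
 * roughly constant instead of drifting toward everything-at-max.
 */
static void
touch_buffer(int i)
{
    if (buffers[i].usage_count < MAX_USAGE)
        buffers[i].usage_count++;

    /* advance the demotion hand, skipping the buffer just touched */
    do
        demote_hand = (demote_hand + 1) % NBUFFERS;
    while (demote_hand == i);

    if (buffers[demote_hand].usage_count > 0)
        buffers[demote_hand].usage_count--;
}
```

Under this rule, the sum of all usage counts can never exceed what the hot buffers alone have earned, so the "set of buffers with usage count 0" never empties out the way it does when demotion only happens at flush time.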
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 4/14/15 6:07 PM, Simon Riggs wrote: On 11 March 2015 at 20:55, Peter Eisentraut pete...@gmx.net wrote: I don't know how to move forward. We could give users a knob: This might make your queries faster or not -- good luck. But of course nobody will like that either. What is clear is that large SELECT queries are doing the work VACUUM should do. We should not be doing large background tasks (block cleanup) during long running foreground tasks. But there is no need for changing behaviour during small SELECTs. So the setting of 4 gives current behaviour for small SELECTs and new behaviour for larger SELECTs. The OP said this: "We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. Effects of that are that long running statements often run much longer than we want, increasing bloat as a result. It also produces wildly varying response times, depending upon extent of cleanup required." This is not a performance patch. This is about one user doing the cleanup work for another. People running large SELECTs should not be penalised. The patch has been shown to avoid that and no further discussion should be required. I don't really care whether we have a parameter for this or not. As long as we have the main feature. It's trivial to add/remove a parameter to control this. Currently there isn't one. I'd like to commit this. +1 from me. One of the last databases I worked on had big raw partitions that were written to and then sequentially scanned exactly once before being dropped. It was painful to see all those writes happening for nothing. 
In other cases there were sequential scans that happened directly after the main writes, but then the next read might be days in the future (if ever) and the system was basically idle for a while, which would have allowed vacuum to come in and do the job without affecting performance of the main job. I think that in batch-oriented databases this patch will definitely be a boon to performance. -- - David Steele da...@pgmasters.net
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
This patch does not apply cleanly due to the moving of pgbench (patch to filelist.sgml failed). Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp Attached is the v7 pg_audit patch. I've tried to address Peter's documentation concerns by cleaning up the terminology and adding a real-world case plus usage recommendations. The word auditing has been expunged from the docs in favor of the term audit logging. Per Simon's request, there is now a pg_audit.log_relation setting that makes session audit logging exhaustively log all relations as it did before. The ROLE logging class is back as well. Simon also suggested a way that pg_audit could be tested with standard regression, so I have converted all tests over and removed test.pl. Sawada, I'd certainly appreciate it if you'd try again and see if you are still getting a segfault with your test code (which you can find in the regression tests). Currently the patch will compile on master (I tested with b22a36a) or optionally with Alvaro's deparse patches applied (only 0001 and 0002 needed). I've supplied a different regression test out file (expected/pg_audit-deparse.out) which can be copied over the standard out file (expected/pg_audit.out) if you'd like to do regression on pg_audit with deparse. The small section of code that calls pg_event_trigger_ddl_commands() can be compiled by defining DEPARSE, or the #ifdefs around that block can be removed. Please let me know if I've missed anything and I look forward to comments and questions. Thanks, -- - David Steele da...@pgmasters.net