Re: [HACKERS] index-only quals vs. security_barrier views
On Thu, Feb 09, 2012 at 12:02:29PM -0500, Robert Haas wrote: > When Heikki worked up his original index-only scan patches (which > didn't end up looking much like what eventually got committed), he had > the notion of an index-only qual. That is, given a query like this: > > select sum(1) from foo where substring(a,1,3) = 'abc'; > > We could evaluate the substring qual before performing the heap fetch, > and fetch the tuple from the heap only if the qual passes. > Now, there's a fly in the ointment here, which is that applying > arbitrary user-defined functions to tuples that might not be visible > doesn't sound very safe. The user-defined function in question might > perform some action based on those invisible tuples that has side > effects, which would be bad, because now we're violating MVCC > semantics. Or it might throw an error, killing the scan dead on the > basis of the contents of some tuple that the scan shouldn't even see. > However, there is certainly a class of functions for which this type > of optimization would be safe, and it's an awful lot like the set of > functions that can be safely pushed down through a security_barrier > view - namely, things that don't have side effects or throw errors. > So it's possible that the proleakproof flag KaiGai is proposing to add > to pg_proc could do double duty, serving also to identify when it's > safe to apply a qual to an index tuple when the corresponding heap > tuple might be invisible. However, I have some reservations about > assuming that the two concepts are exactly the same. For one thing, > people are inevitably going to want to cheat a little bit more here > than is appropriate for security views, and encouraging people to mark > things LEAKPROOF when they're really not is a security disaster > waiting to happen. The similarity is indeed tempting, but I find the concepts sufficiently distinct to not make one device serve both. Adding to the reservations you offer, LEAKPROOF is superuser-only. This other marker would not entail any special privilege. > For another thing, there are some important cases > that this doesn't cover, like: > > select * from foo where substring(a,1,3) like '%def%'; > > The like operator doesn't seem to be leakproof in the security sense, > because it can throw an error if the pattern is something like a > single backslash (ERROR: LIKE pattern must not end with escape > character) and indeed it doesn't seem like it would be safe here > either if the pattern were stored in the table. But if the pattern > were constant, it'd be OK, or almost OK: there's still the edge case > where the table contains invisible rows but no visible ones - whether > or not we complain about the pattern there ought to be the same as > whether or not we complain about it on a completely empty table. If > we got to that point, then we might as well consider the qual > leakproof for security purposes under the same set of circumstances > we'd consider it OK to apply to possibly-invisible tuples. This sort of thing implicates substring(), too, when you call it as substring(a, 1, b); b < 0 produces an error. To handle these, I think we'd need a facility along the lines of protransform. Have a function inspecting call nodes for a particular other function and determining whether each is ok-for-index-only-quals. You could even force protransform itself into that role. Create an additional pg_proc entry identical to the ordinary substring() but for a different name and having the ok-for-index-only-quals flag. Add a protransform to the main pg_proc entry that inspects the argument nodes and, when they're safe, replaces the call with a call to that errorfree_substring() at plan time. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only quals vs. security_barrier views
On Sat, Feb 11, 2012 at 7:16 AM, Jesper Krogh wrote: > Ok, but there are still cases where we don't even need to construct > a data tuple at all: > > 2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable where > fts @@ to_tsquery('english','test1'); > QUERY PLAN > --- > Aggregate (cost=31.24..31.25 rows=1 width=0) > -> Bitmap Heap Scan on testtable (cost=16.03..31.23 rows=4 width=0) > Recheck Cond: (fts @@ '''test1'''::tsquery) > -> Bitmap Index Scan on ftsid (cost=0.00..16.03 rows=4 width=0) > Index Cond: (fts @@ '''test1'''::tsquery) > (5 rows) In that case I believe you DO need the heap tuple. That "Recheck Cond" there means that the index might be lossy - i.e. return tuples that don't really match the search condition. > Another idea sprung into my head, that indices on (ctid, columns>) > could actually serve as some kind of "vertical" partitioning of the table. The ctid of a tuple is its physical position in the table. It makes no sense to index that. Since it's unique, it makes even less sense to index that plus other things in the same index. Does anyone have any comments on the issue raised in my original email? I would like to get (some version of) his patch committed, but I would also like to not back ourselves into a corner. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only quals vs. security_barrier views
On 2012-02-09 22:17, Jesper Krogh wrote: On 2012-02-09 21:09, Robert Haas wrote: That doesn't make sense to me. If you probe index A for rows where a = 1 and find that CTID (100,1) is such a row, and now want to return a column value b that is not present in that index, the fastest way to get the row is going to be to fetch block 100 from the heap and return the data out of the first tuple. To get the value out of some other index that does include column b would require scanning the entire index looking for that CTID, just so you could then grab the corresponding index tuple, which wouldn't make any sense at all. You're right, in my head, everything it wired up against my primary keys, of-course that isn't the case for the DB. Sorry for the noise. Ok, but there are still cases where we don't even need to construct a data tuple at all: 2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable where fts @@ to_tsquery('english','test1'); QUERY PLAN --- Aggregate (cost=31.24..31.25 rows=1 width=0) -> Bitmap Heap Scan on testtable (cost=16.03..31.23 rows=4 width=0) Recheck Cond: (fts @@ '''test1'''::tsquery) -> Bitmap Index Scan on ftsid (cost=0.00..16.03 rows=4 width=0) Index Cond: (fts @@ '''test1'''::tsquery) (5 rows) Another idea sprung into my head, that indices on (ctid,columns>) could actually serve as some kind of "vertical" partitioning of the table. Wether it actually will me more efficient or not need to be tested. Jesper -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only quals vs. security_barrier views
On 2012-02-09 21:09, Robert Haas wrote: That doesn't make sense to me. If you probe index A for rows where a = 1 and find that CTID (100,1) is such a row, and now want to return a column value b that is not present in that index, the fastest way to get the row is going to be to fetch block 100 from the heap and return the data out of the first tuple. To get the value out of some other index that does include column b would require scanning the entire index looking for that CTID, just so you could then grab the corresponding index tuple, which wouldn't make any sense at all. You're right, in my head, everything it wired up against my primary keys, of-course that isn't the case for the DB. Sorry for the noise. -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only quals vs. security_barrier views
On Thu, Feb 9, 2012 at 1:33 PM, Jesper Krogh wrote: > On 2012-02-09 18:02, Robert Haas wrote: >> >> I don't have any appetite for trying to do anything more with >> index-only scans for 9.2, though maybe someone else will think >> otherwise. But I would like very much to get KaiGai's leakproof stuff >> committed, and so it seems like a good idea to reconcile the needs of >> that machinery with what might eventually be needed here. > > Those were a couple of nice cases where index-only-scans > could win more than they does today. I have another one here: > > 2012-02-09 19:17:28.788 jk=# \d testtable > Table "public.testtable" > Column | Type | Modifiers > +--+ > id | integer | not null default nextval('testtable_id_seq'::regclass) > fts | tsvector | > Indexes: > "prk_idx" UNIQUE, btree (id) > "fts_id" gin (fts) > > 2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts @@ > to_tsquery('english','test1000'); > QUERY PLAN > --- > Bitmap Heap Scan on testtable (cost=20.29..161.28 rows=37 width=4) > Recheck Cond: (fts @@ '''test1000'''::tsquery) > -> Bitmap Index Scan on fts_id (cost=0.00..20.28 rows=37 width=0) > Index Cond: (fts @@ '''test1000'''::tsquery) > (4 rows) > > Time: 0.494 ms > 2012-02-09 19:19:52.748 jk=# > > In this situation the tuple can be regenerated from the index, but > not from the index-satisfying the where clause, this allows significantly > more complex where-clauses and may also benefit situations where > we only going for one or more of the primary-key/foreing-key columns > for join-conditions. I don't understand what you're saying here. > Above situation does not need to involve a gin-index, but a btree index > where the where clause can be matched up using one index, and the tuple > constructed using another falls into the same category. That doesn't make sense to me. If you probe index A for rows where a = 1 and find that CTID (100,1) is such a row, and now want to return a column value b that is not present in that index, the fastest way to get the row is going to be to fetch block 100 from the heap and return the data out of the first tuple. To get the value out of some other index that does include column b would require scanning the entire index looking for that CTID, just so you could then grab the corresponding index tuple, which wouldn't make any sense at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only quals vs. security_barrier views
On 2012-02-09 18:02, Robert Haas wrote: I don't have any appetite for trying to do anything more with index-only scans for 9.2, though maybe someone else will think otherwise. But I would like very much to get KaiGai's leakproof stuff committed, and so it seems like a good idea to reconcile the needs of that machinery with what might eventually be needed here. Those were a couple of nice cases where index-only-scans could win more than they does today. I have another one here: 2012-02-09 19:17:28.788 jk=# \d testtable Table "public.testtable" Column | Type | Modifiers +--+ id | integer | not null default nextval('testtable_id_seq'::regclass) fts| tsvector | Indexes: "prk_idx" UNIQUE, btree (id) "fts_id" gin (fts) 2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts @@ to_tsquery('english','test1000'); QUERY PLAN --- Bitmap Heap Scan on testtable (cost=20.29..161.28 rows=37 width=4) Recheck Cond: (fts @@ '''test1000'''::tsquery) -> Bitmap Index Scan on fts_id (cost=0.00..20.28 rows=37 width=0) Index Cond: (fts @@ '''test1000'''::tsquery) (4 rows) Time: 0.494 ms 2012-02-09 19:19:52.748 jk=# In this situation the tuple can be regenerated from the index, but not from the index-satisfying the where clause, this allows significantly more complex where-clauses and may also benefit situations where we only going for one or more of the primary-key/foreing-key columns for join-conditions. Above situation does not need to involve a gin-index, but a btree index where the where clause can be matched up using one index, and the tuple constructed using another falls into the same category. -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] index-only quals vs. security_barrier views
When Heikki worked up his original index-only scan patches (which didn't end up looking much like what eventually got committed), he had the notion of an index-only qual. That is, given a query like this: select sum(1) from foo where substring(a,1,3) = 'abc'; We could evaluate the substring qual before performing the heap fetch, and fetch the tuple from the heap only if the qual passes. The current code is capable of generating an index-only scan plan for this query, but the MVCC visibility check always happens first: if the page is all-visible, we go ahead and evaluate the qual using only the index data, but if the page is not all-visible, we do the heap fetch first and then check the qual only if the tuple is visible to our snapshot. It would be nice to have the ability to do those checks in the other order, in the case where the projected cost of checking the qual is less than the projected cost of the heap fetch. This would allow index-only scans to win in more situations than they can right now, because we'd conceivably avoid quite a bit of random I/O if the qual is fairly selective and there are a decent number of visibility map bits that are unset. In fact, this technique might pay off even if we don't have a covering index: select * from foo where substring(a,1,3) = 'abc'; If the expected selectivity of the qual is low and the index is a lot smaller than the table, we might want to iterate through all the index tuples in the entire index and fetch the heap tuples for only those where the qual passes. This would allow index-only scans - or whatever term you want to use, since there's not much that's index "only" about this case - to potentially win even when no visibility-map bits are set at all. Now, there's a fly in the ointment here, which is that applying arbitrary user-defined functions to tuples that might not be visible doesn't sound very safe. The user-defined function in question might perform some action based on those invisible tuples that has side effects, which would be bad, because now we're violating MVCC semantics. Or it might throw an error, killing the scan dead on the basis of the contents of some tuple that the scan shouldn't even see. However, there is certainly a class of functions for which this type of optimization would be safe, and it's an awful lot like the set of functions that can be safely pushed down through a security_barrier view - namely, things that don't have side effects or throw errors. So it's possible that the proleakproof flag KaiGai is proposing to add to pg_proc could do double duty, serving also to identify when it's safe to apply a qual to an index tuple when the corresponding heap tuple might be invisible. However, I have some reservations about assuming that the two concepts are exactly the same. For one thing, people are inevitably going to want to cheat a little bit more here than is appropriate for security views, and encouraging people to mark things LEAKPROOF when they're really not is a security disaster waiting to happen. For another thing, there are some important cases that this doesn't cover, like: select * from foo where substring(a,1,3) like '%def%'; The like operator doesn't seem to be leakproof in the security sense, because it can throw an error if the pattern is something like a single backslash (ERROR: LIKE pattern must not end with escape character) and indeed it doesn't seem like it would be safe here either if the pattern were stored in the table. But if the pattern were constant, it'd be OK, or almost OK: there's still the edge case where the table contains invisible rows but no visible ones - whether or not we complain about the pattern there ought to be the same as whether or not we complain about it on a completely empty table. If we got to that point, then we might as well consider the qual leakproof for security purposes under the same set of circumstances we'd consider it OK to apply to possibly-invisible tuples. I don't have any appetite for trying to do anything more with index-only scans for 9.2, though maybe someone else will think otherwise. But I would like very much to get KaiGai's leakproof stuff committed, and so it seems like a good idea to reconcile the needs of that machinery with what might eventually be needed here. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers