Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
Petr Jelinek p...@2ndquadrant.com writes: On 2015-07-12 18:02, Tom Lane wrote: A possible way around this problem is to redefine the sampling rule so that it is not history-dependent but depends only on the tuple TIDs. For instance, one could hash the TID of a candidate tuple, xor that with a hash of the seed being used for the current query, and then select the tuple if (hash/MAXINT) P. That would work for bernoulli for physical tuples, yes. Only thing that worries me is future extensibility for data sources that only provide virtual tuples. Well, repeatability of a TABLESAMPLE attached to a join seems like an unsolved and possibly unsolvable problem anyway. I don't think we should assume that the API we define today will cope with that. But that is another reason why the current API is inadequate: there's no provision for specifying whether or how a tablesample method can be applied to non-base-table RTEs. (I re-read the thread and noted that Peter E. complained about that some time ago, but nothing was done about it. I'm fine with not supporting the case right now, but nonetheless it's another reason why we'd better make the API more easily extensible.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
On 2015-07-16 16:22, Tom Lane wrote: Petr Jelinek p...@2ndquadrant.com writes: On 2015-07-12 18:02, Tom Lane wrote: A possible way around this problem is to redefine the sampling rule so that it is not history-dependent but depends only on the tuple TIDs. For instance, one could hash the TID of a candidate tuple, xor that with a hash of the seed being used for the current query, and then select the tuple if (hash/MAXINT) P. That would work for bernoulli for physical tuples, yes. Only thing that worries me is future extensibility for data sources that only provide virtual tuples. Well, repeatability of a TABLESAMPLE attached to a join seems like an unsolved and possibly unsolvable problem anyway. I don't think we should assume that the API we define today will cope with that. Ok, It's true that the implementations I've seen in other databases so far only concern themselves by sampling physical relations and ignore the rest. But that is another reason why the current API is inadequate: there's no provision for specifying whether or how a tablesample method can be applied to non-base-table RTEs. (I re-read the thread and noted that Peter E. complained about that some time ago, but nothing was done about it. I'm fine with not supporting the case right now, but nonetheless it's another reason why we'd better make the API more easily extensible.) Nothing in terms of implementation yes, I did write my idea on how this could be done via extending the current API in the future. I won't try to pretend that I am absolutely sure that the API might not need some breaking change to do that though. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
On 12 July 2015 at 18:50, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, Jul 12, 2015 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: As best I can tell (evidence below), the SQL standard requires that if a single query reads a table with a TABLESAMPLE clause multiple times (say, because it's on the inside of a nestloop), then the exact same set of sampled rows are returned each time. Hmm, I tend to agree that it would be good if it behaved that way. Otherwise, it seems like the behavior could be quite surprising. Yeah. As a concrete example, consider select * from t1, t2 tablesample ... where t1.x = t2.x and suppose that there are multiple occurences of x = 10 in both tables. As things stand, if the join is done as a nestloop then a particular t2 row with x = 10 might appear in the output joined with some of the t1 rows with x = 10 but not with others. On the other hand, the results of a hash join would not be inconsistent in that way, since t2 would be read only once. Hmm, a non-key join to a sampled table. What would the meaning of such a query be? The table would need to big enough to experience updates and also be under current update activity. BERNOULLI isn't likely to have many users because it is so slow. So overall, such a query is not useful and as such unlikely. The mechanism of sampling was discussed heavily before and there wasn't an approach that met all of the goals: IIRC we would need to test visibility twice on each tuple to get around these problems. Given that users of TABLESAMPLE have already explicitly stated their preference for speed over accuracy, minor tweaks to handle corner cases don't seem warranted. If you have a simple, better way I would not object. Forgive me, I haven't yet understood your proposal about sampling rule above. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
On Sun, Jul 12, 2015 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: As best I can tell (evidence below), the SQL standard requires that if a single query reads a table with a TABLESAMPLE clause multiple times (say, because it's on the inside of a nestloop), then the exact same set of sampled rows are returned each time. Hmm, I tend to agree that it would be good if it behaved that way. Otherwise, it seems like the behavior could be quite surprising. Generally, we don't want the set of tuples that can be seen by a query to change during the query; that's one of the things that snapshot isolation does for us, as compared with, say, a literal interpretation of READ COMMITTED, which would behave as SnapshotNow used to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE doesn't actually satisfy the SQL spec, does it?
Robert Haas robertmh...@gmail.com writes: On Sun, Jul 12, 2015 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote: As best I can tell (evidence below), the SQL standard requires that if a single query reads a table with a TABLESAMPLE clause multiple times (say, because it's on the inside of a nestloop), then the exact same set of sampled rows are returned each time. Hmm, I tend to agree that it would be good if it behaved that way. Otherwise, it seems like the behavior could be quite surprising. Yeah. As a concrete example, consider select * from t1, t2 tablesample ... where t1.x = t2.x and suppose that there are multiple occurences of x = 10 in both tables. As things stand, if the join is done as a nestloop then a particular t2 row with x = 10 might appear in the output joined with some of the t1 rows with x = 10 but not with others. On the other hand, the results of a hash join would not be inconsistent in that way, since t2 would be read only once. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers