Re: [HACKERS] Theory of operation of collation patch
On Wed, Mar 9, 2011 at 1:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: Another interesting item ... I see that you added a collation field to TypeName, apparently on the grounds that the SQL spec includes collation in data type. However, it seems to me that that is nonsense up with which we should not put. The SQL committee has demonstrably awful taste but they're usually not entirely nutty. Usually whatever they do they had a reason. Is there any hint at all of what they were trying to accomplish here? When you say basically only used in CAST and column definitions are there other less common cases where it's convenient? Or was this just a case of some existing database allowed COLLATE clauses in column definitions as part of the type and they preferred to have it in a different syntax so they did this to allow either to work? If we allow collate clause in ColQualList are we covering the very case they were trying to deal with? Can you give an example of what a column definition would look like if you put the COLLATE clause in the data type in a way that wouldn't be parsed according to your plan? -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Greg Stark gsst...@mit.edu writes: Can you give an example of what a column definition would look like if you put the COLLATE clause in the data type in a way that wouldn't be parsed according to your plan? Column definitions look and act the same. The point of the change is to not accept COLLATE in all the places where type names are used that are *not* column definition lists. The original patch accepted nonsense like create cast (text collate C as text collate de_DE) with ... and we'd have had to do quite a lot of additional hacking to plug those holes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On tis, 2011-03-08 at 20:52 -0500, Tom Lane wrote: I think we should drop collate clause from TypeName and just have it in columnDef and the expression syntax. Yes, that sounds better in retrospect. It's easier to see that now that we see all the cases where it's used and not used. This might also ease the ambiguity problem that evidently led you to restrict the expression production's argument to c_expr. Maybe, but I seem to recall that I did actually check and concluded that c_expr covers all cases where collate clause is allowed. We could of course allow more cases, but maybe it's not necessary. It would also allow us to meet the letter of the spec for column definition, in that collate clause is not required to immediately follow data type. Note that that case is listed under a separate feature. I'm not sure if it's worth supporting, but if they bothered putting it in it's probably for compatibility with some existing implementation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On Tue, Mar 08, 2011 at 08:52:41PM -0500, Tom Lane wrote: Another interesting item ... I see that you added a collation field to TypeName, apparently on the grounds that the SQL spec includes collation in data type. However, it seems to me that that is nonsense up with which we should not put. data type is basically only used in CAST and snip Sounds like a good plan. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Theory of operation of collation patch
On Mon, Mar 07, 2011 at 07:15:28PM -0500, Tom Lane wrote: Only if the expression-to-be-sorted does not already fully specify the collation, which so far as I can tell (either from the code or your description above) it does. I think that the explicit representation of collation as part of the PathKey node is unnecessary, inefficient, and bug-inducing --- the latter because it promotes fuzzy thinking about where the collation information is coming from. (And this isn't just hypothetical: IMO the bugs I exhibited upthread are *directly* due to fuzzy thinking about what defines an index's sort order.) Well, collation processing happens in two phases. Initially collate information is provided by the columns in the query, explicit clauses, etc. These are indeed attached to the values. From here the collations of expressions are determined. The SQL committee thought up a bunch of actually quite logical rules here. Explicit overrides implicit, implicit overrides default. Both explicit or both implicit is an error, etc. Note error state is only a problem if you use it for sorting or comparison, otherwise it is ignored. This phase of processing happens in the parse analysis, the end result being that every expression should have a collation set, and every operator where it matters has consistant collation information for its arguments. So at this point the collation should be attached to the expression. In the planning phase however, all collation information is ignored except where it matters, which is the comparison operators and ORDER BY and similar. By the previous phase all comparison operators should have a defined collation for its arguments and thus for itself, this allows it to construct appropriate pathkeys for index scans, etc. I havn't looked at the patch, perhaps it confuses the information in the two phases but the basic idea seems to me: - parse analysis - collation in expressions - planning - collation in path key Or, to put it another way: the properties that define a sort order are the sort comparison operator, the collation, the ASC/DESC bit, and the NULLS FIRST/LAST bit. Given the way that the SQL committee has constructed the language, the operator and the two flag bits are attached to the ORDER BY clause, but the collation is a property of the expression-to-be-sorted. If we fail to preserve that distinction in the internal representation, we're just creating problems for ourselves. The assigning of the collation to expressions is simply the method for getting the collation information to the operators. Mainly because 99% of the time there will be no explicit clauses in the queries, so the information has to get there by other means. Hope this helps, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Theory of operation of collation patch
On 07.03.2011 17:43, Tom Lane wrote: because two expressions that are equal() must necessarily have the same collation property. Peter, Tom, I am not able to see this. If 'abc' == 'abc' is not collation depending at all. It is only encoding depending. Collation is only needed for upper(), lower() and sorting. Means it tells if e.g. upper('i') will get Y or I. It tells if the German s-umlaut will be sorted together with 's' or after 'z'. Btw. the follows on implementing collations will be different - and I hope Peter is aware of it. My experience is that a huge follow will be that users will complain that the sorting isn't correct even when it is correct. Susanne -- Susanne Ebrecht - 2ndQuadrant PostgreSQL Development, 24x7 Support, Training and Services www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Susanne Ebrecht susa...@2ndquadrant.com writes: On 07.03.2011 17:43, Tom Lane wrote: because two expressions that are equal() must necessarily have the same collation property. Peter, Tom, I am not able to see this. If 'abc' == 'abc' is not collation depending at all. It is only encoding depending. Sorry, I was using a term of art there. This isn't about data values, it's about parsed expression trees. See the equal() function in src/backend/nodes/equalfuncs.c. If two expression trees match according to equal(), then they must necessarily have the same collation property, because equal() compares all the fields and in particular the collation fields. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Martijn van Oosterhout klep...@svana.org writes: This phase of processing happens in the parse analysis, the end result being that every expression should have a collation set, and every operator where it matters has consistant collation information for its arguments. So at this point the collation should be attached to the expression. Right. I havn't looked at the patch, perhaps it confuses the information in the two phases but the basic idea seems to me: - parse analysis - collation in expressions - planning - collation in path key Not so right. A path key contains an expression tree, plus whatever *additional* information is needed to fully specify the sort ordering. If the collation is already fully determined by the expression tree, there is no need to duplicate that information in the PathKey node. And, as I said, doing so anyway has real negative consequences. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: Not so right. A path key contains an expression tree, plus whatever *additional* information is needed to fully specify the sort ordering. If the collation is already fully determined by the expression tree, there is no need to duplicate that information in the PathKey node. And, as I said, doing so anyway has real negative consequences. Isn't the reason to copy that information outside the expression so that we can choose sometimes to ignore it? Namely, for == we can use an index with any defined collation even if it doesn't match the collation in the pathkey we're looking for? I think currently that's the only example but in theory we could have collations that are supersets of the desired collation. For example a UTF8 collation that sorts english in the desired way and sorts utf8 characters in some way that isn't relevant to the query. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Greg Stark gsst...@mit.edu writes: On Tue, Mar 8, 2011 at 3:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: Not so right. A path key contains an expression tree, plus whatever *additional* information is needed to fully specify the sort ordering. If the collation is already fully determined by the expression tree, there is no need to duplicate that information in the PathKey node. And, as I said, doing so anyway has real negative consequences. Isn't the reason to copy that information outside the expression so that we can choose sometimes to ignore it? Namely, for == we can use an index with any defined collation even if it doesn't match the collation in the pathkey we're looking for? No, that's nonsense. A PathKey is all about sort order. It is not relevant in any situation where sort order is ignorable. I think currently that's the only example but in theory we could have collations that are supersets of the desired collation. For example a UTF8 collation that sorts english in the desired way and sorts utf8 characters in some way that isn't relevant to the query. That sounds like nonsense as well. There is no such thing as a superset ordering: to have such a thing, you'd have to have some orderings that didn't fully determine the ordering of data values, which is pretty much unworkable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote: It looks like indcollation is acting as a substitute for including a CollateClause in the index key expression, which doesn't seem like a particularly good tradeoff considering all the overhead you must introduce into the default case. Conceptually, this characterization is correct. But assuming you do away with the indcollation column and instead create expression indexes on (col COLLATE x), you'd then have an interesting time matching those expressions to expressions in the query when choosing an index. Example: CREATE TABLE test (a text); CREATE INDEX test_idx ((a COLLATE foo)); -- expression index -- easy to choose the index here SELECT * FROM test WHERE a COLLATE foo 'x'; -- not so easy here SELECT * FROM test WHERE a 'x' COLLATE foo; -- more difficult CREATE TABLE test2 (b text COLLATE foo); SELECT * FROM test WHERE x (SELECT b FROM test2 WHERE something); Note that this situation already exists: If you create a plain column index using CREATE INDEX test_idx (a COLLATE foo), all of the above cases should pick up the index; if you create an expression index using CREATE INDEX test_idx ((a COLLATE foo)), they won't, except the first one. (That is modulo your earlier demonstration that it apparently doesn't work at all for you, but I'm just saying in case you want to try out various ways of changing this.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On mån, 2011-03-07 at 19:15 -0500, Tom Lane wrote: Or, to put it another way: the properties that define a sort order are the sort comparison operator, the collation, the ASC/DESC bit, and the NULLS FIRST/LAST bit. Given the way that the SQL committee has constructed the language, the operator and the two flag bits are attached to the ORDER BY clause, but the collation is a property of the expression-to-be-sorted. If we fail to preserve that distinction in the internal representation, we're just creating problems for ourselves. You are probably right; it could be simplified. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Peter Eisentraut pete...@gmx.net writes: On mån, 2011-03-07 at 12:52 -0500, Tom Lane wrote: It looks like indcollation is acting as a substitute for including a CollateClause in the index key expression, which doesn't seem like a particularly good tradeoff considering all the overhead you must introduce into the default case. Conceptually, this characterization is correct. But assuming you do away with the indcollation column and instead create expression indexes on (col COLLATE x), you'd then have an interesting time matching those expressions to expressions in the query when choosing an index. Hmm ... interesting point, but I think it already doesn't work in all cases for expression indexes. There is an analogous issue for binary-compatible datatypes, which we hack around by stripping RelabelType nodes before doing this type of comparison. Maybe we'll have to do something similar with CollateClauses. I never much cared for that hack, though ... wonder if there is a cleaner way. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Another interesting item ... I see that you added a collation field to TypeName, apparently on the grounds that the SQL spec includes collation in data type. However, it seems to me that that is nonsense up with which we should not put. data type is basically only used in CAST and column definitions (plus some features related to methods, which we haven't got). In CAST, they have to have a syntax rule stating that you must not use COLLATE in the target type name. In column definitions, they call out collate clause as a separate syntactic element, and then they have to have a syntax rule stating that you can't use both that and a collate clause embedded in the type name. This is just stupid. And it conflates the notion of collation with that of type --- based on our discussions so far, it seems like it'd be a lot better to keep those separate. I think we should drop collate clause from TypeName and just have it in columnDef and the expression syntax. This might also ease the ambiguity problem that evidently led you to restrict the expression production's argument to c_expr. It would also allow us to meet the letter of the spec for column definition, in that collate clause is not required to immediately follow data type. I see that right now you don't allow collate clause in ColQualList, so it's not accepting syntax that the spec says is valid; and you really can't fix that as long as collate clause can be part of TypeName, because then there would be two ways to parse the case where COLLATE does immediately follow the type name. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Theory of operation of collation patch
Is there any documentation of $SUBJECT? Because the more I look at this patch the more I think it's misdesigned; either that or I fundamentally misunderstand what it's trying to do. I complained yesterday about why the planner wasn't using indcollation to identify the sort order of an index. I now think that the reason it doesn't obviously fail to fail is that indcollation is dead code, and so is approximately 99% of what you added to the planner, because two expressions that are equal() must necessarily have the same collation property. Tracking the collation as a separate property of a pathkey is thus a useless activity. If this conclusion isn't correct, please explain why not. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote: Is there any documentation of $SUBJECT? Because the more I look at this patch the more I think it's misdesigned; either that or I fundamentally misunderstand what it's trying to do. I complained yesterday about why the planner wasn't using indcollation to identify the sort order of an index. I now think that the reason it doesn't obviously fail to fail is that indcollation is dead code, and so is approximately 99% of what you added to the planner, because two expressions that are equal() must necessarily have the same collation property. Tracking the collation as a separate property of a pathkey is thus a useless activity. If this conclusion isn't correct, please explain why not. I'll have to check into these details, but here is a test case that shows that it's doing something with the collation of an index: DROP TABLE test; CREATE TABLE test (a text); CREATE INDEX test_1 ON test (a); CREATE INDEX test_d ON test (a COLLATE de_DE); CREATE INDEX test_e ON test (a COLLATE es_ES); CREATE INDEX test_f ON test (a COLLATE fr_FR); SET enable_seqscan TO off; SET enable_bitmapscan TO off; EXPLAIN SELECT * FROM test WHERE a 'foo'; QUERY PLAN --- Index Scan using test_1 on test (cost=0.00..51.90 rows=437 width=32) Index Cond: (a 'foo'::text) (2 rows) EXPLAIN SELECT * FROM test WHERE a 'foo' COLLATE de_DE; QUERY PLAN --- Index Scan using test_d on test (cost=0.00..51.90 rows=437 width=32) Index Cond: (a ('foo'::text COLLATE de_DE)) (2 rows) EXPLAIN SELECT * FROM test WHERE a 'foo' COLLATE es_ES; QUERY PLAN --- Index Scan using test_e on test (cost=0.00..51.90 rows=437 width=32) Index Cond: (a ('foo'::text COLLATE es_ES)) (2 rows) EXPLAIN SELECT * FROM test WHERE a 'foo' COLLATE fr_FR; QUERY PLAN --- Index Scan using test_f on test (cost=0.00..51.90 rows=437 width=32) Index Cond: (a ('foo'::text COLLATE fr_FR)) (2 rows) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
Peter Eisentraut pete...@gmx.net writes: On mån, 2011-03-07 at 11:43 -0500, Tom Lane wrote: ... I now think that the reason it doesn't obviously fail to fail is that indcollation is dead code, and so is approximately 99% of what you added to the planner, because two expressions that are equal() must necessarily have the same collation property. Tracking the collation as a separate property of a pathkey is thus a useless activity. If this conclusion isn't correct, please explain why not. I'll have to check into these details, but here is a test case that shows that it's doing something with the collation of an index: [ pokes at it some more... ] It looks like indcollation is acting as a substitute for including a CollateClause in the index key expression, which doesn't seem like a particularly good tradeoff considering all the overhead you must introduce into the default case. But more to the point, your examples do *not* work for me. I can reproduce both failing to use an index that should work, and selecting an index that doesn't work: d1u=# CREATE TABLE test (a text); CREATE TABLE d1u=# CREATE INDEX test_1 ON test (a); CREATE INDEX d1u=# CREATE INDEX test_d ON test (a COLLATE de_DE); CREATE INDEX d1u=# CREATE INDEX test_e ON test (a COLLATE es_ES); CREATE INDEX d1u=# CREATE INDEX test_f ON test (a COLLATE fr_FR); CREATE INDEX d1u=# CREATE INDEX test_fz ON test ((a||'z') COLLATE fr_FR); CREATE INDEX d1u=# explain select * from test order by a; QUERY PLAN Index Scan using test_f on test (cost=0.00..63.90 rows=1310 width=32) (1 row) d1u=# explain select * from test order by a collate fr_FR; QUERY PLAN --- Sort (cost=90.93..94.20 rows=1310 width=32) Sort Key: ((a COLLATE fr_FR)) - Seq Scan on test (cost=0.00..23.10 rows=1310 width=32) (3 rows) d1u=# set enable_seqscan TO 0; SET d1u=# explain select * from test order by a collate fr_FR; QUERY PLAN -- Sort (cost=190.93..194.20 rows=1310 width=32) Sort Key: ((a COLLATE fr_FR)) - Seq Scan on test (cost=100.00..123.10 rows=1310 width=32) (3 rows) d1u=# explain select * from test order by a||'z'; QUERY PLAN - Index Scan using test_fz on test (cost=0.00..67.18 rows=1310 width=32) (1 row) (This is in a database with encoding utf8 and lc_collate = c) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Theory of operation of collation patch
On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote: Is there any documentation of $SUBJECT? Because the more I look at this patch the more I think it's misdesigned; either that or I fundamentally misunderstand what it's trying to do. I complained yesterday about why the planner wasn't using indcollation to identify the sort order of an index. I now think that the reason it doesn't obviously fail to fail is that indcollation is dead code, and so is approximately 99% of what you added to the planner, because two expressions that are equal() must necessarily have the same collation property. Tracking the collation as a separate property of a pathkey is thus a useless activity. If this conclusion isn't correct, please explain why not. The collation is a property of the operators/functions and not of the values. An individual value does not have a collation, a column does. A pathkey represents a sort order, right? To define a sort order you need a collation and so the path key is the natural place to put it. Two path keys that differ only by the collation are *not* compatable, but obviously can be converted by a sort node. After the plan phase, the collation will be ignored in 99% of the executer, except for operators like =, and that need to look at it. Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle signature.asc Description: Digital signature
Re: [HACKERS] Theory of operation of collation patch
Martijn van Oosterhout klep...@svana.org writes: On Mon, Mar 07, 2011 at 11:43:20AM -0500, Tom Lane wrote: Is there any documentation of $SUBJECT? The collation is a property of the operators/functions and not of the values. An individual value does not have a collation, a column does. OK. A pathkey represents a sort order, right? To define a sort order you need a collation and so the path key is the natural place to put it. Only if the expression-to-be-sorted does not already fully specify the collation, which so far as I can tell (either from the code or your description above) it does. I think that the explicit representation of collation as part of the PathKey node is unnecessary, inefficient, and bug-inducing --- the latter because it promotes fuzzy thinking about where the collation information is coming from. (And this isn't just hypothetical: IMO the bugs I exhibited upthread are *directly* due to fuzzy thinking about what defines an index's sort order.) Or, to put it another way: the properties that define a sort order are the sort comparison operator, the collation, the ASC/DESC bit, and the NULLS FIRST/LAST bit. Given the way that the SQL committee has constructed the language, the operator and the two flag bits are attached to the ORDER BY clause, but the collation is a property of the expression-to-be-sorted. If we fail to preserve that distinction in the internal representation, we're just creating problems for ourselves. I'm willing to take a pass at fixing this code during the alpha cycle, but I want to be sure I understand it correctly first. So, if there's a hole in my thinking, please point it out. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers