Re: [HACKERS] Open issues for collations

Robert Haas Tue, 05 Apr 2011 07:14:34 -0700

Reading through this thread...

On Sat, Mar 26, 2011 at 12:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> ** Selecting a field from a record-returning function's output.
> Currently, we'll use the field's declared collation; except that
> if the field has default collation, we'll replace that with the common
> collation of the function's inputs, if any.  Is either part of that
> sane?  Do we need to make this work for functions invoked with other
> syntax than a plain function call, eg operator or cast syntax?


There were a couple of different ideas about which way we ought to go
with this, but I'm happy to defer to what Tom and Martijn hashed out:

MO> That seems all a bit weird. I spent some time reading through the SQL
MO> spec to see if I could came up with a few ideas about what they thought
MO> relevent. I think the gist of it is that I think the result row should
MO> have for each column its declared collation in all cases.

TL> That interpretation would be fine with me.  It would let us get rid of
TL> the special-case code at lines 307-324 of parse_collate.c, which I put
TL> in only because there are cases in the collate.linux.utf8.sql regression
TL> test that fail without it.  But I'm perfectly happy to conclude that
TL> those test cases are mistaken.

I'm not sure whether that's been done, though, or whether we're even
going to do it.

> ** What to do with domains whose declaration includes a COLLATE clause?
> Currently, we'll impute that collation to the result of a cast to the
> domain type --- even if the cast's input expression includes an
> explicit COLLATE clause.  It's not clear that that's per spec.  If it
> is correct, should we behave similarly for functions that are declared
> to return a domain type?  Should it matter if the cast-to-domain is
> explicit or implicit?  Perhaps it'd be best if domain collations only
> mattered for columns declared with that domain type.  Then we'd have
> a general rule that collations only come into play in an expression
> as a result of (a) the declared type of a column reference or (b)
> an explicit COLLATE clause.

I think we had agreement than a cast to a domain type with a collation
should stomp on any existing collation on the contained expression.

> * In plpgsql, is it OK for declared local variables to inherit the
> function's input collation?  Should we provide a COLLATE option in
> variable declarations to let that be overridden?

I think everyone who responded said "yes" to both questions.

> * RI triggers should insert COLLATE clauses in generated queries to
> satisfy SQL2008 9.13 SR 4a, which says that RI comparisons use the
> referenced column's collation.  Right now you may get either table's
> collation depending on which query type is involved.  I think an obvious
> failure may not be possible so long as equality means the same thing in
> all collations, but it's definitely possible that the planner might
> decide it can't use the referenced column's unique index, which would
> suck for performance.  (Note: this rule seems to prove that the
> committee assumes equality can mean different things in different
> collations, else they'd not have felt the need to specify.)

I tested this and, indeed, if the collations don't match, the index
can't be used.

CREATE TABLE me (x character varying COLLATE "en_US");
CREATE TABLE me2 (x character varying COLLATE "es_ES");
CREATE TABLE you (x character varying COLLATE "es_ES" NOT NULL,
PRIMARY KEY (x));
ALTER TABLE me ADD FOREIGN KEY (x) REFERENCES you(x);
ALTER TABLE me2 ADD FOREIGN KEY (x) REFERENCES you(x);
SET enable_seqscan=false;
SET enable_hashjoin=false;

With that setup, this still does a seqscan-and-sort:

EXPLAIN select * from me, you where me.x = you.x;

But this uses the index:

EXPLAIN select * from me2, you where me2.x = you.x;

I found another problem, too:

rhaas=# insert into you values ('1');
INSERT 0 1
rhaas=# insert into me values ('1');
INSERT 0 1
rhaas=# alter table me alter column x set data type varchar collate "en_GB";
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.
CONTEXT:  SQL statement "SELECT fk."x" FROM ONLY "public"."me" fk LEFT
OUTER JOIN ONLY "public"."you" pk ON ( pk."x"::pg_catalog.text
OPERATOR(pg_catalog.=) fk."x"::pg_catalog.text) WHERE pk."x" IS NULL
AND (fk."x" IS NOT NULL)"

> * It'd sure be nice if we had some nontrivial test cases that work in
> encodings besides UTF8.  I'm still bothered that the committed patch
> failed to cover single-byte-encoding cases in upper/lower/initcap.

Seems like no one knows how to do this.

> * Remove initdb's warning about useless locales?  Seems like pointless
> noise, or at least something that can be relegated to debug mode.

Everyone was in favor of this.

> * Is it worth adding a cares-about-collation flag to pg_proc?  Probably
> too late to be worrying about such refinements for 9.1.

Probably too late for this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Open issues for collations

Reply via email to