[HACKERS] Proposed COLLATE implementation

Martijn van Oosterhout Tue, 27 Dec 2005 05:30:21 -0800

Greetings all,

If you're not interested in COLLATE, operator classes or related
things, stop now, this is quite a long email.


Firstly, status. PostgreSQL doesn't really support collations at all.
The order of strings is defined at initdb time by the locale then and
cannot be changed later. We allow lists to be sorted in either
ascending or decending order but that's about it. Whatever order there
is is deduced from b-tree operator classes.

The purpose of this patch is to raise collations to (reasonably) first
class object. The idea is that you can define a collation across any
type and that you will then be able to ORDER BY, GROUP BY and INDEX
using that collation. A collation defines both order and equality.

The SQL standard does define COLLATE although they only apply that to
character strings. There are no predefined collations in the standard.
There are rules about how collations and collation states propegate
from the leaves of the parse tree all the way to the root. In its
simplest form, columns and constants have defined collations which
modify the behaviour of functions using these values. At any point in
the parse tree the user can override the collation with an explicit
<collate clause>. If there is ambiguity about what collation applies at
any point for a function that needs to know, this is a error.

All this is parse-time analysis.

Proposed Implementation:

NODES

To implement the above, two new node types are created: CollateClause
which represents the <collate clause> in SQL syntax, and CollateState
which represents the actual state at any node. Currently the only nodes
expected to require these are OpExpr, FuncExpr, Var and Const. Although
I guess it may apply to any node that can be used in an expression.

CATALOG CHANGES

To track collations requires a new table in the catalog, which I have
named pg_collations. It contains the following fields:

Oid oid;           -- OID for this collation
Name collname;     -- Name of the collation (for collate clause)
bool collasc;      -- Ascending or descending
Oid  collopclass;  -- Implementing Operator Class
int4 colltype;     -- Currently, 0=simple, 1=uses locale
Oid colllocale;    -- Locale in pg_locales

(Should we be identifying the type here? or is it ok to lookup the type
via the operator class).

The first few fields name the collation so it can be referred to by the
user. Then the collasc field determines how to use the operator class
as given in the collopclass field. If it indicates descending order, it
will invert the sense of the operator class. For example, asking for
the GT op for a reverse collation will actually return the LT operator
for the operator class.

The purpose of the colltype and colllocale fields are described further
down.

The important thing at this point is that by specifying a collation you
are also specifying an operator class. At the moment the ascending and
descending collations for each type are hard-coded for initdb. At the
moment they have been allocated OIDs starting at 2800, which is the
first large available block.

Each column of a table has a default collation, which defaults to the
default collation of the type but can be specified in the table
declaration. To store this requires an additional column in
pg_attribute (attcollate) which contains the OID of the collation for
that column. When it is referenced in a query, this collation is copied
to the CollateState node of the Var node, from whence it can affect the
query.

Finally, to allow the parser to complain about ambiguous CollateStates,
we need to indicate which functions actually need a senseble collate
state to function. To this end a single boolean field has been added to
pg_proc (proneedcollate). If this is true, the parser should error out
when the collation state is COLLATE_NONE.

INDEXES

Another place you will be allowed to use the collate clause is while
creating indexes. If you declare an index using a particular collation,
it can be used in queries that order by the same collation. Note, that
the collate clause indicates the operator class, so you can either
specify one or the other, but not both.

So each column of an index will also have a collation. However,
pg_attribute has already got an extra field to store the collation for
columns so it makes sense to store the collation here. In the process
the pg_index.indclass field becomes redundant as it can be inferred
from the pg_attribute rows associated with the index.

To make this work there also needs to be a notion of compatability
between collations. For example, two collations which are the reverse
of eachother are compatable in the sense that an index defined with one
collation would be usable for the other simply by scanning in reverse.

FUNCTIONS

In particular for string comparison but also possibly for user-defined
types, a function will need to know what collation it is operating
under. For this purpose an extra field (fn_collate) is added to
FmgrInfo which is filled in with the collation from the expression tree
(if any) or wherever relevent (eg. from the pg_attribute column when
doing statistics or creating indexes).

A PG_GETCOLLATE() macro is added to facilitate user-functions
retreiving this data. This function throws an error when no collation
has been defined. This shouldn't happen in practice as issues should
have been weeded out at parse-time.

This macro returns the OID of the collation but in many cases it will
not be necessary. In particular, functions should NOT invert their
result if the collation is inverted. It is considered the
responsibility of the caller to invert the result if necessary. The
reasons for this are:

1. In most cases that matter (order comparison) the issue can be dealt
with at parse time by the NEGATOR or COMMUTATOR options.
2. For index scans, we would just do a reverse scan instead (or forward
if the index is inverted)
3. Requiring every function to check the collation for inversion is
wasteful, since in many cases the case can be dealt with statically.

DEFAULT COLLATIONS

At this point I'm inclined to define a few collations to be built in or
specially handled:

COLLATE ASC    - default collation for type, ascending (ie, what we do now)
COLLATE DESC   - default collation for type, inverted
COLLATE POSIX  - For strings, define a simple bytewise string comparison.

Indeed, it is expected that by default, all columns involving strings
in the system catalog will always use COLLATE POSIX. Additionaly, type
"name" will always use that collation, even if the user changes the
default (by a method to be specified). This is straightforwardly done
at initdb time.

The purpose of COLLATE DESC is to simplify index declarations. Saying

CREATE INDEX foo ON bar( a COLLATE ASC, b COLLATE DESC );

would allow it to be used in a query using ORDER BY a, b DESC, without
the user having to lookup the name of the collation.

When it comes to naming collations, the question arises whether
ascending/descending collations need to have different names. Or
should there be two collations with the same name with ASC/DESC as a
modifier? Do collations have to be unique across different types; for
example, can varchar and text both have a collation "ignorecase"?

Another issue is that a column could be declared with a descending
collation by default. Say it was an integer column, then (a < 5) would
return FALSE for a = 1. While technically correct, I'm thinking of
ruling it out on the basis of being too confusing, and only allow
descending collations at query time or in index specifications.

Another strange point at the moment is how to determine the default
collation of a type. At the moment it is done by finding the default
operator class and looking up the ascending version of that. However,
we may want to provide the user a way of specifying it directly,
perhaps by:

ALTER TYPE text SET DEFAULT COLLATION ignorecase;

PATHKEYS

Currently during planning, pathkeys are indicated by an operator of the
operator class. Here we would simply replace that with the oid of the
collation, which can be matched directly with the collation defined by
the index.

USER DEFINED TYPES

None of this is interesting unless it can be applied to user-defined
types also. Fortunatly this is easy, when the user declares a b-tree
operator class, we can generate the collations automatically. We may
even allow the user to specify the name of the collation. However, if
the user wanted to create multiple collations based on the same
operator class (by using the PG_GETCOLLATE() macro above, we may want
to provide them a way of creating them directly.

COLLATIONS USING LOCALES

For strings, collation can be done in many different ways defined by
what is referred to as a locale. As indicated above in the definition
of pg_collations, there is a colltype field. For most collations this
will be 0 (simple collation). However, for strings the intention is to
use a type 1 (using locales). In this case the last column refers to
the OID of the locale, so you can many collations using the same
operator class, but different locale oids. On a system level it changes
nothing, but inside the functions implementing it, they should use
PG_GETLOCALE(). This will return an opaque pg_locale_t (see below)
handle which can then be used to implement the specifics.

In principle, user-defined types need to be able to use this also,
perhaps by using the clause COLLATE USING LOCALE in the operator class.
In theory there should a collation for each combination of
locale-dependant datatype, locale and order ascending/descending.
How/when these are created has not yet be determined.

MORE TYPES OF COLLATION

Another collation type I've speculated about but not thought about
implementing is a "mapping collation", in which you map the values
through a function and then collate that. The obvious example would be
a case-insensetive mapping where lower is applied before collation.

Implementation could be pretty much done by simply substituting the
functions into the parse tree. For example, if you defined something
like:

CREATE COLLATION ignorecase ON text USING lower($1) COLLATE defaulttext;

Then anytime you did a comparison with that collation, you would simply
insert those function calls into the parse-tree and then collate with
"defaulttext". When declaring an index you would just make it a
functional index. The rules for functional indexes should make it work
out-of-the-box.

OTHER TECHNICAL ISSUES

- Applying a COLLATE clause to an unknown literal causes it to be
coerced to the type that collation is based on. But what about if we
have something like COLLATE DESC?

- This requires some changes in the bootstrap procedures given that we
need to be able to do lookups on the operator class for each type
fairly early on. Some are predefined but it does require moving the
opclass setup further up the list. However, if we store a default
collation in pg_type, we wouldn't need to do that.

- Sorting arrays. Should they get their own collations, or should
they use the collations of their base types.

LOCALES

I've left this to the end because I don't want people distracted by
what is essentially a side-issue. While this would be needed to
implement COLLATE the way the SQL spec intended, it can actually be
implemented and dealt with as a seperate patch. The main reason a basic
implementation exists is that it provides a great way of finding places
that didn't define a collation, since any comparison involving "text"
requires one.

To deal with locales I created another table in the catalog,
pg_locales. This provides an OID which can be referenced from
elsewhere, such as the pg_collations table.

The design is intended to provide some pluggability, so locale
information can come from multiple sources. Also, each locale will be
referenced by an identifier which is unrelated to any external
identifier, so we're not bound by them.

The columns defined currently are:

Name    locname       - Identifier used by postgresql
Name    locsysname    - String identifying the locale for the locale provider
int4    locsource     - System providing this locale
int4    locencoding   - Encoding expected by provider

It is expected that the list of sources for locale data will be short,
probably hard-coded into the backend (currenty internal/system/icu).
The only locale defined at startup is POSIX, which is implemented
internally. The intention is for any other locales to be defined at the
end of initdb. The expected syntax is something like:

CREATE LOCALE hungarian AS 'hu_HU' USING glibc;

This should use the provider to check the locale exists and has a
conpatible encoding. If so it is entered into the table ready for use.

In the backend, there will be implementations of functions like
pg_strcoll_l, pg_localeconv_l, which work like the C system library
versions only they take an extra pg_locale_t argument. This is used to
dispatch the call to the right place. There will be a function to
quickly determine if a locale is C to shortcircuit complexity where it
is not needed.

STATUS

Implementation so far is available here:

http://svana.org/kleptog/temp/collate-current.patch.gz

This patch isn't "clean" and changes a few things that are not strictly
necessary. It won't finish initdb right now because it gets an error in
ANALYSE (the array issue above).

Feedback, help, comments: please reply.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

pgpFoMC9sLPdW.pgp
Description: PGP signature

[HACKERS] Proposed COLLATE implementation

Reply via email to