Re: [GENERAL] DB encoding, locale and indexes

Sterfield Thu, 05 Feb 2015 07:21:08 -0800

2015-02-05 15:56 GMT+01:00 Tom Lane <t...@sss.pgh.pa.us>:

> Sterfield <sterfi...@gmail.com> writes:
> > I'm a sysadmin working for an application that stores all its data in a
> PG
> > database.
> > Currently, the cluster has its encoding set to UTF-8, and the locale
> (both
> > LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'.
>
> > I discovered recently that the indexes created on varchar fields are not
> > working for LIKE operator, as they are created without the correct class
> > operator (as stated in
> > http://www.postgresql.org/docs/9.2/static/indexes-types.html).
>
> Right, because en_US.UTF-8 uses dictionary sort order rather than plain
> byte-by-byte sort.
>
> > The most straight-forward solution would be to create a second index on
> the
> > same field but with the class operator, in order to have indexes both for
> > =, >, < operators AND LIKE / regexp operators. Few additional indexes,
> some
> > diskspace eaten, problem solved.
>
> Yup.
>
> > However, some people are saying that nothing has to change on the index,
> > and that the only thing we have to do is to change the LC_COLLATE of each
> > databases to 'C', in order for the indexes to work without the class
> > operator.
>
> Yes, that is another possible solution, and it's documented.  Keep in mind
> though that you can *not* just reach into pg_database and tweak those
> fields; if you did, all your indexes would be corrupt, because they'd no
> longer match the sort order the system is expecting.  The only safe way to
> get there would be to dump and reload into a new database set up this way.
> (If you wanted to live dangerously, I guess you could manually tweak the
> pg_database fields and then REINDEX every affected index ... but this
> still involves substantial downtime, and I would not recommend doing it
> without practicing on a test installation.)
>


Yeah, I'll not take the risk. For current databases, I'll probably create
manually new indexes. For new environment, I'll change the LC_COLLATE to
'C'. I've spent some time re-creating a test environment, using encoding to
UTF8, locale to 'en_US.UTF-8' except LC_COLLATE set to 'C'. Nothing special
to report, the index is working as expected for LIKE operators, and I have
correct answers if I'm doing a LIKE '<something>é%'.


> You also have to ask whether any of your applications are expecting ORDER
> BY some-text-field to produce dictionary order rather than ASCII order.
>

Indeed, the order of the results is not the same with a LC_COLLATE to
'en_US.UTF-8' or LC_COLLATE to 'C', but I highly doubt that the application
is taking advantage of having an index already sorted.


> >    - If I have unicode character stored in my database (for example 'é'),
> >    and the LC_COLLATE set to 'C', how the index will behave if I do a
> query
> >    with LIKE '<something>é%' ?
>
> It's still the same character, but it will sort in a possibly unexpected
> way.
>
>                         regards, tom lane
>

Many thanks for your help, guys, especially on this non-trivial subject (at
least, for me).

Cheers,

Re: [GENERAL] DB encoding, locale and indexes

Reply via email to