2015-02-05 15:56 GMT+01:00 Tom Lane <t...@sss.pgh.pa.us>: > Sterfield <sterfi...@gmail.com> writes: > > I'm a sysadmin working for an application that stores all its data in a > PG > > database. > > Currently, the cluster has its encoding set to UTF-8, and the locale > (both > > LC_COLLATE and LC_CTYPE) is set to 'en_US.UTF-8'. > > > I discovered recently that the indexes created on varchar fields are not > > working for LIKE operator, as they are created without the correct class > > operator (as stated in > > http://www.postgresql.org/docs/9.2/static/indexes-types.html). > > Right, because en_US.UTF-8 uses dictionary sort order rather than plain > byte-by-byte sort. > > > The most straight-forward solution would be to create a second index on > the > > same field but with the class operator, in order to have indexes both for > > =, >, < operators AND LIKE / regexp operators. Few additional indexes, > some > > diskspace eaten, problem solved. > > Yup. > > > However, some people are saying that nothing has to change on the index, > > and that the only thing we have to do is to change the LC_COLLATE of each > > databases to 'C', in order for the indexes to work without the class > > operator. > > Yes, that is another possible solution, and it's documented. Keep in mind > though that you can *not* just reach into pg_database and tweak those > fields; if you did, all your indexes would be corrupt, because they'd no > longer match the sort order the system is expecting. The only safe way to > get there would be to dump and reload into a new database set up this way. > (If you wanted to live dangerously, I guess you could manually tweak the > pg_database fields and then REINDEX every affected index ... but this > still involves substantial downtime, and I would not recommend doing it > without practicing on a test installation.) >
Yeah, I'll not take the risk. For current databases, I'll probably create manually new indexes. For new environment, I'll change the LC_COLLATE to 'C'. I've spent some time re-creating a test environment, using encoding to UTF8, locale to 'en_US.UTF-8' except LC_COLLATE set to 'C'. Nothing special to report, the index is working as expected for LIKE operators, and I have correct answers if I'm doing a LIKE '<something>é%'. > You also have to ask whether any of your applications are expecting ORDER > BY some-text-field to produce dictionary order rather than ASCII order. > Indeed, the order of the results is not the same with a LC_COLLATE to 'en_US.UTF-8' or LC_COLLATE to 'C', but I highly doubt that the application is taking advantage of having an index already sorted. > > - If I have unicode character stored in my database (for example 'é'), > > and the LC_COLLATE set to 'C', how the index will behave if I do a > query > > with LIKE '<something>é%' ? > > It's still the same character, but it will sort in a possibly unexpected > way. > > regards, tom lane > Many thanks for your help, guys, especially on this non-trivial subject (at least, for me). Cheers,