The following is a real critical problem that we ran into here at TripAdvisor, 
but have yet figured out a clear way to mitigate.

TL;DR:
Streaming replicas—and by extension, base backups—can become dangerously broken 
when the source and target machines run slightly different versions of glibc.  
Particularly, differences in strcoll and strcoll_l leave "corrupt" indexes on 
the slave.  These indexes are sorted out of order with respect to the strcoll 
running on the slave.  Because postgres is unaware of the discrepancy is uses 
these "corrupt" indexes to perform merge joins; merges rely heavily on the 
assumption that the indexes are sorted and this causes all the results of the 
join past the first poison pill entry to not be returned.  Additionally, if the 
slave becomes master, the "corrupt" indexes will in cases be unable to enforce 
uniqueness, but quietly allow duplicate values.

Context:
We were doing a hardware upgrade on a large internal machine a couple months 
ago.  We followed a common procedure here: stand up a the new HA pair as 
streaming replica's of the old system; then failover to the new pair.  All 
systems involved were running 9.1.9 (though that is not relevant as we'll see), 
and built from source.

Immediately, after the failover we saw some weird cases with some small 
indexes.  We thought it was because the streaming replication failover had gone 
poorly (and because we weren't running latest version of postgres on that 
machine), so we rebuilt them and moved on.  Until last week when an important 
query stopped getting optimized as a hash join and turned into a merge join.  
From that query I generated a simple, single column join between two tables.  
That query returns 50 million rows with merge joins disabled and 0 rows with 
them enabled.  Rebuilding the index fixed the issue, but this was an important 
table and so we did some digging.

Using some query optimizer coercion, I was able to show that 1. the "corrupt" 
index had the same number of rows as the table, and 2. the index returned rows 
in a different, but nearly identical ordering to the one that you would receive 
by explicitly sorting the column.  Taking a pair of rows that were out of 
place, I manage to narrow the issue down.  Luckily, we able to find the old 
server sitting on the floor.  The simplest form of this issue is:

    SELECT 'M' > 'ஐ';

Root cause:
Depending on your charset the first character might look like an ascii 'M'.  It 
is not.  The two characters in question are the utf8 representations of 
http://www.fileformat.info/info/unicode/char/ff2d/index.htm and 
http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively.  
Across different machines, running the same version of postgres, and in 
databases with identical character encodings and collations ('en_US.UTF-8') 
that select will return different results if the version of glibc is different. 
 This holds whether one pg instance is a pg_basebackup of the other or if you 
run an initdb and then immediately start postgres.

Digging further lead me to: master:src/backend/utils/adt/varlena.c:1494,1497  
These are the lines where postgres calls strcoll_l and strcoll, in order to 
sort strings in a locale aware manner.

In the simplest case, the attached c file returns inconsistent results across 
glibc versions and environments.  It just sets the collation to 'en_US.UTF-8', 
and then compares two one character strings (the characters above).  Depending 
on the version of glibc you are running, you may see positive, negative or zero 
as the output.  I have observed:

Old Server (CentOS 5.8, kernel 2.6.18-308.24.1.el5):
0 -> glibc-2.5-81.el5_8.7
1 -> glibc-devel-2.5-81.el5_8.7 statically linked
0 -> Source builds of glibc (2.5, 2.6, 2.10)

New Server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.132.el6.x86_64

Dev server (CentOS 6.4, kernel 2.6.32-358.el6.x86_64)
-1 -> glibc-2.12-1.107.el6_4.5.x86_64
-1 -> Source build (2.12.2, 2.16, 2.18)
15 -> Source build (HEAD)

Laptop (Ubuntu, kernel 3.11.0-12-generic)
-1 -> 2.17-93ubuntu4
15 -> 2.17-93ubuntu4 statically linked

Mac OS (For comparison only)
62365 -> OSX 10.8, 10.9


From my digging, I have been unable to figure out why glibc is returning 
different results in different situations.  It is probably worth getting a 
discussion going on their mailing lists as well.

Regardless, the reality is that there are different versions of glibc out there 
in the wild, and they do not sort consistently across versions/environments.  
Streaming replica's rely on the assumption that the sort order within a 
collation is consistent across machines.  If they differ ever so slightly, then 
a single pair of rows with poison pill characters can yield an entire index 
invalid.  We were lucky that the first discrepancy was at the beginning of the 
index.  If it was 85% of the way through then we probably would never have 
noticed that merge joins were broken on that machine.

We still are discussing internally how we plan to prevent this in the future, 
but we'd like to open this up to the community for wider discussion.

- Matt K, TripAdvisor

Attachment: localetest.c
Description: localetest.c

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to