On Wed, Jan 22, 2020 at 10:00 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
> Hopefully, this addresses Robert’s concern upthread about the filesystem name
> not necessarily being in utf8 format, though I might be misunderstanding the
> exact thrust of his concern. I can think of other possible interpretations
> of his concern as he expressed it, so I’ll wait for him to clarify.

No, that's not it.

Suppose that Álvaro Herrera has some custom settings he likes to put on all the PostgreSQL clusters that he uses, so he creates a file álvaro.conf and uses an "include" directive in postgresql.conf to pull in those settings. If he also likes UTF-8, then the file name will be stored in the file system as a 12-byte value of which the first two bytes will be 0xc3 0xa1. In that case, everything will be fine, because JSON is supposed to always be UTF-8, and the file name is UTF-8, and it's all good.

But suppose he instead likes LATIN-1. Then the file name will be stored as an 11-byte value and the first byte will be 0xe1. The second byte, representing a lower-case 'l', will be 0x6c. But we can't put a byte sequence that goes 0xe1 0x6c into a JSON manifest stored as UTF-8, because that's not valid in UTF-8. UTF-8 requires that every lead byte in the range 0xc2-0xf4 be followed by one to three continuation bytes in the range 0x80-0xbf (the bytes 0xc0, 0xc1, and 0xf5-0xff are never valid at all), and our hypothetical file name that starts with 0xe1 0x6c does not meet that criterion.

Now, you might say "well, why don't we just do an encoding conversion?", but we can't. When the filesystem tells us what the file names are, it does not tell us what encoding the person who created those files had in mind. We don't know that they had *any* encoding in mind. IIUC, a file in the data directory can have a name that consists of any sequence of bytes whatsoever, so long as it doesn't contain prohibited bytes like a path separator or \0. But only some of those possible octet sequences can be stored in a manifest that has to be valid UTF-8.

The degree to which there is a practical problem here is limited by the fact that most file names within the data directory are chosen by the system, e.g. base/16384/16385, and those file names are only going to contain ASCII characters (i.e. code points 0-127), which are valid in UTF-8 and lots of other encodings.
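
As an illustration of the álvaro.conf example above, here is a minimal Python sketch. Python is only chosen for convenience, and the helper name is_valid_utf8 is hypothetical; none of this is taken from the backup manifest code. It encodes the example file name both ways, checks each byte string for UTF-8 validity, and shows that the LATIN-1 bytes decode without complaint under more than one single-byte encoding, which is why the bytes alone cannot tell us which conversion to apply.

```python
# Illustrative sketch only; the file name is the one from the example above.
name = "\u00e1lvaro.conf"  # "álvaro.conf" with a precomposed U+00E1

utf8_bytes = name.encode("utf-8")      # á becomes the two bytes 0xc3 0xa1
latin1_bytes = name.encode("latin-1")  # á becomes the single byte 0xe1


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(len(utf8_bytes), utf8_bytes[:2].hex())
print(len(latin1_bytes), latin1_bytes[:2].hex())
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))

# The same bytes are acceptable input to several single-byte encodings,
# and each one yields a different name, so there is nothing to "detect".
as_latin1 = latin1_bytes.decode("latin-1")
as_koi8r = latin1_bytes.decode("koi8_r")
print(as_latin1 == as_koi8r)
```

The point of the last few lines is that a successful decode proves nothing about what the file's creator intended; it only proves that the chosen codec assigns some character to every byte.
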
Moreover, most people who create additional files in the data directory will probably use ASCII characters for those as well, at least if they are from an English-speaking country, and if they're not, they're likely going to use UTF-8, and then they'll still be fine. But there is no rule that says people have to do that, and if somebody wants to use file names based around SJIS or whatever, the backup manifest functionality should not break for that reason.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company