John Cowan responded:

> > Storage of UNIX filenames on Windows databases, for example,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. "Windows databases" is a misnomer to start with.
There are some databases, like Access, that are Windows-only applications, but most serious SQL databases in production (DB2, Oracle, Sybase ASE and ASA, and so on) are cross-platform from the get-go, and have their *own* rules for what can and cannot legitimately be stored in data fields, independent of what platform you are running them on. A Sybase ASE database has the same behavior running on Windows as running on Sun Solaris or Linux, for that matter.

> > can be done with BINARY fields, which correctly capture the identity of them as what they are: an unconvertible array of byte values, not a convertible string in some particular code page.
>
> This solution, however, is overkill,

Actually, I don't think it is. One of the serious classes of fundamental errors that database administrators and database programmers run into when creating global applications is ignoring or misconstruing character set issues. In a database, if I define the database (or table or field) as containing UTF-8 data, it damn well better have UTF-8 data in it, or I'm just asking for index corruptions, data corruptions or worse -- and calls from unhappy customers.

When database programmers "lie" to the database about character sets -- by setting a character set to Latin-1, say, and then pumping in data which is actually UTF-8, expecting it to come back out unchanged with no problems -- they are skating on very thin ice ... which usually tends to break right in the middle of some critical application during a holiday while your customer service desk is also down. ;-)

Such "lying to the database" is generally the tactic of first resort for "fixing" global applications when they start having to deal with mixed Japanese/European/UTF-8 data on networks, but it is clearly a hack that substitutes for understanding and dealing with the character set architecture and interoperability problems of putting such applications together.

UNIX filenames are just one instance of this. The first mistake is to network things together in ways that create a technical mismatch between what the users of the localized systems think the filenames mean and what somebody on the other end of such a system may end up interpreting the bag o' bytes to mean. The application should be constructed in such a way that the locale/charset state can be preserved on connection, with the "filename" interpreted in terms of characters in the realm that needs to deal with it that way, and restored to its bag o' bytes at the point that needs it that way. If you can't do that reliably with a "raw" UNIX set of applications, c'est la vie -- you should be building more sophisticated multi-tiered applications on top of your UNIX layer, applications which *can* track and properly handle locale and character set identities.

Failing that, BINARY fields *are* the appropriate way to deal with arbitrary arrays of bytes that cannot be interpreted as characters. Trying to pump them into UTF-8 text data fields and processing them as such when they *aren't* UTF-8 text data is lying to the database and basically forfeiting your warranty that the database will do reasonable things with that data. It's as stupid as trying to store date or numeric types in text data fields without first converting them to formatted strings of text data.
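To make that concrete, here is a rough sketch of the distinction, using Python with SQLite purely as a stand-in for a real production database (the table name, column names, and sample filenames are invented for illustration). The raw filename bytes always go into a BINARY/BLOB column; a text column is populated only when the bytes actually validate as UTF-8, instead of "lying" by decoding them under some assumed code page.

    import sqlite3

    # Toy stand-in for a real database: a BLOB column holds the raw
    # filename bytes, and a TEXT column is filled in only when those
    # bytes really are well-formed UTF-8.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE files (name_bytes BLOB PRIMARY KEY, name_utf8 TEXT)")

    def store_filename(raw: bytes) -> None:
        """Always store the raw bytes; store a decoded string only if honest."""
        try:
            decoded = raw.decode("utf-8", errors="strict")  # validate, don't guess
        except UnicodeDecodeError:
            decoded = None    # not UTF-8: keep it as an opaque bag o' bytes
        con.execute("INSERT INTO files VALUES (?, ?)", (raw, decoded))

    store_filename("métro.txt".encode("utf-8"))   # well-formed UTF-8 name
    store_filename(b"caf\xe9.txt")                # Latin-1 bytes, NOT valid UTF-8

    for name_bytes, name_utf8 in con.execute("SELECT * FROM files"):
        print(repr(name_bytes), "->", name_utf8)

The point of the sketch is only that the validation happens at the boundary: the database is never told that something is UTF-8 text unless it demonstrably is.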
> in the same way that it would be overkill to encode all 8-bit strings in XML using Base-64 just because some of them may contain control characters that are illegal in well-formed XML.

Dunno about the XML issue here -- you're the expert on what the expected level of illegality in usage is there. But for real database applications, there are usually mountains and mountains of stuff going on, most of it completely orthogonal to something as conceptually straightforward as maintaining the correct interpretation of a UNIX filename. It isn't really overkill, in my opinion, to design the appropriate tables and metadata needed for ensuring that your filename handling doesn't blow up somewhere because you've tried to do an UPDATE on a UTF-8 data field with some random bag o' bytes that won't validate as UTF-8 data.

> > In my opinion, trying to do that with a set of encoded characters (these 128 or something else) is *less* likely to solve the problem than using some visible markup convention instead.
>
> The trouble with the visible markup, or even the PUA, is that "well-formed filenames", those which are interpretable as UTF-8 text, must also be encoded so as to be sure any markup or PUA that naturally appears in the filename is escaped properly. This is essentially the Quoted-Printable encoding, which is quite rightly known to those stuck with it as "Quoted-Unprintable".

I wasn't actually suggesting that Quoted-Printable (which was, indeed, the model I had in mind) would be an appropriate solution to UNIX filename handling. It is actually more appropriate for the corrupted document issue, but as you note, even there, it basically just leaves you with a visibly readable corruption -- a corruption nonetheless. I don't think that having visible markup (or any other scheme for ostensibly carrying around "correct" corrupt data) is a substitute for fixing the application architecture and data conversion points to eliminate the corruptions in the first place.

> > Simply encoding 128 characters in the Unicode Standard ostensibly to serve this purpose is no guarantee whatsoever that anyone would actually implement and support them in the universal way you envision, any more than they might a "=93", "=94" convention.
>
> Why not, when it's so easy to do so? And they'd be *there*, reserved, unassignable for actual character encoding.
>
> Plane E would be a plausible location.

The point I'm making is that *whatever* you do, you are still asking implementers to obey some convention on conversion failures for corrupt, uninterpretable character data. My assessment is that you'd have no better success at making this work universally well with some set of 128 magic-bullet corruption pills on Plane 14 than you have with the existing Quoted-Unprintable as a convention.

Further, since it turns out that Lars is actually asking for "standardizing" corrupt UTF-8 -- a notion that isn't going to fly even two feet -- I think the whole idea is going to be a complete non-starter.
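For anyone who hasn't had the pleasure, here is what that "=93", "=94" convention actually looks like on the wire -- a minimal sketch using Python's quopri module, with a made-up filename containing two CP-1252 "smart quote" bytes that are not valid UTF-8 on their own:

    import quopri

    # A made-up filename containing CP-1252 "smart quote" bytes (0x93, 0x94).
    raw = b"report \x93final\x94.txt"

    encoded = quopri.encodestring(raw)   # Quoted-Printable escaping
    print(encoded)                       # b'report =93final=94.txt'

    # The round trip is lossless -- the original bag o' bytes comes back --
    # but what travels in between is only a visibly readable corruption.
    assert quopri.decodestring(encoded) == raw

Whether the escape hatch lives down in ASCII as "=93" or up on Plane 14, somebody still has to agree to apply the convention at every conversion boundary, and that is exactly the part that never happens universally.

--Ken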