On Tuesday, 5 June 2012 16.48.58, João Abecasis wrote:
> Thiago Macieira wrote:
> > On Tuesday, 5 June 2012 13.20.36, João Abecasis wrote:
> >> I would not go as far as saying that the concept is broken, but it
> >> definitely isn't portable. One big issue is that nowadays we don't
> >> use an 8-bit encoding on Windows and those functions assume QString
> >> <-> QByteArray conversions.
> >
> > I called it broken due to the ambiguity when it comes to interpreting
> > data that may or may not contain file names, like command-line
> > arguments. See the next paragraph of what I had written.
> >
> > Codecs must never be used in a situation where there's an ambiguity on
> > what the encoding is. That will lead to mojibake sooner or later.
>
> That's the whole point, you shouldn't treat file names differently when
> processing the command line or configuration files. Those should be
> interpreted using their specific rules: command line goes according to
> locale, configuration goes according to established convention (say,
> UTF-8).

So you're asking that filenames be passed in the locale encoding (say, UTF-8)
on the command-line, regardless of what the filesystem encoding is?

That is a possible solution, but it requires that all tools be taught to do
that, including the ones that currently couldn't care less about encodings.
Many shell tools assume that the file name can be copied verbatim from the
directory entry (from readdir(3)) to the screen.

I don't see that happening any more than I see any of the other solutions.
The only pragmatic solution is to enforce filesystem encoding == locale
encoding.

In fact, there is one more possible solution which stands a chance: forcing
the problem onto the kernel. Make the entire userspace API be UTF-8 and have
the kernel recode to the filesystem encoding as necessary. The problem with
this solution is that it a) will suffer extreme resistance from kernel
developers and other people who think of file names as "binary data" instead
of human-readable text; and b) is no different from the other solution of
enforcing the encoding.

[snip]
>     QtGit commit -m $'R\xc3\xa9sum\xc3\xa9' 'R%E9sum%E9.txt'
>
> Bash enables you to pass around byte sequences it doesn't understand.
> That would help in getting UTF-8 into the application but would not help
> pass in Latin1, as the application would (correctly) flag it as an
> invalid UTF-8 sequence and replace 'é' with '?'. Instead, we need to
> escape the file name in a way that will not be misinterpreted by bash or
> command line processing, giving code that actually expects a path (say,
> QFile, encodeName or QFileSystemEntry) a chance to interpret it.
>
> For this to work, command line parsing need not understand more than
> UTF-8, but encodeName would then be able to convert those \xHH sequences
> to the proper Latin1-encoded "Résumé" and retrieve the right file from
> the file system.

You used %HH instead of \xHH in your example, which is the URL encoding and
for which there are well-defined rules. I'd prefer that.
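
As an illustration of the %HH form, here is a minimal sketch of how an escaped argument such as 'R%E9sum%E9.txt' could be mapped to raw file-name bytes and back. It is written in Python with only the standard library, purely for brevity. It is not Qt API and not what QFile::encodeName does today; the helper names decode_name_arg and encode_name_arg are hypothetical.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def decode_name_arg(arg: str) -> bytes:
    """Hypothetical helper: turn a %HH-escaped argument into raw file-name bytes."""
    return unquote_to_bytes(arg)


def encode_name_arg(raw: bytes) -> str:
    """Hypothetical helper: escape raw file-name bytes so the result is plain ASCII."""
    return quote_from_bytes(raw, safe="/")


raw = decode_name_arg("R%E9sum%E9.txt")
print(raw)                    # the raw bytes, independent of the locale encoding
print(raw.decode("latin-1"))  # readable only because we know the bytes are Latin-1

# A literal percent sign in a file name has to be escaped itself, as %25.
print(decode_name_arg("100%25.txt"))
```

The escaped argument is pure ASCII, so it passes unchanged through any ASCII-compatible locale codec, and only the code that expects a path turns it back into bytes. Decoding those same bytes strictly as UTF-8 fails, which is the situation the quoted text describes for Latin-1 input on a UTF-8 system.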

> We're talking about the edge cases, right? 99% of uses will work with or
> without the need for encode/decodeName.
>
> Currently, we don't support the edge cases at all. If we were able to
> support the edge cases but not be able to interoperate with other
> applications on those cases it would already be an improvement.

What we don't currently support are file names that cannot be decoded by the
locale codec, such as latin1("Résumé.txt") on a UTF-8 system. If that's what
you mean by edge cases, then that's what I meant.

What I cannot agree with is sacrificing the normal case for the edge case. It
might be possible to fix the problem of the edge case above (a restricted
issue) with little fall-out, but I don't see a way of solving the generic
problem of filesystem encoding != locale encoding.

> > The way I see it, the only way that this could work is if two
> > conditions were met:
> >
> > 1) there is a cross-toolkit, lossless and encoding-independent
> > representation of a filepath
>
> This would be nice, agreed.

I'd say it's mandatory.

If we only try to solve the restricted problem of filenames outside the
filesystem encoding (which is equal to the locale encoding), there's a
possible solution similar to Qt3's.

If we want to solve the generic problem, then we must always transmit data in
a specific encoding between applications. That requires, in turn, that all
applications be modified to understand that for the common case (not just the
edge case).

> > [Desktop Entry]
> > Encoding=UTF-8
> > # the third argument to git commit is a URI reference for a filename
> > # called "Résumé.txt" encoded in Latin 1
> > Exec=git commit -m Résumé R%E9sum%E9.txt
> >
> > without modifications, git would not understand that representation
> > and would not find the file on disk. Changing *every* *single*
> > application under the sun to use a different representation for file
> > paths than what they do today is not feasible.
>
> What a lot of applications will do is treat byte-sequences (strings)
> agnostically and not validate them as valid UTF-8 sequences, allowing
> you to still access files with those "funny" names.

Only if the calling application or the user wrote those "funny names"
literally, in the format that the filesystem OS functions expect. That is not
the case above, where we used a different, specific encoding.

> While I agree with a lot of what you are saying about what is reasonable
> and feasible for encodings of file names (heck, everyone should be using
> UTF-8 or shot on the spot!), Qt as a toolkit should not lock you out of
> files that mistakenly or not ended up with those "funny" names. One
> consequence of using locale to decide how to encode file names is that
> it is all too easy to come across those files (e-mail, file sharing, USB
> sticks and whatnot).
>
> Again, as a general purpose toolkit Qt needs to allow you to read,
> rename and delete those files. It should potentially allow you to store
> their names and come back to them at a later time.

Again, we're mixing the general problem with the edge case. There is a
possible solution for the edge case, implementable without sacrificing the
normal case, and which would cause little fall out in terms of
interoperability.

But I really, really do not see how we can solve the generic case of allowing
the user to change the filesystem encoding at will. And by user here, I mean
both the developer using Qt by calling QFile::setEncodingFunction as well as
the end user toggling some configuration switches in the system.

To solve the edge case, we need to somehow store in a regular QString a byte
sequence that can be converted back to its original 8-bit form, regardless of
the locale encoding being used. If we say that changing the codecs themselves
in QTextCodec and QString is out of the question (that was the Qt 3
solution), then the only place remaining is QFile::encodeName and
QFile::decodeName.
By necessity, they must do more than just QString::{to,from}Local8Bit.
From there, we come to the conclusion that the QString representing such a
file name must contain special processing instructions (e.g., one or more
special characters).

One form of special processing instruction is escaping each character, like
URLs do. The problem with the approach of escaping is what to do when the
escape character occurs in a file name. If that is a possibility, the escape
character needs to be escaped by itself (like "\\" for backslashes in C or
"%25" for percents in URLs). If we use this approach, then we will not
interoperate properly with non-Qt applications when this character occurs.

The only sane solution, then, is to use a character that has a very small
chance of ever being used or, better yet, a zero chance (I don't think
there's any). If that happens, then this character will be close to
"untypeable" on the terminal. Not a big loss, I'd say.

In fact, I'd recommend that, instead of escaping each bad character, we
escape each path component (the escaped sequence ends at the next slash).
That is, suppose I am a Greek user and I unpacked a bad .zip file on the "My
Documents" folder, which is called:

	/home/foo/έγγραφα

If my file was called "Résumé.txt" in Latin1, the QString representing such a
file name would be:

	/home/foo/έγγραφα/<escape>Résumé.txt

If it was named "βιογραφικό σημείωμα.txt" in ISO-8859-7, the QString
representation would be:

	/home/foo/έγγραφα/<escape>âéïãñáöéêü óçìåßùìá.txt

That has the drawback of being hard to use when it comes to path
manipulation. Appending, prepending, extracting or inserting text could have
unexpected consequences.

An intermediate option would be to escape each sequence of non-locale
characters. Instead of the escaping ending at the slash, an unescape
character is necessary.
For simplicity, let's say ⟪ shifts and ⟫ unshifts, the file could be:

	/home/foo/έγγραφα/⟪biocqavij|⟫ ⟪sgle_yla⟫.txt

Pros:
 - implementable, with little fall-out for interoperability if we choose the
   escape character well
 - probably survives a round-trip through the locale codec, so the escaping
   isn't lost
 - since it survives the round-trip, it can be used across QProcess, on the
   command-line, etc.
 - survives the user too, since it can be copy & pasted, edited, provided
   that the escape characters remain

Limitations:
 a) Qt-only, I don't expect anyone else to use such file names
 b) if encodeName() isn't used properly, it leads to a bad encoding of the
    file name onto 8-bit. Applications dealing with the filesystem need to be
    extra careful so as to not show two representations of the same file.
 c) for that matter, it's possible to produce an escaped form that matches a
    regular file name
 d) double representations are often a source of security issues if not
    handled carefully (cf. overlong sequences in UTF-8)

As you can see, I didn't come up with this today. I've known these
alternatives for years. I don't think they're worth our time.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
     Intel Sweden AB - Registration Number: 556189-6027
     Knarrarnäsgatan 15, 164 40 Kista, Stockholm, Sweden
_______________________________________________
Development mailing list
Development@qt-project.org
http://lists.qt-project.org/mailman/listinfo/development