Hi all,

> According to
> http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
> wchar_t on GNU systems is 4 bytes by default. The internal representation of
> multibyte strings always uses fixed widths, or something like x[3] wouldn't
> work (without scanning the string). So in case x in the above example is a
> wchar_t, you overflow the buffer nicely ;) .
As I see it, this is a completely different approach to the whole situation from the one the current UTF-8 hack patchset uses. The current UTF-8 patchset still _thinks_ in _bytes_, but tries to display them correctly using UTF-8 or whatever the current locale is. Using wchar_t all over the source gives me the feeling that this approach wants mc to _think_ in _characters_. I'm not at all sure that this is the right way to go for a file manager and text editor.

Unix philosophy says filenames are sequences of bytes (as opposed to Windows, which says filenames are sequences of characters). Whenever you use a multibyte locale, you might face filenames that are not valid according to this locale. They are still valid filenames on the system; they just cannot be displayed with your current locale, though they might be fine in another one. I expect a file manager to handle these kinds of files without a problem. Hence filenames should be handled as byte sequences, and mc should try its best to display each filename as well as possible; but even if it cannot display one correctly and needs to use some question marks, it should still be perfectly able to remove, rename, or edit that file, invoke an external command on it, etc. Typing a command and using Esc+Enter to put the filename on the command line should also work. So a filename should be converted from the original byte sequence to anything else for display purposes only, and stored as the original byte sequence inside mc's memory.

Similar things happen with file editing. Suppose I receive a large English text file, find a typo and want to fix it. I do it in mcedit and then save the file. I didn't even realize that the file also contained some French words encoded in Latin-1, while my whole system is set to UTF-8. mcedit must save the file leaving the original Latin-1 accents unchanged, even though they are not valid UTF-8.
It's definitely a bug if these characters disappeared from the file, or if mc failed to handle them in any other way. Actually, will mcedit be able to edit UTF-8 encoded files inside a Latin-1 terminal? Or edit Latin-1 files inside a UTF-8 terminal? Will mc be able to assume UTF-8 filenames while the terminal is Latin-1?

I recommend that everyone take a look at the 'joe' text editor, version 3.1 or 3.2, to see how it handles charsets. I don't mean the implementation, only the user-visible behavior of the software. IMHO this is the way things have to work. joe treats the file being edited as a byte stream, always. It learns the behavior of the terminal from the locale settings; this is not overridable in joe, which is a perfect decision (as opposed to vim), since this is exactly what the locale environment variables are for. The encoding assumed for a file defaults to the current locale, but you can easily change it at any time by pressing ^T E. Changing this assumed character set does not change anything in the file; it only changes how the file is displayed on the screen, what bytes a keypress inserts, how many bytes a backspace, delete, or overtype removes, and so on. Byte sequences that are invalid in the selected charset are displayed with some special symbol, perhaps in a special color. This approach guarantees that joe can edit files of arbitrary encodings on arbitrary terminals, and at the same time it remains binary safe and keeps the byte sequence unchanged even when it is not valid according to the assumed character set.

As a counterexample, take a look at Gnome and KDE, especially KDE and its bugzilla, to see how many bug reports they have about accented filenames. The whole KDE system thinks of filenames as sequences of human-readable characters, and hence it usually fails to handle out-of-locale filenames.
Just think how many complaints and bug reports you would receive when someone uses a modern Linux system with its default UTF-8 locale, recursively downloads some stuff from an ftp server, and then blames mc-4.7.0 because it cannot cope with these filenames (whoops, they're in Latin-1): cannot access, delete, rename them, etc. These users would have to drop to the shell to rename them properly, which means that mc fails to perform one of its most basic jobs. I hope this won't happen.

So while the "thinking in characters" approach is better for most desktop applications, I'm pretty sure that for file managers like mc and text editors like mcedit, "thinking in bytes" is the right way to go, converting the byte stream solely for display purposes.

-- 
Egmont

_______________________________________________
Mc-devel mailing list
http://mail.gnome.org/mailman/listinfo/mc-devel