Hi, I'm very glad to hear that there's going to be some work on it!
> My work leader is author of the original utf-8 patch for mc.

Oh my God... Who is he? (Just out of curiosity.) There were lots of names: Jakub Jelinek, Vladimir Nadvornik, Jindrich Novy - AFAIK they're all involved in the current UTF-8 patches.

> I have read some old posts about this theme and the source code of mc, too.

I recommend reading the following two threads, especially because I wrote my detailed opinion in both and I don't want to re-type it :-))) so imagine it's #included here.

"Proposal for simplification" (2005 Sep-Oct) is (amongst others) about possibly dropping support for one of ncurses and slang. If, after some investigation, you think that dropping one of these would make your work much easier, then probably this is the way to go.

"Request for discussion - how to make MC unicode capable" (2007 Feb-Mar) contains lots of useful ideas.

Please also see my UTF-8 related patches at https://svn.uhulinux.hu/packages/dev/mc/patches/

The most important goal, I think, is to get the work accepted in mainstream mc. This means we need clean code that is well-designed, modularized, easy to understand, easy to verify, and easy to modify/improve/fix. And of course that works correctly :)

I think the first step should be to decide which scripts to support (or plan future support for). This should probably include testing what commonly used terminal emulators do under certain circumstances. Here's what I mean:

- Handling single-width characters is trivial.

- Handling double-width (CJK) characters shouldn't be hard, but some tricky questions arise. E.g. what to do if only the left half of a double-width character is visible in the rightmost column while editing a file? What if word-wrap mode is on and it should continue in the next row? I don't think terminal emulators support wrapping CJK characters halfway (and it would make no sense actually), so probably some special symbol (e.g. "»") should be displayed at the end of the line if word wrapping is off, and if word wrapping is enabled then probably the whole character should be wrapped to the next line. What if some CJK text (maybe a filename) needs to be printed in a smaller box, probably with a ~ in the middle, probably by cutting at its end...? I guess you'll need several helper functions similar to (but more complex than) my "00-70-utf8-common.patch".

- What to do with zero-width characters, including combining characters? Very few terminal emulators support them correctly (e.g. plain old xterm). Should we address supporting them on these terminals? (Or at least design mc now so that support can be added easily later, without a complete rewrite?)

- What to do about BIDI issues (right-to-left writing)? I don't know if there are terminal emulators out there at all that support RTL. But maybe mc could reverse these strings on its own and send them out without LTR or RTL marks so that eventually the user sees them correctly. Needless to say, this would make editing a line or a file much trickier. Maybe you should study whether emacs/vim support BIDI...

- How much support do ncurses or slang give to make these complicated things easier?

The current version of mc with the utf8 patches works well with single-width characters, but behaves quite badly with CJK. In my experience so far, most terminal emulators and applications handle double-width correctly, but the other issues (zero-width, combining, bidi) still suffer from plenty of bugs. So to me it seems a wise decision to address single- and double-width characters, but not yet support the other tricks. (Of course by "not supporting" them I mean that mc still does something reasonable in these cases, e.g. prints the Unicode value within <> signs or similar. It's not acceptable if the screen gets completely damaged or something out of mc's control happens.)
Some more random ideas you might find useful:

There's a library called "gnulib". I have absolutely no info on it, except that I once sent a bug report to the findutils folks that case-insensitive UTF-8 matching didn't work, and later they reported they were able to fix it thanks to an upgraded gnulib. MC with the utf8 patches also suffers from such a problem: case-insensitive search in the viewer only works for non-accented letters. Probably gnulib provides a nice function that could solve it.

In order to be able to view or edit half-text half-binary files and fully work on them, you'll need string searching and regexp matching functions that tolerate invalid byte sequences, but still find matches within the valid parts. Maybe you should take a look at whether there's an already existing solution that you can use. (Maybe glibc's regex stuff, maybe pcre... I don't know whether these work correctly on mixed text/binary strings.)

Currently mc with the utf8 patches has a nasty bug: if a filename is invalid utf8 and you copy it with F5, the newly created filename will contain literal question marks. My guess is that the shell pattern matching (a "*" by default for the "source mask") might work incorrectly when invalid UTF-8 is seen. One possible way to solve it is to use the encoding called UTF-8b, see http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html , option "D". However, as some conversion is needed, this encoding is only suitable for relatively short strings such as filenames, not for file contents. And if you have the functions I outlined in the previous paragraphs then they may handle this case correctly.

Good luck!

-- 
Egmont

_______________________________________________
Mc-devel mailing list
http://mail.gnome.org/mailman/listinfo/mc-devel