Yongwei Wu wrote:
I am now a frequent user of the `gq' commands, even with Chinese text
(I have `set formatoptions+=mM'). However, though Chinese does not use
space between characters, it is still bad layout to have characters
like `'"《' appear at the end of a line, or `'"》,,。、?!' at the
beginning of a line. I wonder whether there is an option for this
purpose, or is it possible to add one if it is not there already?

Best regards,

Yongwei

mmm... I suppose that the m and M flags in 'formatoptions' might be restricted to act only between "wide" characters, rather than on just "any" codepoint above U+00FF. However, that would not address the fact that it _is_ bad form to break a line before characters like (e.g.) a CJK ideographic comma or ideographic full stop, which _are_ wide characters. I'm not sure how to solve this. The descriptions of 'isident', 'iskeyword' and 'isprint' refer back to 'isfname', and there it is said that "characters above 255 are always included", so that's no help.
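To make the "wide" versus "any codepoint above U+00FF" distinction concrete, here is a minimal sketch in Python (not Vim's C source, and not how Vim decides this today). It assumes that "wide" means an East Asian Width property of W or F; the helper name is_wide is made up for illustration.

```python
import unicodedata

def is_wide(ch):
    """Assumption: 'wide' means East Asian Width W (wide) or F (fullwidth)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

# A Hebrew letter is above U+00FF but is not a wide character.
aleph = "\u05d0"
# An ideograph and the ideographic comma are both wide, so a width test
# alone cannot tell text from punctuation.
ideograph = "\u4e2d"          # 中
ideographic_comma = "\u3001"  # 、

for ch in (aleph, ideograph, ideographic_comma):
    print("U+%04X above U+00FF: %s, wide: %s" % (ord(ch), ord(ch) > 0xFF, is_wide(ch)))
```

The ideographic comma passes the width test just as an ideograph does, which is why restricting the flags to wide characters would not be enough on its own.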

I suppose the handling of CJK characters would deserve a distinction between "ideograms" and "punctuation", which is rather easy in principle, at least in UTF-8, since different Unicode blocks are allocated to them, as follows (some are all-CJK, others are C-only, J-only or K-only but gvim should of course handle all of them as correctly as possible):

1100-11FF Hangul Jamo
2E80-2EFF CJK radicals (main and abbreviated forms)
2F00-2FDF Kangxi radicals (214 main forms only)
2FF0-2FFF Ideographic description characters
3000-303F CJK symbols and punctuation
3040-309F Hiragana
30A0-30FF Katakana
3100-312F Bopomofo
3130-318F Hangul Compatibility Jamo
3190-319F Kanbun
31A0-31BF Bopomofo Extended
31C0-31EF CJK Strokes
31F0-31FF Katakana Phonetic Extensions for Ainu
3200-32FF Enclosed CJK Letters and Months
3300-33FF CJK Compatibility
3400-4DBF CJK Unified Ideographs Extension A
4DC0-4DFF Yijing Hexagram Symbols
4E00-9FBF CJK Unified Ideographs
A000-A48F Yi Syllables
A490-A4CF Yi Radicals
AC00-D7AF Hangul Syllables
F900-FAFF CJK Compatibility Ideographs
FE10-FE1F Vertical Presentation Forms for CJK Symbols
FF00-FFEF Halfwidth and Fullwidth Forms
20000-2A6DF CJK Unified Ideographs Extension B
2F800-2FA1F CJK Compatibility Ideographs Supplement

(I "think" I got them all but I'm not 100% sure.) Most of these are "text"; a few characters, especially in some blocks, are either "punctuation" or a third category which I would compare to (in Latin text) Roman numerals, card suit symbols embedded in a text about Bridge, chessmen symbols embedded in the commentary of a chess game, etc.; U+3000 is a "wide space"; and there are a few composing characters.
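As a rough illustration of the block table and the categories just described, the sketch below (Python, with made-up names CJK_BLOCKS, block_of and rough_kind) looks up the block of a character using the ranges listed above, and then uses the Unicode general category to guess whether it is text, punctuation, a space, a combining mark or some other symbol. The mapping from general category to these five kinds is an assumption for illustration, not an established rule.

```python
import unicodedata

# Block ranges copied from the list above (inclusive, as given there).
CJK_BLOCKS = [
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x2E80, 0x2EFF, "CJK radicals"),
    (0x2F00, 0x2FDF, "Kangxi radicals"),
    (0x2FF0, 0x2FFF, "Ideographic description characters"),
    (0x3000, 0x303F, "CJK symbols and punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x3130, 0x318F, "Hangul Compatibility Jamo"),
    (0x3190, 0x319F, "Kanbun"),
    (0x31A0, 0x31BF, "Bopomofo Extended"),
    (0x31C0, 0x31EF, "CJK Strokes"),
    (0x31F0, 0x31FF, "Katakana Phonetic Extensions"),
    (0x3200, 0x32FF, "Enclosed CJK Letters and Months"),
    (0x3300, 0x33FF, "CJK Compatibility"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4DC0, 0x4DFF, "Yijing Hexagram Symbols"),
    (0x4E00, 0x9FBF, "CJK Unified Ideographs"),
    (0xA000, 0xA48F, "Yi Syllables"),
    (0xA490, 0xA4CF, "Yi Radicals"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0xFE10, 0xFE1F, "Vertical Presentation Forms"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
]

def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for lo, hi, name in CJK_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None

def rough_kind(ch):
    """Assumed mapping from Unicode general category to a coarse kind."""
    cat = unicodedata.category(ch)
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Zs":
        return "space"
    if cat.startswith("M"):
        return "combining"
    if cat.startswith("L"):
        return "text"
    return "symbol"

for ch in ("\u4e2d", "\u3002", "\u300a", "\u3000", "\u4dc0"):
    print("U+%04X %-30s %s" % (ord(ch), block_of(ch), rough_kind(ch)))
```

The block alone is not enough: U+3000 (the wide space), U+3002 (ideographic full stop) and U+300A (left double angle bracket) all sit in the same block, and it is the general category that tells them apart.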

It oughtn't to be very hard to set apart those of the above which allow linebreaks on only one side, or on neither, instead of "allow line-breaking on both sides" which is the default for most wide characters.
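To show what "linebreaks on only one side" could mean in practice, here is a small Python sketch of a greedy wrapper for text written without spaces. Everything in it is an assumption for illustration: the two character sets are a guess at what Yongwei has in mind (opening quotes and brackets that must not end a line, closing quotes and punctuation that must not start one), the names NO_LINE_START, NO_LINE_END, can_break and wrap are made up, and width is counted in characters rather than screen cells.

```python
# Assumed sets, for illustration only; a real list would be much longer.
NO_LINE_START = set("’”》,。、?!")  # must not begin a line
NO_LINE_END = set("‘“《")               # must not end a line

def can_break(before, after):
    """True if a line break is allowed between two adjacent characters."""
    return before not in NO_LINE_END and after not in NO_LINE_START

def wrap(text, width):
    """Greedy wrap of space-free text to at most `width` characters per line."""
    lines = []
    start = 0
    while len(text) - start > width:
        cut = start + width
        # Move the break to the left until it falls at a permitted place.
        while cut > start + 1 and not can_break(text[cut - 1], text[cut]):
            cut -= 1
        if not can_break(text[cut - 1], text[cut]):
            # No permitted break on this line: fall back to a hard break.
            cut = start + width
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines

for line in wrap("你好,世界。", 2):
    print(line)
```

With a width of 2 the comma and the full stop are each kept at the end of a line instead of being pushed to the start of the next one, at the cost of some shorter lines. Letting such punctuation hang past the margin is another common convention, which this sketch does not attempt.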

Now even if the linebreaking rules could be defined by a Vim script (I suppose they can but I haven't checked it in detail), how best to implement it? Would it be acceptable to define a new filetype, let's say "cjk", for CJK text, and to define those line-breaking rules in something like an "indent/cjk.vim" or "ftplugin/cjk.vim" script? This would, however, still not address the problem of where to break lines in comments within other syntaxes (such as HTML, C, Vim script, etc.) when they contain CJK ideographic text.

Or else, the set of options and option flags (and the set of what they do) might be expanded to allow proper handling of CJK punctuation, but of course this would require patching the C source.


Best regards,
Tony.
