Re: [groff] Accented Cyrillic characters
Werner LEMBERG wrote in <20180802.162932.2121529583718521640...@gnu.org>:
|> There appears to be specific code in groff to explicitly *BREAK* the
|> return value of wcwidth(3).  Actually, egregious mishandling of
|> wcwidth(3) is a quite common error in application programs, so groff
|> is certainly not alone here.
|
|Well... :-)
|
|> I'm not familiar with groff internals either (except for the manual
|> page macroset implementations), but i had a quick look and instantly
|> identified at least three places where wcwidth(3) handling is
|> obviously broken, see the patch below.  That patch is *NOT* intended
|> for commit, but merely for giving others some hints in which areas
|> to look.
|
|Thanks.  Unfortunately, I don't have time to delve into the code,
|sorry.

Well, if i recall the situation correctly, the GNU library which is
now linked into the build provides a function that offers wcwidth()
specifically for UTF-8, which is what groff would need.  That is, it
works even if setlocale() has never been called, or has been called
with "C".  I reported this in 2014, i think; unfortunately i still
have no running fork.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: [groff] Accented Cyrillic characters
> There appears to be specific code in groff to explicitly *BREAK* the
> return value of wcwidth(3).  Actually, egregious mishandling of
> wcwidth(3) is a quite common error in application programs, so groff
> is certainly not alone here.

Well... :-)

> I'm not familiar with groff internals either (except for the manual
> page macroset implementations), but i had a quick look and instantly
> identified at least three places where wcwidth(3) handling is
> obviously broken, see the patch below.  That patch is *NOT* intended
> for commit, but merely for giving others some hints in which areas
> to look.

Thanks.  Unfortunately, I don't have time to delve into the code,
sorry.

    Werner
Re: [groff] Accented Cyrillic characters
Hi Robin,

Robin Haberkorn wrote on Thu, Aug 02, 2018 at 07:47:35PM +0600:

> But for the rest of glyphs, it should IMHO a) make sure that
> accentuation glyphs have a zero-width

There appears to be specific code in groff to explicitly *BREAK* the
return value of wcwidth(3).  Actually, egregious mishandling of
wcwidth(3) is a quite common error in application programs, so groff
is certainly not alone here.

> (Sorry, I'm not that motivated to seriously debug this in the Groff
> sources. Just hoped that somebody would already know what's going
> on here.)

I'm not familiar with groff internals either (except for the manual
page macroset implementations), but i had a quick look and instantly
identified at least three places where wcwidth(3) handling is
obviously broken, see the patch below.  That patch is *NOT* intended
for commit, but merely for giving others some hints in which areas
to look.

On the one hand, it doesn't appear to help yet, there seems to be
yet more breakage elsewhere.  On the other hand, i have no idea
whether these changes would have unintended side effects.  It is
quite likely that the details must be slightly different than my
first draft patch.  But so much is certain, it is wrong to treat
the return values 0 and -1 from wcwidth(3) identically.  That can
almost never be right.

The way wcwidth(3) is mishandled makes it obvious that fixing it
will not be completely trivial.

In the meantime, until groff gets fixed, as a workaround, you can
use mandoc(1) to view your manual pages on the terminal
(mandoc.bsd.lv), which does handle the width of accented Cyrillic
characters correctly inside table columns.

Yours,
  Ingo

 - 8< - schnipp - >8 - 8< - schnapp - >8 -

$ cat tmp3.man
.TH TEST 1
.SH DESCRIPTION
.TS
box;
l.
саморазруше\[u0301]ние
foo bar
.TE
$ LC_CTYPE=C.UTF-8 mandoc tmp3.man
TEST(1)             General Commands Manual             TEST(1)

DESCRIPTION
       +---------------+
       |саморазруше́ние |
       |foo bar        |
       +---------------+

                                                        TEST(1)

 - 8< - schnipp - >8 - 8< - schnapp - >8 -

diff --git a/src/libs/libgroff/font.cpp b/src/libs/libgroff/font.cpp
index 17e6f425..08f29bca 100644
--- a/src/libs/libgroff/font.cpp
+++ b/src/libs/libgroff/font.cpp
@@ -384,6 +384,8 @@ int font::get_width(glyph *g, int point_size)
     // Unicode font
     int width = 24; // XXX: Add a request to override this.
     int w = wcwidth(get_code(g));
+    if (w == 0)
+      return 0;
     if (w > 1)
       width *= w;
     if (real_size == unitwidth || font::unscaled_charwidths)
@@ -962,7 +964,7 @@ int font::load(int *not_found, int head_only)
 	  }
 	  if (is_unicode) {
 	    int w = wcwidth(metric.code);
-	    if (w > 1)
+	    if (w >= 0)
 	      metric.width *= w;
 	  }
 	  p = strtok(0, WS);
diff --git a/src/roff/troff/node.cpp b/src/roff/troff/node.cpp
index 27311b1c..a1ffd394 100644
--- a/src/roff/troff/node.cpp
+++ b/src/roff/troff/node.cpp
@@ -509,6 +509,8 @@ tfont_spec tfont_spec::plain()
 
 hunits tfont::get_width(charinfo *c)
 {
+  if (fm->get_width(c->as_glyph(), size.to_scaled_points()) == 0)
+    return 0;
   if (is_constant_spaced)
     return constant_space_width;
   else if (is_bold)
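
To illustrate why the return values 0 and -1 of wcwidth(3) need
separate handling, here is a minimal Python sketch.  It is not groff
code and it does not call the C library: char_width() is a
hypothetical stand-in that approximates wcwidth(3) from Unicode
character properties, and display_width() is likewise an illustrative
helper.  A width of 0 means "printable, but occupies no column"
(a combining accent), while -1 means "not printable at all".

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate wcwidth(3): -1 non-printable, 0 combining, 2 wide, else 1."""
    category = unicodedata.category(ch)
    if category in ("Cc", "Cs", "Cn"):   # control, surrogate, unassigned
        return -1
    if category in ("Mn", "Me"):         # non-spacing and enclosing marks
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_width(text: str) -> int:
    """Sum the column widths; a -1 is an error, a 0 simply adds nothing."""
    total = 0
    for ch in text:
        w = char_width(ch)
        if w < 0:
            raise ValueError(f"non-printable character U+{ord(ch):04X}")
        total += w
    return total


word = "саморазруше\u0301ние"
print(len(word), display_width(word))   # 15 code points, 14 columns
```

Folding the 0 case into the -1 case (or into the default width of one
column) is what makes a table cell containing U+0301 come out one
column too wide.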
Re: [groff] Accented Cyrillic characters
> I tried adding a line like
>
>   u0301 0 0 0xCC81
>
> to the R font for devutf8.  But it doesn't work.

Right idea, wrong code point :-)  See my other e-mail.

    Werner
Re: [groff] Accented Cyrillic characters
Hi Robin,

> I tried adding a line like
>   u0301 0 0 0xCC81
> to the R font for devutf8.  But it doesn't work.  How does grotty
> interpret the code?  They are obviously not simply UTF-8 bytes.

groff_font(5) explains the format under `charset'.  You've put `0xCC81'
because it's the UTF-8 for U+0301, but the number is the code for `\N',
so you want `0x0301'.

Here's the first entry.  You should be able to spot what's going on.

    u0041_0300      24      0       0x00C0

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
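
The difference between the two numbers can be checked with a few
lines of Python.  This is only an illustration of code point versus
UTF-8 encoding, not of anything groff does internally; the variable
names are arbitrary.

```python
import unicodedata

accent = "\u0301"                      # COMBINING ACUTE ACCENT

# The UTF-8 encoding is a two-byte sequence ...
utf8_bytes = accent.encode("utf-8")
print(utf8_bytes.hex())                # cc81

# ... whereas the code point itself is a single number.
code_point = ord(accent)
print(f"0x{code_point:04X}")           # 0x0301

# Read as a code point, 0xCC81 is an unrelated character.
wrong = chr(0xCC81)
print(unicodedata.name(wrong).startswith("HANGUL SYLLABLE"))   # True
```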
Re: [groff] Accented Cyrillic characters
> It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
> no space.
>
>     $ groff -Tutf8 >/dev/null
>     .nr w \w'A'
>     .tm \nw
>     24
>     .nr w \w'\[u0435]'
>     .tm \nw
>     24
>     .nr w \w'\[u0435]\[u0301]'
>     .tm \nw
>     48
>     $

Indeed.  I think this is a bug in groff: The devutf8 font files don't
contain non-spacing glyphs.  If you manually enter the line

  u0301 0 0 0x0301

to the *installed* utf8 device files `.../font/devutf8/{R,I,B,BI}',
the problem vanishes.  Similar lines would be necessary for all other
latin, non-spacing glyphs.

Note that currently the script `font/scripts/genfonts.sh' doesn't
handle an entry `0' in the second column correctly, always overwriting
it with `24' for devutf8; this prevents the proper solution to fix
`font/devutf8/R.proto' directly.

    Werner

PS: It seems that the files `dev{utf8,html}/R.in' are no longer in use.
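
As a rough sketch of what "similar lines" could look like, the
following Python snippet prints zero-width charset entries for the
Combining Diacritical Marks block (U+0300 to U+036F).  The function
name, the choice of that block, and the use of tabs as field
separators are assumptions made for illustration; only the field
order (name, width, type, code) is taken from the entries quoted in
this thread.

```python
import unicodedata


def nonspacing_entries(first: int = 0x0300, last: int = 0x036F) -> list[str]:
    """Build 'name width type code' lines with width 0 for non-spacing marks."""
    lines = []
    for cp in range(first, last + 1):
        # Only emit code points that Unicode classifies as non-spacing marks.
        if unicodedata.category(chr(cp)) == "Mn":
            lines.append(f"u{cp:04X}\t0\t0\t0x{cp:04X}")
    return lines


entries = nonspacing_entries()
print("\n".join(entries[:3]))
```

The output would still have to be added to each of the installed
R, I, B and BI files by hand, since genfonts.sh currently rewrites
the width column.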
Re: [groff] Accented Cyrillic characters
Hello Ralph!

I see!  Groff seems to combine composites to single code points if
possible, probably in order to better support terminals and/or software
that cannot themselves combine them.  Makes sense.

But for the rest of glyphs, it should IMHO a) make sure that
accentuation glyphs have a zero-width and b) not drop them from
composite Unicode escapes.  Why is there even something like composite
support, where you can even specify Unicode points, if they are always
reduced to a single code point in the end?

I tried adding a line like

  u0301 0 0 0xCC81

to the R font for devutf8.  But it doesn't work.  How does grotty
interpret the code?  They are obviously not simply UTF-8 bytes.

(Sorry, I'm not that motivated to seriously debug this in the Groff
sources.  Just hoped that somebody would already know what's going on
here.)

Best regards,
Robin

02.08.2018 17:26, Ralph Corderoy wrote:
> Hello Robin!
>
>> Currently, I'm just adding a standalone UTF composite accent character
>> (U+0301) after every vowel I want to show stress on since Unicode does
>> not seem to define separate codepoints for all of the Cyrillic
>> accented vowels.
>
> That's the recommendation in
> https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
>
>> the terminal emulator (at least URXVT) will combine the accent and the
>> vowel into a single glyph.
>
> xterm(1) does too.  libvte-based terminals seem to place it on the line
> above!?
>
>> This approach of adding accents causes problems with tbl, though. The
>> combination of the two characters into a single glyph screws up tbl's
>> (and/or Groff's) assumptions. For instance, in a table like:
>> | саморазруше́ние |
>> | foo bar        |
>> the bars won't properly line up.
>
> It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
> no space.
>
>     $ groff -Tutf8 >/dev/null
>     .nr w \w'A'
>     .tm \nw
>     24
>     .nr w \w'\[u0435]'
>     .tm \nw
>     24
>     .nr w \w'\[u0435]\[u0301]'
>     .tm \nw
>     48
>     $
>
> Tricks like overstrike with `\o' and moving left with \h affect the \w
> but don't give the desired output because grotty(1) also processes them.
>
>> For instance, \[u0435_0301] should theoretically also format as an
>> accented Cyrillic e. But what happens instead is that the accent is
>> dropped during formatting. Curiously, this works when using latin
>> characters. For instance, \[e u0301], \[e aa], \[e '] will result in a
>> properly accented latin e.
>
> I think those are mapped onto their Unicode rune, and as you start by
> saying, there isn't one for U+0435 combined with U+0301.
>
>     $ cd /usr/share/groff/1.22.3/font/devutf8
>     $ grep 0435 R
>     u0435_0300      24      0       0x0450
>     u0435_0308      24      0       0x0451
>     u0435_0306      24      0       0x04D7
>     $ grep '0045.*0301' R
>     u0045_0301      24      0       0x00C9
>     u0045_0304_0301 24      0       0x1E16
>     u0045_0302_0301 24      0       0x1EBE
>     $
>
> I look forward to solutions and workarounds from the others here.  :-)
>
Re: [groff] Accented Cyrillic characters
Hello Robin!

> Currently, I'm just adding a standalone UTF composite accent character
> (U+0301) after every vowel I want to show stress on since Unicode does
> not seem to define separate codepoints for all of the Cyrillic
> accented vowels.

That's the recommendation in
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode

> the terminal emulator (at least URXVT) will combine the accent and the
> vowel into a single glyph.

xterm(1) does too.  libvte-based terminals seem to place it on the line
above!?

> This approach of adding accents causes problems with tbl, though. The
> combination of the two characters into a single glyph screws up tbl's
> (and/or Groff's) assumptions. For instance, in a table like:
> | саморазруше́ние |
> | foo bar        |
> the bars won't properly line up.

It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
no space.

    $ groff -Tutf8 >/dev/null
    .nr w \w'A'
    .tm \nw
    24
    .nr w \w'\[u0435]'
    .tm \nw
    24
    .nr w \w'\[u0435]\[u0301]'
    .tm \nw
    48
    $

Tricks like overstrike with `\o' and moving left with \h affect the \w
but don't give the desired output because grotty(1) also processes them.

> For instance, \[u0435_0301] should theoretically also format as an
> accented Cyrillic e. But what happens instead is that the accent is
> dropped during formatting. Curiously, this works when using latin
> characters. For instance, \[e u0301], \[e aa], \[e '] will result in a
> properly accented latin e.

I think those are mapped onto their Unicode rune, and as you start by
saying, there isn't one for U+0435 combined with U+0301.

    $ cd /usr/share/groff/1.22.3/font/devutf8
    $ grep 0435 R
    u0435_0300      24      0       0x0450
    u0435_0308      24      0       0x0451
    u0435_0306      24      0       0x04D7
    $ grep '0045.*0301' R
    u0045_0301      24      0       0x00C9
    u0045_0304_0301 24      0       0x1E16
    u0045_0302_0301 24      0       0x1EBE
    $

I look forward to solutions and workarounds from the others here.  :-)

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
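
The same gap shows up in Unicode's own composition data, which can be
queried from Python.  This is an analogy only: it uses NFC
normalization from the standard library and says nothing about how
groff builds its composite table.  The helper name precomposed() is
made up for this sketch.

```python
import unicodedata


def precomposed(base: str, mark: str):
    """Return the single precomposed character for base+mark, or None."""
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


# Latin E + acute and Cyrillic ie + grave both have precomposed forms ...
print(precomposed("E", "\u0301") == "\u00C9")        # True
print(precomposed("\u0435", "\u0300") == "\u0450")   # True

# ... but Cyrillic ie + acute does not, so two code points remain.
print(precomposed("\u0435", "\u0301"))               # None
```

These results line up with the grep output above: the R file lists
u0045_0301 and u0435_0300, but has no u0435_0301 entry.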
[groff] Accented Cyrillic characters
Hello!

I'm working on a small Russian offline dictionary that formats the
entries of words into Troff/Man pages, so you can view them in the
terminal.

There is a small problem when trying to format accented Cyrillic
characters.  Accents are commonly used in Russian to highlight word
stress by placing them on the stressed syllable's first vowel.
Currently, I'm just adding a standalone UTF composite accent character
(U+0301) after every vowel I want to show stress on since Unicode does
not seem to define separate codepoints for all of the Cyrillic accented
vowels.  AFAIK, the accent is not really interpreted by Groff - to it,
it will seem like a standalone glyph.  But the terminal emulator (at
least URXVT) will combine the accent and the vowel into a single glyph.
For instance

  саморазруше\[u0301]ние

will effectively render as саморазруше́ние.

This approach of adding accents causes problems with tbl, though.  The
combination of the two characters into a single glyph screws up tbl's
(and/or Groff's) assumptions.  For instance, in a table like:

| саморазруше́ние |
| foo bar        |

the bars won't properly line up.  It will probably cause other more
subtle formatting issues as well, but that's where I personally caught
it.

I tried to use the Groff Unicode composite syntax, so it becomes clear
to Groff that the accented character is a single glyph.  For instance,
\[u0435_0301] should theoretically also format as an accented Cyrillic
e.  But what happens instead is that the accent is dropped during
formatting.  Curiously, this works when using latin characters.  For
instance, \[e u0301], \[e aa], \[e '] will result in a properly accented
latin e.

Why is that so?  Did I catch a grotty bug here?  Do you know any
workaround I could employ?

Best regards,
Robin
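
The approach described above, appending U+0301 after the stressed
vowel, might be sketched in Python as follows.  The function
mark_stress(), its vowel_index parameter and the vowel list are
hypothetical and not taken from the dictionary's actual code; the
sketch only shows how the \[u0301] escape ends up in the roff source.

```python
RUSSIAN_VOWELS = "аеёиоуыэюя"


def mark_stress(word: str, vowel_index: int) -> str:
    """Insert the groff escape for U+0301 right after the stressed vowel."""
    if word[vowel_index].lower() not in RUSSIAN_VOWELS:
        raise ValueError(f"{word[vowel_index]!r} is not a vowel")
    cut = vowel_index + 1
    return word[:cut] + r"\[u0301]" + word[cut:]


# Stress falls on the 'е' at index 10.
print(mark_stress("саморазрушение", 10))   # саморазруше\[u0301]ние
```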