Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Steffen Nurpmeso
Werner LEMBERG wrote in <20180802.162932.2121529583718521640...@gnu.org>:
 |> There appears to be specific code in groff to explicitly *BREAK* the
 |> return value of wcwidth(3).  Actually, egregious mishandling of
 |> wcwidth(3) is a quite common error in application programs, so groff
 |> is certainly not alone here.
 |
 |Well... :-)
 |
 |> I'm not familiar with groff internals either (except for the manual
 |> page macroset implementations), but i had a quick look and instantly
 |> identified at least three places where wcwidth(3) handling is
 |> obviously broken, see the patch below.  That patch is *NOT* intended
 |> for commit, but merely for giving others some hints in which areas
 |> to look.
 |
 |Thanks.  Unfortunately, I don't have time to delve into the code,
 |sorry.

Well if i recall the situation then that GNU library which is now
linked into the build provides a function that actually offers
wcwidth() specifically for UTF-8, which is what groff would need.
Even if setlocale() has never been called that is, or called with
"C".  I have reported this in 2014 i think, unfortunately i still
have no running fork.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Werner LEMBERG


> There appears to be specific code in groff to explicitly *BREAK* the
> return value of wcwidth(3).  Actually, egregious mishandling of
> wcwidth(3) is a quite common error in application programs, so groff
> is certainly not alone here.

Well... :-)

> I'm not familiar with groff internals either (except for the manual
> page macroset implementations), but i had a quick look and instantly
> identified at least three places where wcwidth(3) handling is
> obviously broken, see the patch below.  That patch is *NOT* intended
> for commit, but merely for giving others some hints in which areas
> to look.

Thanks.  Unfortunately, I don't have time to delve into the code,
sorry.


Werner



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Ingo Schwarze
Hi Robin,

Robin Haberkorn wrote on Thu, Aug 02, 2018 at 07:47:35PM +0600:

> But for the rest of glyphs, it should IMHO a) make sure that
> accentuation glyphs have a zero-width

There appears to be specific code in groff to explicitly *BREAK*
the return value of wcwidth(3).  Actually, egregious mishandling
of wcwidth(3) is a quite common error in application programs, so
groff is certainly not alone here.

> (Sorry, I'm not that motivated to seriously debug this in the Groff
> sources.  Just hoped that somebody would already know what's going
> on here.)

I'm not familiar with groff internals either (except for the manual
page macroset implementations), but i had a quick look and instantly
identified at least three places where wcwidth(3) handling is obviously
broken, see the patch below.  That patch is *NOT* intended for commit,
but merely for giving others some hints in which areas to look.

On the one hand, it doesn't appear to help yet; there seems to be
yet more breakage elsewhere.  On the other hand, i have no idea
whether these changes would have unintended side effects.  It is
quite likely that the details must be slightly different from my
first draft patch.  But this much is certain: it is wrong to treat
the return values 0 and -1 from wcwidth(3) identically.  That can
almost never be right.
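
To make the distinction concrete, here is a minimal Python sketch of the
scaling rule before and after the second hunk of the draft patch below.
The function names are hypothetical and `w' merely stands in for whatever
wcwidth(3) returned; this is an illustration, not groff code.

```python
def old_scaled_width(base_width, w):
    # Old rule: only w > 1 scales the width, so the results 0 and -1
    # are both treated like 1 and keep the full base width.
    return base_width * w if w > 1 else base_width


def new_scaled_width(base_width, w):
    # Draft rule: any w >= 0 scales the width, so w == 0 yields 0;
    # a negative w (no usable width) leaves the base width untouched.
    return base_width * w if w >= 0 else base_width


# Compare both rules for the interesting return values.
for w in (-1, 0, 1, 2):
    print(w, old_scaled_width(24, w), new_scaled_width(24, w))
```

Only the row for w == 0 differs between the two rules, which is exactly
the case of a non-spacing accent.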

The way wcwidth(3) is mishandled makes it obvious that fixing it
will not be completely trivial.

In the meantime, until groff gets fixed, as a workaround, you can
use mandoc(1) to view your manual pages on the terminal (mandoc.bsd.lv),
which does handle the width of accented cyrillic characters correctly
inside table columns.

Yours,
  Ingo


 - 8< - schnipp - >8 - 8< - schnapp - >8 -


 $ cat tmp3.man
.TH TEST 1
.SH DESCRIPTION
.TS
box;
l.
саморазруше\[u0301]ние
foo bar
.TE

 $ LC_CTYPE=C.UTF-8 mandoc tmp3.man
TEST(1)                 General Commands Manual                TEST(1)



DESCRIPTION

   +---------------+
   |саморазруше́ние |
   |foo bar        |
   +---------------+


   TEST(1)


 - 8< - schnipp - >8 - 8< - schnapp - >8 -


diff --git a/src/libs/libgroff/font.cpp b/src/libs/libgroff/font.cpp
index 17e6f425..08f29bca 100644
--- a/src/libs/libgroff/font.cpp
+++ b/src/libs/libgroff/font.cpp
@@ -384,6 +384,8 @@ int font::get_width(glyph *g, int point_size)
     // Unicode font
     int width = 24; // XXX: Add a request to override this.
     int w = wcwidth(get_code(g));
+    if (w == 0)
+      return 0;
     if (w > 1)
       width *= w;
     if (real_size == unitwidth || font::unscaled_charwidths)
@@ -962,7 +964,7 @@ int font::load(int *not_found, int head_only)
         }
         if (is_unicode) {
           int w = wcwidth(metric.code);
-          if (w > 1)
+          if (w >= 0)
             metric.width *= w;
         }
         p = strtok(0, WS);
diff --git a/src/roff/troff/node.cpp b/src/roff/troff/node.cpp
index 27311b1c..a1ffd394 100644
--- a/src/roff/troff/node.cpp
+++ b/src/roff/troff/node.cpp
@@ -509,6 +509,8 @@ tfont_spec tfont_spec::plain()
 
 hunits tfont::get_width(charinfo *c)
 {
+  if (fm->get_width(c->as_glyph(), size.to_scaled_points()) == 0)
+    return 0;
   if (is_constant_spaced)
     return constant_space_width;
   else if (is_bold)



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Werner LEMBERG


> I tried adding a line like
>
>   u0301 0 0 0xCC81
>
> to the R font for devutf8.  But it doesn't work.

Right idea, wrong code point :-)  See my other e-mail.


Werner



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Ralph Corderoy
Hi Robin,

> I tried adding a line like
> u0301 0 0 0xCC81
> to the R font for devutf8.  But it doesn't work. How does grotty
> interpret the code? They are obviously not simply UTF-8 bytes.

groff_font(5) explains the format under `charset'.
You've put `0xCC81' because it's the UTF-8 for U+0301,
but the number is the code for `\N', so you want `0x0301'.

Here's the first entry.  You should be able to spot what's going on.

u0041_0300  24  0   0x00C0
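
If it helps, the difference between the two numbers can be checked with
a few lines of Python.  This is only a sketch to show where `0xCC81'
came from; the variable names are mine and nothing here is part of groff.

```python
accent = "\u0301"  # COMBINING ACUTE ACCENT

# The Unicode code point: this is the kind of number the font file wants.
code_point = f"0x{ord(accent):04X}"

# The UTF-8 byte sequence for the same character: this is what was tried.
utf8_bytes = "0x" + accent.encode("utf-8").hex().upper()

print(code_point)
print(utf8_bytes)
```

The first value printed is the code point, the second is its two-byte
UTF-8 encoding.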

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Werner LEMBERG


> It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
> no space.
> 
> $ groff -Tutf8 >/dev/null
> .nr w \w'A'   
> .tm \nw 
> 24
> .nr w \w'\[u0435]'
> .tm \nw 
> 24 
> .nr w \w'\[u0435]\[u0301]'
> .tm \nw  
> 48 
> $

Indeed.  I think this is a bug in groff: The devutf8 font files don't
contain non-spacing glyphs.  If you manually enter the line

  u0301   0   0   0x0301

to the *installed* utf8 device files `.../font/devutf8/{R,I,B,BI}',
the problem vanishes.  Similar lines would be necessary for all other
latin, non-spacing glyphs.
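
As a rough sketch of how such lines could be generated rather than typed
by hand, the following Python snippet emits one zero-width entry per
non-spacing mark.  The helper name is hypothetical, and the choice of the
range U+0300..U+036F (the Combining Diacritical Marks block) is my
assumption about which glyphs would be wanted, not a tested list.

```python
import unicodedata


def nonspacing_font_lines(first=0x0300, last=0x036F):
    """Return font-file lines of the form shown above, width 0."""
    lines = []
    for cp in range(first, last + 1):
        # Keep only code points whose general category is Mn
        # (Mark, nonspacing); unassigned code points are skipped.
        if unicodedata.category(chr(cp)) == "Mn":
            lines.append(f"u{cp:04X}\t0\t0\t0x{cp:04X}")
    return lines


# Show the first few generated entries.
for line in nonspacing_font_lines()[:3]:
    print(line)
```

The output would still have to be merged into each of the four font
files by hand.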

Note that currently the script `font/scripts/genfonts.sh' doesn't
handle an entry `0' in the second column correctly, always overwriting
it with `24' for devutf8; this rules out the proper solution, which
would be to fix `font/devutf8/R.proto' directly.


Werner


PS: It seems that the files `dev{utf8,html}/R.in' are no longer in
use.



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Robin Haberkorn
Hello Ralph!

I see! Groff seems to combine composites to single code points if possible,
probably in order to better support terminals and/or software that cannot
themselves combine them. Makes sense.
But for the rest of glyphs, it should IMHO a) make sure that accentuation glyphs
have a zero-width and b) not drop them from composite Unicode escapes. Why is
there even something like composite support, where you can even specify Unicode
points, if they are always reduced to a single code point in the end?

I tried adding a line like
u0301 0 0 0xCC81
to the R font for devutf8.
But it doesn't work. How does grotty interpret the code? They are obviously not
simply UTF-8 bytes.
(Sorry, I'm not that motivated to seriously debug this in the Groff sources.
Just hoped that somebody would already know what's going on here.)

Best regards,
Robin

02.08.2018 17:26, Ralph Corderoy wrote:
> Hello Robin!
> 
>> Currently, I'm just adding a standalone UTF composite accent character
>> (U+0301) after every vowel I want to show stress on since Unicode does
>> not seem to define separate codepoints for all of the Cyrillic
>> accented vowels.
> 
> That's the recommendation in
> https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> 
>> the terminal emulator (at least URXVT) will combine the accent and the
>> vowel into a single glyph.
> 
> xterm(1) does too.  libvte-based terminals seem to place it on the line
> above!?
> 
>> This approach of adding accents causes problems with tbl, though. The
>> combination of the two characters into a single glyph screws up tbl's
>> (and/or Groff's) assumptions. For instance, in a table like:
>> | саморазруше́ние |
>> | foo bar |
>> the bars won't properly line up.
> 
> It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
> no space.
> 
> $ groff -Tutf8 >/dev/null
> .nr w \w'A'   
> .tm \nw 
> 24
> .nr w \w'\[u0435]'
> .tm \nw 
> 24 
> .nr w \w'\[u0435]\[u0301]'
> .tm \nw  
> 48 
> $
> 
> Tricks like overstrike with `\o' and moving left with \h affect the \w
> but don't give the desired output because grotty(1) also processes them.
> 
>> For instance, \[u0435_0301] should theoretically also format as an
>> accented Cyrillic e.  But what happens instead is that the accent is
>> dropped during formatting.  Curiously, this works when using latin
>> characters. For instance, \[e u0301], \[e aa], \[e '] will result in a
>> properly accented latin e.
> 
> I think those are mapped onto their Unicode rune, and as you start by
> saying, there isn't one for U+0435 combined with U+0301.
> 
> $ cd /usr/share/groff/1.22.3/font/devutf8
> $ grep 0435 R
> u0435_030024  0   0x0450
> u0435_030824  0   0x0451
> u0435_030624  0   0x04D7
> $ grep '0045.*0301' R 
> u0045_0301  24  0   0x00C9
> u0045_0304_0301 24  0   0x1E16
> u0045_0302_0301 24  0   0x1EBE
> $
> 
> I look forward to solutions and workarounds from the others here.  :-)
> 



Re: [groff] Accented Cyrillic characters

2018-08-02 Thread Ralph Corderoy
Hello Robin!

> Currently, I'm just adding a standalone UTF composite accent character
> (U+0301) after every vowel I want to show stress on since Unicode does
> not seem to define separate codepoints for all of the Cyrillic
> accented vowels.

That's the recommendation in
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode

> the terminal emulator (at least URXVT) will combine the accent and the
> vowel into a single glyph.

xterm(1) does too.  libvte-based terminals seem to place it on the line
above!?

> This approach of adding accents causes problems with tbl, though. The
> combination of the two characters into a single glyph screws up tbl's
> (and/or Groff's) assumptions. For instance, in a table like:
> | саморазруше́ние |
> | foo bar |
> the bars won't properly line up.

It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
no space.

$ groff -Tutf8 >/dev/null
.nr w \w'A'   
.tm \nw 
24
.nr w \w'\[u0435]'
.tm \nw 
24 
.nr w \w'\[u0435]\[u0301]'
.tm \nw  
48 
$
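
For comparison, here is a small Python sketch of the widths one would
hope to see instead.  It assumes 24 units per character cell, as in the
output above, and simply counts combining marks (categories Mn and Me) as
zero cells; wide East Asian characters are ignored, and the function name
is made up for this example.

```python
import unicodedata

UNIT = 24  # units per character cell on devutf8, taken from the output above


def expected_width(text):
    """Width in units if combining marks occupied no cell."""
    cells = sum(
        0 if unicodedata.category(ch) in ("Mn", "Me") else 1
        for ch in text
    )
    return cells * UNIT


print(expected_width("A"))
print(expected_width("\u0435"))
print(expected_width("\u0435\u0301"))
```

With this rule the third case comes out the same as the second, which is
what tbl(1) would need for the columns to line up.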

Tricks like overstrike with `\o' and moving left with \h affect the \w
but don't give the desired output because grotty(1) also processes them.

> For instance, \[u0435_0301] should theoretically also format as an
> accented Cyrillic e.  But what happens instead is that the accent is
> dropped during formatting.  Curiously, this works when using latin
> characters. For instance, \[e u0301], \[e aa], \[e '] will result in a
> properly accented latin e.

I think those are mapped onto their Unicode rune, and as you start by
saying, there isn't one for U+0435 combined with U+0301.

$ cd /usr/share/groff/1.22.3/font/devutf8
$ grep 0435 R
u0435_0300  24  0   0x0450
u0435_0308  24  0   0x0451
u0435_0306  24  0   0x04D7
$ grep '0045.*0301' R 
u0045_0301  24  0   0x00C9
u0045_0304_0301 24  0   0x1E16
u0045_0302_0301 24  0   0x1EBE
$
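
The same pattern shows up in Unicode's own composition data, which can
be checked with Python's unicodedata module.  This is only a sketch that
uses NFC normalization as a stand-in; groff works from the font file
entries above, not from this function, and the helper name is mine.

```python
import unicodedata


def compose(base, accent):
    """Try to combine base + accent into one precomposed code point."""
    return unicodedata.normalize("NFC", base + accent)


latin = compose("E", "\u0301")          # Latin E + acute
cyrillic = compose("\u0435", "\u0301")  # Cyrillic ie + acute
grave = compose("\u0435", "\u0300")     # Cyrillic ie + grave

print(len(latin), f"U+{ord(latin):04X}")
print(len(cyrillic))
print(len(grave), f"U+{ord(grave):04X}")
```

The Latin pair and the Cyrillic grave pair each collapse to a single
code point, while Cyrillic ie plus acute stays as two.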

I look forward to solutions and workarounds from the others here.  :-)

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy



[groff] Accented Cyrillic characters

2018-08-01 Thread Robin Haberkorn
Hello!

I'm working on a small Russian offline dictionary that formats the entries of
words into Troff/Man pages, so you can view them in the terminal.

There is a small problem when trying to format accented Cyrillic characters.
Accents are commonly used in Russian to highlight word stress by placing them on
the stressed syllable's first vowel.
Currently, I'm just adding a standalone UTF composite accent character (U+0301)
after every vowel I want to show stress on since Unicode does not seem to define
separate codepoints for all of the Cyrillic accented vowels.
AFAIK, the accent is not really interpreted by Groff - to it, it will seem like
a standalone glyph. But the terminal emulator (at least URXVT) will combine the
accent and the vowel into a single glyph.
For instance саморазруше\[u0301]ние will effectively render as саморазруше́ние.

This approach of adding accents causes problems with tbl, though. The
combination of the two characters into a single glyph screws up tbl's (and/or
Groff's) assumptions. For instance, in a table like:
| саморазруше́ние |
| foo bar |
the bars won't properly line up.
It will probably cause other more subtle formatting issues as well, but that's
where I personally caught it.

I tried to use the Groff Unicode composite syntax, so it becomes clear to Groff
that the accented character is a single glyph. For instance,
\[u0435_0301] should theoretically also format as an accented Cyrillic e.
But what happens instead is that the accent is dropped during formatting.
Curiously, this works when using latin characters. For instance, \[e u0301],
\[e aa], \[e '] will result in a properly accented latin e.

Why is that so? Did I catch a grotty bug here?
Do you know any workaround I could employ?

Best regards,
Robin