On Sun, Oct 08, 2023 at 08:45:11PM +0300, Eli Zaretskii wrote:
> > From: Gavin Smith <[email protected]>
> > Date: Sun, 8 Oct 2023 18:29:23 +0100
> > Cc: [email protected]
> >
> > On Sun, Oct 08, 2023 at 07:31:12PM +0300, Eli Zaretskii wrote:
> > > I see a very large diff, full of non-ASCII characters. A typical hunk
> > > is below:
> > >
> > > -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> > > -(ȷ) ‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
> > > -‘@tieaccent{a}’ a͡ ‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ
> > > -(ạ) ‘@v{a}’ ǎ (ǎ) @,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> > > +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> > > (ȷ)
> > > +‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
> > > ‘@tieaccent{a}’ a͡
> > > +‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ (ạ) ‘@v{a}’ ǎ (ǎ)
> > > +@,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> > >
> > > It looks like a filling problem to me, perhaps because something
> > > counts bytes instead of characters?
> >
> > It's almost certainly a problem with filling as you say. In the C (XS)
> > code, the return value of wcwidth is used for each character to get
> > the width of each line. The pure Perl code doesn't use the wcwidth
> > function as far as I know but keeps a count for each line based on
> > regex character classes. The relevant code is in
> > Texinfo/Convert/Unicode.pm, in the 'string_width' function.
>
> So perhaps the wcwidth function is the culprit. I'm guessing that it
> returns 1 for every printable character in my case.

Just comparing the first line in the hunk:

-(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
+(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
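
the line you are getting is longer than the one in the reference results.

I wonder if wcwidth is returning 0 or -1 for some of the non-ASCII
characters. Those characters would then not be counted towards the line
width, so more text would fit on the line before it is broken.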
It's also possible that other codepoints have inconsistent wcwidth results,
especially for combining accents.
Do you know if it is the gnulib implementation of wcwidth that is being
used or a MinGW one?
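
To illustrate the effect I have in mind, here is a rough Python sketch. It
is not the Texinfo code: `reference_width`, `broken_width`, `line_width` and
`fill` are all made-up names, the "broken" function is only a guess at how a
faulty wcwidth might behave, and treating a negative width as 0 is likewise
an assumption about how a filler might cope with it.

```python
import unicodedata


def reference_width(ch):
    """Columns for one character: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def broken_width(ch):
    """Hypothetical faulty wcwidth: -1 for anything outside ASCII."""
    return 1 if ord(ch) < 128 else -1


def line_width(text, char_width):
    # Assumption: non-positive widths contribute nothing to the line width.
    return sum(max(char_width(ch), 0) for ch in text)


def fill(words, char_width, max_cols):
    """Greedy filling: start a new line when the next word would not fit."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and line_width(candidate, char_width) > max_cols:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# Dotless i (U+0131) and dotless j (U+0237), as in the hunk above.
words = ["\u0131", "(\u0131)", "\u0237", "(\u0237)"]

print(fill(words, reference_width, 8))  # two lines: the last word wraps
print(fill(words, broken_width, 8))     # one line: the last word still "fits"

# A base letter plus a combining double acute is two codepoints but one column.
print(len("a\u030b"), line_width("a\u030b", reference_width))  # 2 1
```

With the under-counting width function the final "(ȷ)" stays on the first
line, which is the same shape as the difference in the hunk. A wcwidth that
returned 1 for combining accents would err the other way and break lines
early.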