Re: Texinfo 7.1.0.90 pretest results [mingw]

2024-06-17 Thread Eli Zaretskii
> From: Bruno Haible 
> Date: Tue, 18 Jun 2024 00:24:12 +0200
> 
> For mingw, though, I'll stay with a native Windows perl. That's the point of
> a native Windows build.

Right.  And you cannot really build the XS extensions with MinGW
against a non-native Perl anyway.



Re: declaring function pointers with explicit prototypes for the info reader

2024-06-16 Thread Eli Zaretskii
> Date: Sun, 16 Jun 2024 16:29:10 +0200
> From: Patrice Dumas 
> 
> In standalone info reader code in info/ most function pointers are
> declared as a generic function pointer VFunction *, defined in info.h as
> 
> typedef void VFunction ();
> 
> I think that it would be much better to use actual prototypes depending
> on the functions to have type checking by the compiler.  I started doing
> that and did not find any evident issue with having explicit prototypes,
> but I may be missing something.
> 
> Would there be any reason not to have explicit prototypes?

If the code passes function pointers to other functions, or stores
function pointers in arrays, the prototypes of those functions will
have to match one another and/or the parameter types of the functions
they are passed to.  So beware when two functions with different
signatures are placed in the same array or passed as arguments to the
same function: with explicit prototypes the compiler will emit at
least a warning, if not an error.



Re: menu and sectioning consistency warning too strict?

2024-04-11 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 10 Apr 2024 23:53:31 +0100
> 
> I agree that the warning is not really necessary.  I don't mind
> either way.  It's up to you if you want to try to remove the warning.
> It's questionable whether a lone @node without a following sectioning
> command is proper Texinfo, or what these constructs mean or how they
> should be output.

IMO, removing it would be a regression.  Either we should have a
separate setting for warning about missing sectioning commands, or the
warning should stay (or be replaced by a smarter one, like I wrote in
my previous message).

Most Texinfo manuals are intended to have a sectioning command in each
node, so this warning catches mistakes and is thus IMO valuable in a
vast majority of cases.



Re: menu and sectioning consistency warning too strict?

2024-04-11 Thread Eli Zaretskii
> Date: Wed, 10 Apr 2024 21:57:19 +0200
> From: Patrice Dumas 
> 
> With CHECK_NORMAL_MENU_STRUCTURE set to 1, there is a warning by
> texi2any:
> 
> a.texi:10: warning: node `node after chap1' is next for `chap1' in menu but 
> not in sectioning
> 
> for the following code:
> 
> @node Top
> @top top
> 
> @menu
> * chap1::
> * node after chap1::
> @end menu
> 
> @node chap1
> @chapter Chapter 1
> 
> @node node after chap1,, chap1, Top

AFAIU, the warning tells you that @chapter is missing in node after
chap1.

> I am not sure that this warning is warranted; this code seems ok to
> me.  The lone node is not fully consistent with the sectioning
> structure, but not that inconsistent either.

I don't think I agree, since @chapter is missing in the second node.

> If there is another chapter after the lone node, there are two warnings,
> but this seems ok to me, as in that case, there is a clearer
> inconsistency, since with sectioning there is this time a different next:
> 
> b.texi:10: warning: node next pointer for `chap1' is `chap2' but next is 
> `node after chap1' in menu
> b.texi:15: warning: node prev pointer for `chap2' is `chap1' but prev is 
> `node after chap1' in menu

Which again tells you the same: @chapter is missing in node after
chap1.

> Should I try to remove the warning with a lone node at the end?

IMO, no, not unless you replace it with a smarter warning that
explicitly says @chapter is missing in the second node.



Re: organization of the documentation of customization variables

2024-03-27 Thread Eli Zaretskii
> Date: Tue, 26 Mar 2024 23:20:23 +0100
> From: Patrice Dumas 
> 
> > I took the list and tried to sort it into sections.  I may not have
> > done an especially good job of this, and there will likely be misplaced
> > variables.  I suggest this could be taken as a starting point for
> > reorganising the manual.
> 
> I started from that and did two nodes, as can be seen in the commit
> https://git.savannah.gnu.org/cgit/texinfo.git/commit/?id=c0a8822909514e947cefc7112986a2e704a023d0
> 
> Before I continue, is what I did the expected content?

From where I stand, yes, it's a very good starting point, thanks.

> Here is what I propose:
> 
> * move HTML customization variables explanations to the 'Generating HTML'
>   chapter, either in an already existing section where they would be
>   inserted naturally (for example in the 'HTML CSS' section for
>   customization variables related to CSS) or to new sections or
>   subsections.  For example, I think that the new 'HTML Output Structure
>   Customization' node could be before 'Generating EPUB' or together with
>   'HTML Splitting', while 'File Names and Links Customization for HTML'
>   could be after 'HTML Cross-references' probably with other
>   customization variables nodes.
> * Move the 'HTML Customization Variables List' node to an appendix.

SGTM.  I think making the customization stuff separate subsections
would be better, as many users will not need that in the mainline
reading of the manual.  But that's a weak preference.



Re: Build from git broken - missing gperf?

2024-02-05 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Mon, 5 Feb 2024 19:35:59 +
> 
> I found it was being rebuilt by "make" because a dependency was updated:
> 
> $ ls -l gnulib/lib/iconv_open-aix.gperf 
> -rw-rw-r-- 1 g g 1.8k Jan 31 18:24 gnulib/lib/iconv_open-aix.gperf
> 
> which came from a gnulib update to another module (all that happened
> was that copyright years changed from 2023 to 2024).
> 
> Gnulib documents that "gperf" is a required tool for using the "iconv_open"
> module.  It's not especially easy, I find, to find why a particular gnulib
> module was brought in, but looking at the files under the "modules" directory
> of a gnulib checkout, I found the chain of dependencies
> 
> uniconv/u8-strconv-from-enc -> uniconv/u8-conv-from-enc -> striconveha
>   -> striconveh -> iconv_open
> 
> (Of course, there could be other dependency chains that also brought this module
> in.)
> 
> Short of extirpating this dependency, the only solution appears to
> be to require anyone building from git to have gperf installed, which
> doesn't seem like a good situation, as it was never required before.
> 
> I don't know if uniconv/u8-conv-from-enc is a necessary module.  It's
> not easy to find out how the module is used as the documentation is
> lacking, but it appears to match libunistring.  The documentation is
> here:
> https://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html
> 
> I found uses of "u8_strconv_from_encoding" throughout the XS code,
> although most of the uses (I didn't check them all) have "UTF-8" as one
> of the arguments, making it appear that we are converting from UTF-8
> to UTF-8.

Should we ask the Gnulib folks to help us out?



Re: index sorting in texi2any in C issue with spaces

2024-02-04 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 4 Feb 2024 15:58:28 +
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> 
> On Fri, Feb 02, 2024 at 08:57:01AM +0200, Eli Zaretskii wrote:
> > > An alternative is not to have such a variable but just to have an option
> > > to collate according to the user's locale.  Then the user would run e.g.
> > > "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the
> > > ll_LL.UTF-8 locale.  They would have to have the locale installed
> > > that was appropriate for whichever manual they were processing
> > > (assuming the "variable weighting" option is appropriate.)
> > 
> > What would be the default then, though?  AFAIR, we decided by default
> > to use en_US.utf-8 for collation, with the purpose of making the
> > sorting locale-independent by default, so that Info manuals produced
> > with the default settings are identical regardless of the user's
> > locale.
> 
> I agree that sorting should be locale-independent by default.

That's definitely ideal.



Re: index sorting in texi2any in C issue with spaces

2024-02-04 Thread Eli Zaretskii
> Date: Sun, 4 Feb 2024 11:42:52 +0100
> From: pertu...@free.fr
> Cc: Gavin Smith , bug-texinfo@gnu.org
> 
> On Fri, Feb 02, 2024 at 08:57:01AM +0200, Eli Zaretskii wrote:
> > I think en_US.utf-8 is (or at least can be by default) a combination
> > of @documentlanguage and @documentencoding.
> 
> I try to make the index collation as independent as possible of
> @documentencoding and output encoding.  Here the utf-8 is meant to
> provide a sorting 'independent' of the encoding.

Why is that a good idea?  Presumably, a manual whose language is
provided by @documentlanguage is indeed written in that language, and
so the collation should be according to that language?  Or what am I
missing?

If we want collation which uses only codepoints, disregarding any
collation weights defined by the Unicode TR10, we could use
en_US.utf-8, but then, as Gavin says, using glibc collation function
you get more than you asked, because weights are not ignored.  So we
need to use something else in the C variant of collation code, AFAIU.

> Regarding the language, for now the aim was to have something as
> similar as possible to the Perl output, which is obtained without a
> locale.  The choice of en_US was motivated by that aim.  I looked at
> the /usr/lib/locale/*/LC_COLLATE files on my Debian GNU/Linux and there
> was no "en.utf-8", which would have been my first choice, so I used
> "en_US.utf-8".

I don't know enough about what Perl does in the module you are using.
"Obtained without a locale" means what exactly? a collation order that
only considers the Unicode codepoints of the characters?  Or does it
mean something else?  If it only considers the codepoints, then
collation in C using glibc functions will NOT produce the same order
even under en_US.utf-8, AFAIU.



Re: index sorting in texi2any in C issue with spaces

2024-02-01 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Thu, 1 Feb 2024 22:16:07 +
> Cc: Patrice Dumas , bug-texinfo@gnu.org
> 
> On Thu, Feb 01, 2024 at 09:01:42AM +0200, Eli Zaretskii wrote:
> > > Date: Wed, 31 Jan 2024 23:11:02 +0100
> > > From: Patrice Dumas 
> > > 
> > > > Moreover, en_US.utf-8 will use collation appropriate for (US) English.
> > > > There may be language-specific "tailoring" for other languages (e.g.
> > > > Swedish) that the user may wish to use instead.  Hence, it may be
> > > > a good idea to allow use of a user-specified locale for collation 
> > > > through
> > > > the C code.
> > > 
> > > That would not be difficult to implement as a customization variable.
> > > What about COLLATION_LANGUAGE?
> > 
> > What would be the possible values of this variable, and in what format
> > will those values be specified?
> 
> I imagine it would be a locale name for passing to newlocale and thence
> to strxfrm_l.  What Patrice implemented hardcoded the name "en_US.utf-8",
> but this would be a possible value.

I think en_US.utf-8 is (or at least can be by default) a combination
of @documentlanguage and @documentencoding.

> (If there are locale names on MS-Windows that are different, it would
> be fine to support them the same way, only the invocation of texi2any
> would vary to use a different locale name.)

Yes, we will need to come up with something like that.  (And yes, the
names of locales on Windows are different, and can also take several
different formats.  For example, the equivalent of en_US can be either
"English_United States" or "en-US" [with a dash, not underscore], and
there's also a numerical locale ID -- e.g. 0x409 for en_US.)

> An alternative is not to have such a variable but just to have an option
> to collate according to the user's locale.  Then the user would run e.g.
> "LC_COLLATE=ll_LL.UTF-8 texi2any ..." to use collation from the ll_LL.UTF-8
> locale.  They would have to have the locale installed that was appropriate
> for whichever manual they were processing (assuming the "variable weighting"
> option is appropriate.)

What would be the default then, though?  AFAIR, we decided by default
to use en_US.utf-8 for collation, with the purpose of making the
sorting locale-independent by default, so that Info manuals produced
with the default settings are identical regardless of the user's
locale.

> It is probably not justified to provide an interface to the flags of
> CompareStringW on MS-Windows if we can't provide the same functionality
> with strcoll/strxfrm/strxfrm_l.

Agreed.  I mentioned that only for completeness, and as an
illustration of the fact that the APIs for controlling this stuff are
extremely platform-dependent, although the underlying ideas and
algorithms are the same.

> It seems not very important to provide more of these collation options
> for indices as it is not something users are complaining about.

Right.



Re: index sorting in texi2any in C issue with spaces

2024-01-31 Thread Eli Zaretskii
> Date: Wed, 31 Jan 2024 23:11:02 +0100
> From: Patrice Dumas 
> 
> > Moreover, en_US.utf-8 will use collation appropriate for (US) English.
> > There may be language-specific "tailoring" for other languages (e.g.
> > Swedish) that the user may wish to use instead.  Hence, it may be
> > a good idea to allow use of a user-specified locale for collation through
> > the C code.
> 
> That would not be difficult to implement as a customization variable.
> What about COLLATION_LANGUAGE?

What would be the possible values of this variable, and in what format
will those values be specified?



Re: index sorting in texi2any in C issue with spaces

2024-01-31 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 31 Jan 2024 20:10:56 +
> 
> It seems like a pretty obscure interface.  It is barely
> documented - newlocale is in the Linux man pages but not the
> glibc manual, and strxfrm_l is only in the POSIX standard
> (https://pubs.opengroup.org/onlinepubs/9699919799/functions/strxfrm.html).
> I don't know of any other way of accessing the collation functionality.
> 
> Do you know how portable it is?

AFAIK, this is glibc-specific.

In general, the implementations of Unicode TR10 differ among
platforms, with glibc offering the most complete and compatible
implementation and the CLDR DB to support it (what you discovered in
/usr/share/i18n/locales on your system).  MS-Windows has a similar,
but different in effect, functionality, see

  
https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringw

It supports various flags, described here:

  https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringex

that affect the handling of collation weights.  For example, the
NORM_IGNORESYMBOLS flag will have an effect similar to what Patrice
found: spaces (and other punctuation characters) are ignored when
sorting.

CompareStringW accepts "wide" strings, i.e. a string should be
converted to UTF-16 encoding before calling it.  There's a similar
CompareStringA, which accepts 'char *' strings, but it can only
compare strings whose characters are all representable in the current
system locale's codeset; if we want to have all text represented
internally in UTF-8, we should probably convert UTF-8 to UTF-16 and
use CompareStringW.

I don't know about *BSD and other platforms, but wouldn't be surprised
if they offered something of their own, still different from glibc
and/or strict TR10/CLDR compliance.

> Moreover, en_US.utf-8 will use collation appropriate for (US) English.
> There may be language-specific "tailoring" for other languages (e.g.
> Swedish) that the user may wish to use instead.  Hence, it may be
> a good idea to allow use of a user-specified locale for collation through
> the C code.

Probably.  Note that CompareStringW gives the caller a finer control:
they can tailor the handling of different weight categories, beyond
setting the locale for which the collation is needed.  Also, the
locale argument is defined differently for CompareStringW than via the
Posix-style setlocale or similar APIs (but that's something for the
implementation to figure out).

> I found some locale definition files on my system under
> /usr/share/i18n/locales (a location mentioned in the man page of the
> "locale" command), and there is a file iso14651_t1_common which appears
> to be based on the Unicode Collation tables.  I have only skimmed this
> file and don't understand the file format well (it's supposed to be
> documented in the output of "man 5 locale"), but it is really part of
> glibc internals.
> 
> In that file, space has a line
> 
>  IGNORE;IGNORE;IGNORE; % SPACE
> 
> which appears to define space as a fourth-level collation element,
> corresponding to the Shifted option at the link above:
> 
>   "Shifted: Variable collation elements are reset to zero at levels one
>   through three. In addition, a new fourth-level weight is appended..."
> 
> In the Default Unicode Collation Element Table (DUCET), space has the line
> 
> 0020  ; [*0209.0020.0002] # SPACE
> 
> with the "*" character denoting it as a "variable" collation element.
> 
> I expect it would require creating a glibc locale to change the collation
> order, which is not something we can do.

I think if we want to ponder these aspects we should talk to the glibc
developers about the available options.



Re: makeinfo 7.1 misses menu errors

2024-01-19 Thread Eli Zaretskii
> Date: Fri, 19 Jan 2024 16:30:33 -0700
> From: Karl Berry 
> 
> Hi Gavin,
> 
> The problem as I remember it was that the error messages are awful:
> 
> No argument, but having any message at all is infinitely better than
> silence. I urge you to restore them by default, suboptimal as they are.
> 
> It's true that those msgs as such have never made a great deal of sense
> to me (including in the old C makeinfo). But they indicate perfectly
> well "there is a problem with the sectioning+menus related to node XYZ".
> It was not hard to figure it out once I knew that. I had no clue there
> was a problem until someone using makeinfo 6.x told me.

I agree.  Perhaps by default makeinfo should just display a general
warning about "some problem with sectioning vs menus", with a pointer
to the offending @menu command, and the warning text should advise to
use "-c CHECK_NORMAL_MENU_STRUCTURE=1" to get the details.  WDYT?



Re: makeinfo does not produce first output file when multiple files passed

2024-01-18 Thread Eli Zaretskii
> From: No Wayman 
> Date: Thu, 18 Jan 2024 10:52:17 -0500
> 
> 
> makeinfo --version: texi2any (GNU texinfo) 7.1
> Run on Arch Linux
> 
> Reproduction steps:
> 
> 1. Clone the "emacs-eat" repository:
> 
> $ cd /tmp/
> $ git clone https://codeberg.org/akib/emacs-eat.git
> 
> 2. Within the repository, attempt to build 3 info files from the 3 
> texi files in the repository:
> 
> $ cd ./emacs-eat
> $ makeinfo fdl.texi -o fdl.info gpl.texi -o gpl.info eat.texi -o eat.info
> 
> This results in the last two info files being created (i.e. gpl.info
> and eat.info in this example).  The first file is not created,
> regardless of the order in which they are passed.  Passing a "null"
> first argument results in all three files being generated:
> 
> $ makeinfo /dev/null fdl.texi -o fdl.info gpl.texi -o gpl.info eat.texi -o eat.info
> 
> Is this a misunderstanding on my part, or should all three files be
> generated with the first makeinfo command in the reproduction case?

I'm not sure what exactly is going on, but you _are_ making a mistake:
the files fdl.texi and gpl.texi are supposed to be @include'd by other
Texinfo files, not processed separately.  There's a comment at the
beginning of each one of them saying that.  Why are you processing
them as separate Texinfo documents?



Re: makeinfo 7.1 misses menu errors

2024-01-17 Thread Eli Zaretskii
> Date: Wed, 17 Jan 2024 14:55:33 -0700
> From: Karl Berry 
> 
> I recently learned that some @menu vs. sectioning discrepancies in the
> automake manual were found with makeinfo 6.7, but not 7.1. 
> 
> In essence, I moved a subsection (Errors with distclean) from one
> section to another, but forgot to remove the menu entry from the old one.
> (Surely not an uncommon error.)
> 
> Running the attached on 7.1, there are no errors or warnings.
> 6.7 correctly reports various problems resulting from this:

I believe this is an intentional feature in recent Texinfo versions.
To get the warnings back, you need to run makeinfo with the
command-line option "-c CHECK_NORMAL_MENU_STRUCTURE=1".



Re: "make distclean" does not bring back build tree to previous state

2023-12-13 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 12 Dec 2023 20:21:28 +
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Dec 10, 2023 at 04:00:56PM +0100, Preuße, Hilmar wrote:
> > Hello,
> > 
> > I got a report telling that "make distclean" does not bring back the build
> > tree into the original state. After running a build (configure line below)
> > and calling "make distclean", we have a few differences. Some files were
> > deleted, some files were re-generated and hence show a different time stamp
> > inside the file:
> 
> We'll probably have to investigate each of these changes separately to
> see where they occur.  I am probably not going to fix them all in one
> go.
> 
> First, could you confirm which version of Texinfo you got these
> results with?
> 
> I tested with Texinfo 7.1, ran configure with the same configure line
> as you, then ran "make distclean".

Let me point out that "make distclean" is NOT supposed to revert the
tree to a clean state as far as Git is concerned.  "make distclean" is
supposed to remove any files built or modified as part of building a
release tarball, and release tarballs can legitimately include files
that are not versioned, and therefore are not managed by Git.  So
looking at the results of "git status" is not the correct way of
finding files that "make distclean" is supposed to remove.  Instead,
one should:

  . create a release tarball
  . unpack and build the release tarball in a separate directory
  . run "make distclean" in the directory where the tarball was built
  . unpack the release tarball in another directory, and then compare
that other directory with the one where you ran "make distclean"

If a Makefile target is required that should remove all non-versioned
files, that should be a separate target, likely "maintainer-clean" or
somesuch.



Re: Texinfo.tex, problem with too-long table inside @float

2023-12-03 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 3 Dec 2023 13:57:08 +
> Cc: bug-texinfo@gnu.org
> 
> The solution that occurs to me is to recognise a third argument for
> @float.  @float was introduced in Texinfo 4.7, in 2004.  From NEWS:
> 
>   . new commands @float, @caption, @shortcaption, @listoffloats for
> initial implementation of floating material (figures, tables, etc).
> Ironically, they do not yet actually float anywhere.
> 
> The comments in texinfo.tex state:
> 
> % @float FLOATTYPE,LABEL,LOC ... @end float for displayed figures, tables,
> % etc.  We don't actually implement floating yet, we always include the
> % float "here".  But it seemed the best name for the future.
> 
> [...]
> 
> % #1 is the optional FLOATTYPE, the text label for this float, typically
> % "Figure", "Table", "Example", etc.  Can't contain commas.  If omitted,
> % this float will not be numbered and cannot be referred to.
> %
> % #2 is the optional xref label.  Also must be present for the float to
> % be referable.
> %
> % #3 is the optional positioning argument; for now, it is ignored.  It
> % will somehow specify the positions allowed to float to (here, top, bottom).
> %
> 
> (I can't find any discussions from the time about the new feature
> in the mailing list archives.) 

Maybe Karl (CC'ed) could comment on this?



Re: CC and CFLAGS are ignored by part of the build

2023-11-14 Thread Eli Zaretskii
> From: Bruno Haible 
> Date: Tue, 14 Nov 2023 04:23:58 +0100
> 
> Apparently some optimization options were still in effect. And indeed,
> the file tp/Texinfo/XS/config.status contains these lines:
> 
> CC='sparc64-linux-gnu-gcc'
> compiler='sparc64-linux-gnu-gcc'
> LTCC='sparc64-linux-gnu-gcc'
> compiler='sparc64-linux-gnu-gcc'
> S["CPP"]="sparc64-linux-gnu-gcc -E"
> S["ac_ct_CC"]="sparc64-linux-gnu-gcc"
> S["CC"]="sparc64-linux-gnu-gcc"
> S["PERL_CONF_cc"]="sparc64-linux-gnu-gcc"
> S["PERL_CONF_optimize"]="-O2 -g"
> 
> Per the GNU Coding Standards [1], when I specify CC and CFLAGS, it should
> override the package's defaults.
> 
> I understand that perl comes with its own installation and that building
> code that can be dynamically loaded by perl can be challenging. But the
> CC and CFLAGS values that I have specified are ABI-compatible with
> the ones that perl wants. Therefore I expect them to be obeyed.

AFAIU, that's impossible in general, because CFLAGS could include
flags that cannot be applied to both CC and PERL_CONF_cc due to
compatibility issues, since Perl could have been built using a very
different compiler.

IMNSHO, it isn't a catastrophe that compiling Perl extensions needs a
separate C flags variable.  It is basically similar to CFLAGS and
CXXFLAGS being separate for building the same project (which happens
in practice, for example, in GDB, which is part C and part C++).  And
if the GCS doesn't cater for these (relatively rare and specialized)
situations, then I think the GCS needs to be amended.  There's no need
to be dogmatic about this.



Re: c32width gives incorrect return values in C locale

2023-11-11 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: bug-libunistr...@gnu.org
> Date: Sat, 11 Nov 2023 23:54:52 +0100
> 
> [CCing bug-libunistring]
> Gavin Smith wrote:
> > I did not understand why uc_width was said to be "locale dependent":
> > 
> >   "These functions are locale dependent."
> > 
> > - from 
> > .
> 
> That's because some Unicode characters have "ambiguous width" — width 1 in
> Western locales, width 2 in East Asian locales (for historical and font choice
> reasons).

I think this should be explained in the documentation, if it isn't
already.  This "ambiguous width" issue is very subtle and unknown to
many (most?) people, so not having it explicit in the documentation is
not user-friendly, IMO.

> > I also don't understand the purpose of the "encoding" argument -- can this
> > always be "UTF-8"?
> 
> Yes, it can be always "UTF-8"; then uc_width will always choose width 1 for
> these characters.

Regardless of the locale?  Is there an assumption that UTF-8 means
"not CJK" or something?

> > I'm also unclear on the exact relationship between the types char32_t,
> > ucs4_t and uint32_t.  For example, uc_width takes a ucs4_t argument
> > but u8_mbtouc writes to a char32_t variable.  In the code I committed,
> > I used a cast to ucs4_t when calling uc_width.
> 
> These types are all identical. Therefore you don't even need to cast.
> 
>   - char32_t comes from  (ISO C 11 or newer).
>   - ucs4_t comes from GNU libunistring.
>   - uint32_t comes from .

AFAIU, char32_t is identical to uint_least32_t (which is also from
stdint.h).



Re: Locale-independent paragraph formatting

2023-11-10 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Fri, 10 Nov 2023 19:48:04 +
> Cc: Bruno Haible , bug-texinfo@gnu.org
> 
> On Fri, Nov 10, 2023 at 08:47:10AM +0200, Eli Zaretskii wrote:
> > > Does anybody know if we could just write 'a' instead of U'a' and rely
> > > on it being converted?
> > > 
> > > E.g. if you do
> > > 
> > > char32_t c = 'a';
> > > 
> > > then afterwards, c should be equal to 97 (ASCII value of 'a').
> > 
> > Why not?  What could be the problems with using this?
> 
> I think what was confusing me was the statement that char32_t held a UTF-32
> encoded Unicode character.  I then thought it would have a certain byte
> order, so if the UTF-32 was big endian, the bytes would have the order
> 00 00 00 61, whereas the value 97 on a little endian machine would have
> the order 61 00 00 00.  However, it seems that UTF-32 just means the
> codepoint is encoded as a 32-bit integer, and the endianness of the
> UTF-32 sequence can be assumed to match the endianness of the machine.
> The standard C integer conversions can be assumed to work when assigning
> to/from char32_t because it is just an integer type, I assume.

AFAIU, since a codepoint in UTF-32 is just one UTF-32 unit, the issue
of endianness doesn't apply.  Endianness in UTF encodings applies only
if a codepoint takes more than one unit, since the endianness is
between units, not within units themselves (where it always follows
the machine).



Re: Locale-independent paragraph formatting

2023-11-09 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Thu, 9 Nov 2023 21:26:11 +
> 
> I have just pushed a commit (e3a28cc9bf) to use gnulib/libunistring
> functions instead of the locale-dependent functions mbrtowc and wcwidth.
> This allows for a significant simplification as we do not have to try
> to switch to a UTF-8 encoded locale.
> 
> I was not sure about how to put a char32_t literal in the source code.
> For example, where we previously had L'a' as a literal wchar_t letter 'a',
> I changed this to U'a'.  I could not find very much information about this
> online or whether this would be widely supported by C compilers.  The U prefix
> for char32_t is mentioned in a copy of the C11 standard I found online and
> also in a C23 draft.

I have MinGW GCC 9.2 here, and it supports U'a'.  But I don't think we
need it, we could just use 'a' instead, see below.

OTOH, the char32_t type is not supported by this GCC, not even if I
use -std=gnu2x.  Maybe we should use uint_least32_t instead?

> Does anybody know if we could just write 'a' instead of U'a' and rely
> on it being converted?
> 
> E.g. if you do
> 
> char32_t c = 'a';
> 
> then afterwards, c should be equal to 97 (ASCII value of 'a').

Why not?  What could be the problems with using this?



Re: Code from installed libtexinfo.so.0 run for non-installed texi2any

2023-11-06 Thread Eli Zaretskii
> Date: Mon, 6 Nov 2023 14:25:20 +0100
> From: pertu...@free.fr
> Cc: gavinsmith0...@gmail.com, bug-texinfo@gnu.org
> 
> > Do these two replace the several *XS shared libraries we had until
> > Texinfo 7.1, or are they in addition to them?
> 
> There are new *XS shared libraries in addition to those in 7.1,
> StructuringTransfo.la, TranslationsXS.la (which will most likely be
> merged in another one), ConvertXS.la.  The two libraries libtexinfoxs
> and libtexinfo contain the code common for those new *XS shared
> libraries and also code common with Parsetexi.la, which is an XS shared
> library existing in 7.1.
> 
> > In any case, it sounds like these libraries should be installed where
> > we were installing the *XS shared libraries till now.
> 
> It is pkglibdir.  Would be easy to change Makefile.am to put them there,
> but are we sure that the linker will find them when the dlopened *XS
> files are loaded by perl?

I don't know enough about search for shared libraries on Posix
systems, but at least on Windows the linker looks first in the
directory from which the calling shared library was loaded, so it
should work.  Loading by an absolute file name should also work, I
think, and is probably more reliable.



Re: Code from installed libtexinfo.so.0 run for non-installed texi2any

2023-11-06 Thread Eli Zaretskii
> Date: Mon, 6 Nov 2023 09:20:37 +0100
> From: pertu...@free.fr
> Cc: Gavin Smith , bug-texinfo@gnu.org
> 
> On Sun, Nov 05, 2023 at 09:59:44PM +0200, Eli Zaretskii wrote:
> > 
> > I don't have any libtexinfo shared library here, and I don't see one
> > being built, let alone installed, as part of Texinfo.  is this
> > something new in the development sources?  If so, what code is linked
> > into libtexinfo?
> 
> Yes, it is new.  In Texinfo we use a lot XS objects, which are C code
> with a specific interface that allows them to be loaded (dlopen'ed) by
> perl to replace pure perl functions by C functions.  This allows to use
> perl as a high level language, and C for speed.
> 
> libtexinfo corresponds to the 'pure' C common code that performs the
> computations needed for texi2any, working on C data only (no direct use
> of any perl data).  It is used by many XS objects, it is an internal
> library to be used, for now, only by those XS objects.
> 
> There is another new library, libtexinfoxs, for the 'perl C' common code
> used by those XS objects, that does the interface between C data and
> perl data.  This code is even more tied to the XS objects.  The two
> libraries are separate to clearly separate the code that does the
> computations (libtexinfo), that is not related to perl at all and the
> code used to interface C data and perl (libtexinfoxs).

Do these two replace the several *XS shared libraries we had until
Texinfo 7.1, or are they in addition to them?

In any case, it sounds like these libraries should be installed where
we were installing the *XS shared libraries till now.



Re: Code from installed libtexinfo.so.0 run for non-installed texi2any

2023-11-05 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 5 Nov 2023 18:12:45 +
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> 
> So you know what a dynamically loaded library is; this contains a collection
> of functions and potentially data structures that can be loaded by running
> code and run as part of a computer program.
> 
> Usually, when such a library is installed on a system, this is for use
> generally by any program.  For example, if there is a library file
> libz.so.1, this could be linked by passing the -lz flag to the C compiler
> when building the program.  The program would be able to call functions
> in the library and so on.
> 
> The program using this library would likely be written by a different
> person, and as part of a different project, to the persons and projects
> responsible for the creation of the library.  There is an assumption that
> the library has a stable interface, and the library and programs using
> the library are worked on completely independently.
> 
> The dynamically loaded libraries used by texi2any (XS modules) are
> completely different.  Technically, they are loaded in the same way,
> by the running Perl interpreter.  But they are an integral part of the
> texi2any program.  They are intended for the use of the texi2any program
> only, not any other.

The XS modules are installed in a directory which is usually not
looked into by the dynamic linker.  Is that what you are talking
about?  If so, we have been using "non-public libraries" since long
ago, no?  Or what am I missing?

> The file was being installed under /usr/local/lib/libtexinfo.so.1, as
> if to imply that a user could link it against their programs with -ltexinfo,
> or load it with dlopen, which would be completely inappropriate.

I don't have any libtexinfo shared library here, and I don't see one
being built, let alone installed, as part of Texinfo.  Is this
something new in the development sources?  If so, what code is linked
into libtexinfo?



Re: Code from installed libtexinfo.so.0 run for non-installed texi2any

2023-11-05 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 5 Nov 2023 17:04:47 +
> 
> > Maybe one day libtexinfo could be a public library, but not for now
> > and libtexinfoxs should probably never ever be a public library.
> 
> I agree neither of them should be a public library now.

Can someone please explain what "not being a public library" means
when we talk about shared libraries?  I don't think I'm familiar
with this notion.



Re: Texinfo 7.1 released

2023-10-25 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Mon, 23 Oct 2023 19:52:49 +0100
> Cc: bug-texinfo@gnu.org
> 
> I propose the following, more finished patch, which applies
> to Texinfo 7.1.  We can also do something similar for the master branch.

Unfortunately, this change doesn't work on MS-Windows:

  libtool: compile:  d:/usr/bin/gcc.exe -DHAVE_CONFIG_H -I. -I. -I./gnulib/lib 
-I./gnulib/lib -DDATADIR=\"d:/usr/share\" -Id:/usr/include -s -O2 -DWIN32 
-DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT 
-DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields -s 
-O2 -DVERSION=\"0\" -DXS_VERSION=\"0\" -ID:/usr/Perl/lib/CORE -MT xspara.lo -MD 
-MP -MF .deps/xspara.Tpo -c xspara.c  -DDLL_EXPORT -DPIC -o .libs/xspara.o
  xspara.c: In function 'xspara__add_next':
  xspara.c:757:39: warning: passing argument 1 of 'get_utf8_codepoint' from 
incompatible pointer type [-Wincompatible-pointer-types]
    757 |   get_utf8_codepoint (&state.last_letter, p, len);
        |                       ^~~~~~~~~~~~~~~~~~
        |                       |
        |                       rpl_wint_t * {aka unsigned int *}
  xspara.c:689:30: note: expected 'wchar_t *' {aka 'short unsigned int *'} but 
argument is of type 'rpl_wint_t *' {aka 'unsigned int *'}
    689 | get_utf8_codepoint (wchar_t *pwc, const char *mbs, size_t n)
        |                     ~~~~~~~~~^~~

The warning is real: wchar_t is a 16-bit data type on MS-Windows,
whereas the code assumes it's of the same width as wint_t.

I changed the offending code to say this instead:

  if (!strchr (end_sentence_characters
   after_punctuation_characters, *p))
{
  wchar_t wc;
  get_utf8_codepoint (&wc, p, len);
  state.last_letter = wc;
}

and then it compiled cleanly.



Re: Texinfo 7.1 released

2023-10-23 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 22 Oct 2023 21:01:54 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 22, 2023 at 10:05:16PM +0300, Eli Zaretskii wrote:
> > > This patch, applied to 7.1, removes the recently added dTHX calls,
> > > but also removes the fprintf calls that were preventing compilation
> > > without it:
> > 
> > It doesn't help: 1:20.7 instead of 1:21.2.
> 
> I'm running out of ideas.  Have you tried timing it with a smaller input
> file (e.g. doc/info-stnd.texi)?  That could detect whether the slowdown
> depends on the size of the input, or if it is a single slowdown to do
> with initialisation/shutdown.

The times seem to be roughly proportional to the size of the generated
Info file, yes.

> Another change is that xspara.c uses btowc now.  I hardly see how it makes
> a difference, but here is something to try:
> 
> diff xspara.c{.old,} -u
> --- xspara.c.old2023-10-22 20:59:03.801498451 +0100
> +++ xspara.c2023-10-22 20:59:29.189031067 +0100
> @@ -730,7 +730,7 @@
>if (!strchr (end_sentence_characters
> after_punctuation_characters, *p))
>  {
> -  if (!PRINTABLE_ASCII(*p))
> +  if (1 || !PRINTABLE_ASCII(*p))
>  {
>wchar_t wc = L'\0';
>mbrtowc (&wc, p, len, NULL);
> @@ -1013,7 +1013,7 @@
>  }
>  
>/** Not a white space character. */
> -  if (!PRINTABLE_ASCII(*p))
> +  if (1 || !PRINTABLE_ASCII(*p))
>  {
>char_len = mbrtowc (&wc, p, len, NULL);
>  }
> 
> This means that all calls go via the MinGW-specific mbrtowc implementation
> in xspara.c.

Bingo.  This brings the time for producing the ELisp manual down to
15.4 sec, 5 sec faster than v7.0.3.

I see that btowc linked into the XSParagraph module is a MinGW
specific implementation, not from the Windows-standard MSVCRT (where
it is absent).  My conclusion is that the MinGW btowc is extremely
inefficient.



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 22 Oct 2023 19:35:11 +0100
> Cc: bug-texinfo@gnu.org
> 
> One thing to try would to eliminate dTHX calls.  If these are
> time-consuming on MinGW/MS-Windows, then extra calls will greatly slow
> down the program, due to the number of times the paragraph formatting
> functions are called.
> 
> This patch, applied to 7.1, removes the recently added dTHX calls,
> but also removes the fprintf calls that were preventing compilation
> without it:

It doesn't help: 1:20.7 instead of 1:21.2.

> I have looked for other differences in xspara.c between Texinfo 7.0.3
> and Texinfo 7.1 and cannot really see anything suspicious.

XSParagraph includes other source files in addition to xspara.c --
could the changes in those other files be the cause?

> The only other thing that comes to mind that there could have been a
> change in imported gnulib modules.
> 
> Failing that, the only I idea I have is to use some kind of source-level
> profiler to find out why so much time is spent in this module.

Hmm...



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 22 Oct 2023 18:41:34 +0100
> Cc: bug-texinfo@gnu.org
> 
> > > Surprise: running with TEXINFO_XS=omit _reduces_ the elapsed time of
> > > producing the Emacs ELisp manual from 1:21.16 to 0:36.97.
> > 
> > Another data point: running with TEXINFO_XS_PARSER=0 takes 1:34.4 min,
> > so it sounds like the slowdown is due to some XS module other than the
> > parser module.  Is there a way to disable additional modules one by
> > one?
> 
> This is most surprising, but promising that we're getting close to the
> problem.
> 
> The simplest way to disable XS modules would be to delete or rename
> the libtool files that are used for loading them.  If you run with
> TEXINFO_XS=debug, you can see which modules are loaded.  With Texinfo 7.1,
> lines like the following would be printed:
> 
> found ../tp/../tp/Texinfo/XS/Parsetexi.la
> found ../tp/../tp/Texinfo/XS/MiscXS.la
> found ../tp/../tp/Texinfo/XS/XSParagraph.la
> 
> You could then disable modules with e.g.
> 
> mv ../tp/../tp/Texinfo/XS/XSParagraph.la{,.disable}
> 
> or
> 
> mv ../tp/../tp/Texinfo/XS/MiscXS.la{,.disable}

Thanks.  Looks like the slowdown is in XSParagraph: without it, I get
21.8 sec, only slightly slower than Texinfo 7.0.3.  Disabling MiscXS
as well yields almost the same time (0.05 sec longer) as with MiscXS enabled,
and disabling Parsetexi gets us back to 37 sec, the same as with
TEXINFO_XS=omit.

Beyond the fact that XSParagraph seems to be the culprit, I wonder why
MiscXS doesn't speed up the processing.  Is this expected?

Anyway, what's next?



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> Date: Sun, 22 Oct 2023 17:30:15 +0300
> From: Eli Zaretskii 
> Cc: bug-texinfo@gnu.org
> 
> > From: Gavin Smith 
> > Date: Sun, 22 Oct 2023 14:23:53 +0100
> > Cc: bug-texinfo@gnu.org
> > 
> > > > First, check that the Perl extension modules are actually being used.  
> > > > Try
> > > > setting the TEXINFO_XS environment variable to "require" or "debug".
> > > 
> > > I don't need to do that, I already verified that extensions are used
> > > when I worked on the pretests (which, as you might remember, caused
> > > Perl to crash at first).
> > 
> > I'd expected so, just wanted to make sure.
> 
> Surprise: running with TEXINFO_XS=omit _reduces_ the elapsed time of
> producing the Emacs ELisp manual from 1:21.16 to 0:36.97.

Another data point: running with TEXINFO_XS_PARSER=0 takes 1:34.4 min,
so it sounds like the slowdown is due to some XS module other than the
parser module.  Is there a way to disable additional modules one by
one?
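For the record, the switches mentioned in this thread can be combined on the command line when timing runs; the manual name below is a placeholder:

```shell
# Timing sketch -- 'elisp.texi' stands in for any large manual.
time makeinfo elisp.texi                            # XS modules enabled
time env TEXINFO_XS=omit makeinfo elisp.texi        # pure Perl throughout
time env TEXINFO_XS_PARSER=0 makeinfo elisp.texi    # disable only the parser
env TEXINFO_XS=debug makeinfo elisp.texi 2>&1 | grep '\.la'   # modules found
```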



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 22 Oct 2023 14:23:53 +0100
> Cc: bug-texinfo@gnu.org
> 
> > > First, check that the Perl extension modules are actually being used.  Try
> > > setting the TEXINFO_XS environment variable to "require" or "debug".
> > 
> > I don't need to do that, I already verified that extensions are used
> > when I worked on the pretests (which, as you might remember, caused
> > Perl to crash at first).
> 
> I'd expected so, just wanted to make sure.

Surprise: running with TEXINFO_XS=omit _reduces_ the elapsed time of
producing the Emacs ELisp manual from 1:21.16 to 0:36.97.  Disabling
Unicode::Collate on top of that has almost no effect (about 1 sec).

What do you think I should try next?



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 22 Oct 2023 13:35:19 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 22, 2023 at 12:06:21PM +0300, Eli Zaretskii wrote:
> >   . makeinfo is painfully slow.  For example, building the ELisp
> > manual that is part of Emacs takes a whopping 82.3 sec.  By
> > contrast, Texinfo-7.0.3 takes just 20.7 sec.  And this is with
> > Perl extensions being used!  What could explain such a performance
> > regression? perhaps the use of libunistring or some other code
> > that handles non-ASCII characters?
> 
> It could be the use of Unicode collation for sorting document indices.

Can index sorting take more than a minute?

> First, check that the Perl extension modules are actually being used.  Try
> setting the TEXINFO_XS environment variable to "require" or "debug".

I don't need to do that, I already verified that extensions are used
when I worked on the pretests (which, as you might remember, caused
Perl to crash at first).

> Otherwise, the easiest way of turning off the Unicode collation is
> patching the source code:
> 
> --- a/tp/Texinfo/Structuring.pm
> +++ b/tp/Texinfo/Structuring.pm
> @@ -2604,7 +2604,7 @@ sub setup_sortable_index_entries($;$)
>my $collator;
>eval { require Unicode::Collate; Unicode::Collate->import; };
>my $unicode_collate_loading_error = $@;
> -  if ($unicode_collate_loading_error eq '') {
> +  if (0 && $unicode_collate_loading_error eq '') {
>  $collator = Unicode::Collate->new(%collate_options);
>} else {
>  $collator = Texinfo::CollateStub->new();
> 
> This should use the 'cmp' Perl operator instead of the more complicated
> Unicode collation algorithm.

How can I run makeinfo uninstalled, from the texinfo-7.1 source tree?
The version that is currently installed here is v7.0.3, as I must be
able to produce manuals in reasonable times as part of my work on
Emacs and other projects, so I uninstalled 7.1 when I found these
problems.

> (Incidently 20.7 seconds for Texinfo 7.0.3 is still longer than I would
> expect.  On my system the same manual is processed in 5-6 seconds, on
> GNU/Linux on a fairly cheap Acer laptop.)

That is of secondary importance for me at this time.
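For completeness, texi2any can be run straight from the build tree via the generated wrapper script; the layout below assumes a standard out-of-the-box build, and the paths are illustrative:

```shell
# Sketch: run the uninstalled texi2any from an unpacked source tree.
tar xf texinfo-7.1.tar.xz && cd texinfo-7.1
./configure && make
./tp/texi2any --version                  # wrapper sets up @INC for the modules
./tp/texi2any --info /path/to/elisp.texi
```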



Re: Texinfo 7.1 released

2023-10-22 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 18 Oct 2023 15:07:26 +0100
> Cc: bug-texinfo@gnu.org
> 
> We have released version 7.1 of Texinfo, the GNU documentation format.

I'm sorry to say that makeinfo in this new release of Texinfo has
serious problems, when built with MinGW on MS-Windows.  Here are the 2
problems I immediately saw in real-life usage of this version, as soon
as I installed it:

  . makeinfo is painfully slow.  For example, building the ELisp
manual that is part of Emacs takes a whopping 82.3 sec.  By
contrast, Texinfo-7.0.3 takes just 20.7 sec.  And this is with
Perl extensions being used!  What could explain such a performance
regression? perhaps the use of libunistring or some other code
that handles non-ASCII characters?

  . makeinfo seems to ignore @documentencoding, at least in some
places.  Specifically, it consistently produces ASCII equivalents
of some punctuation characters, like quotes “..” and ’, en-dash –,
etc.  Curiously, other punctuation characters, and even the above
ones in some contexts, _are_ produced.  As an example, makeinfo
7.1 produces

 If you don't customize ‘auth-sources’, you'll have to live with the
  defaults: the unencrypted netrc file ‘~/.authinfo’ will be used for any
  host and any port.

where 7.0.3 produced

 If you don’t customize ‘auth-sources’, you’ll have to live with the
  defaults: the unencrypted netrc file ‘~/.authinfo’ will be used for any
  host and any port.

Note how ’ in "don’t" and "you’ll" produced the ASCII ', whereas
‘auth-sources’ and ‘~/.authinfo’ are quoted with non-ASCII quote
characters.  Why this difference?  Texinfo 7.0.3 produces
non-ASCII quotes in both cases.

The above basically means I'm unable to upgrade to 7.1, and will need
to keep using v7.0.3 for the time being.

I'm sorry I didn't try this version on the Emacs docs when it was in
pretest.  To my defense, I never before saw such issues once the test
suite runs successfully.  Any suggestions for debugging the above two
issues will be welcome.



Re: branch master updated: * info/info.c (get_initial_file), * info/infodoc.c (info_get_info_help_node), * info/nodes.c (info_get_node_with_defaults): Use strcmp or strcasecmp instead of mbscasecmp in

2023-10-19 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Thu, 19 Oct 2023 14:10:56 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Thu, Oct 19, 2023 at 03:26:51PM +0300, Eli Zaretskii wrote:
> > > diff --git a/info/info.c b/info/info.c
> > > index 8ca4a17e58..d7a6afaa2c 100644
> > > --- a/info/info.c
> > > +++ b/info/info.c
> > > @@ -250,7 +250,7 @@ get_initial_file (int *argc, char ***argv, char 
> > > **error)
> > >  {
> > >/* If they say info info (or info -O info, etc.), show them 
> > >   info-stnd.texi.  (Get info.texi with info -f info.) */
> > > -  if ((*argv)[0] && mbscasecmp ((*argv)[0], "info") == 0)
> > > +  if ((*argv)[0] && strcmp ((*argv)[0], "info") == 0)
> > >  (*argv)[0] = "info-stnd";
> > 
> > This could produce regressions on case-insensitive filesystems, where
> > we could have INFO.EXE, for example.  Do we really no longer care
> > about those?
> 
> (*argv)[0] here is not the name of the program but what was given on the
> command line.  It should mean that "INFO.EXE info" works as before if
> "INFO.EXE" is the name of the info program, whereas "INFO.EXE INFO" wouldn't.

On MS-DOS and MS-Windows, argv[0] is usually NOT what the user types
on the command line, it's what the OS fills in, and it usually puts
there the full absolute file name of the executable.

> > >/* If the node not found was "Top", try again with different case. */
> > > -  if (!node && (nodename && mbscasecmp (nodename, "Top") == 0))
> > > +  if (!node && (nodename && strcasecmp (nodename, "Top") == 0))
> > 
> > Are there no Info manuals that have "Top" with a different
> > letter-case?
> 
> It is strcasecmp here, not strcmp.  This should support other
> capitalisations, like "TOP" or "ToP".

Right, sorry.  So I think it's good enough.



Re: branch master updated: * info/info.c (get_initial_file), * info/infodoc.c (info_get_info_help_node), * info/nodes.c (info_get_node_with_defaults): Use strcmp or strcasecmp instead of mbscasecmp in

2023-10-19 Thread Eli Zaretskii
> Date: Thu, 19 Oct 2023 08:20:49 -0400
> From: "Gavin D. Smith" 
> 
> +2023-10-19  Gavin Smith 
> +
> + * info/info.c (get_initial_file),
> + * info/infodoc.c (info_get_info_help_node),
> + * info/nodes.c (info_get_node_with_defaults):
> + Use strcmp or strcasecmp instead of mbscasecmp in several
> + cases where we do not care about case-insensitive matching with
> + non-ASCII characters.
> +
>  2023-10-19  Gavin Smith 
>  
>   * tp/maintain/change_perl_modules_version.sh:
> diff --git a/info/info.c b/info/info.c
> index 8ca4a17e58..d7a6afaa2c 100644
> --- a/info/info.c
> +++ b/info/info.c
> @@ -250,7 +250,7 @@ get_initial_file (int *argc, char ***argv, char **error)
>  {
>/* If they say info info (or info -O info, etc.), show them 
>   info-stnd.texi.  (Get info.texi with info -f info.) */
> -  if ((*argv)[0] && mbscasecmp ((*argv)[0], "info") == 0)
> +  if ((*argv)[0] && strcmp ((*argv)[0], "info") == 0)
>  (*argv)[0] = "info-stnd";

This could produce regressions on case-insensitive filesystems, where
we could have INFO.EXE, for example.  Do we really no longer care
about those?

> --- a/info/infodoc.c
> +++ b/info/infodoc.c
> @@ -357,8 +357,7 @@ DECLARE_INFO_COMMAND (info_get_info_help_node, _("Visit 
> Info node '(info)Help'")
>  for (win = windows; win; win = win->next)
>{
>  if (win->node && win->node->fullpath
> -&& !mbscasecmp ("info",
> -filename_non_directory (win->node->fullpath))
> +&& !strcmp (filename_non_directory (win->node->fullpath), "info")
>  && (!strcmp (win->node->nodename, "Help")
>  || !strcmp (win->node->nodename, "Help-Small-Screen")))

Likewise here.

>/* If the node not found was "Top", try again with different case. */
> -  if (!node && (nodename && mbscasecmp (nodename, "Top") == 0))
> +  if (!node && (nodename && strcasecmp (nodename, "Top") == 0))

Are there no Info manuals that have "Top" with a different
letter-case?



Re: MinGW "info" program broken?

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: bug-texinfo@gnu.org
> Date: Sun, 15 Oct 2023 16:07:28 +0200
> 
> Eli Zaretskii wrote:
> > The stand-alone Info reader built with MinGW works
> > flawlessly for me.
> > 
> > > I had understood that "info" was running well on MinGW so it would be 
> > > worth
> > > understanding any differences between yours and Bruno's setup.
> > 
> > I'm indeed curious why this happens with the MSVC build.
> 
> It happens also with the mingw-w64 version 5.0.3 build. Let me investigate...

I guess you somehow trip on this code fragment from pcterm.c:

  /* Print STRING to the terminal at the current position. */
  static void
  pc_put_text (string)
   char *string;
  {
if (speech_friendly)
  fputs (string, stdout);
  #ifdef __MINGW32__
else if (hscreen == INVALID_HANDLE_VALUE)
  fputs (string, stdout);  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
else if (output_cp == CP_UTF8 || output_cp == CP_UTF7)
  write_utf (output_cp, string, -1);
  #endif
else
  cputs (string);
  }

Which probably means the screen handle is somehow invalid?



Re: Texinfo 7.0.94 on native Windows

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: gavinsmith0...@gmail.com, bug-texinfo@gnu.org
> Date: Sun, 15 Oct 2023 16:25:56 +0200
> 
> Eli Zaretskii wrote:
> > _popen accepts a MODE argument which can be used to control that, see
> > 
> >   
> > https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/popen-wpopen?view=msvc-170
> > 
> > We use this in the stand-alone Info reader, for example, in this
> > snippet from info/filesys.c:
> > 
> >   stream = popen (command, FOPEN_RBIN);
> 
> This is good. But there are these two occurrences of popen():
> 
> 1)
> info/man.c:
> fpipe = popen (cmdline, "r");
> 
> Should better use FOPEN_RBIN as well.

No, because any 'man' program on Windows is likely to produce CRLF
EOLs when it writes to stdout.

> 2)
> info/session.c:
> printer_pipe = fopen (++print_command, "w");
>   printer_pipe = popen (print_command, "w");
> 
> Should better use FOPEN_WBIN.

No, because we write to a 'lpr' work-alike, which on Windows should be
able to handle CRLF EOLs.



Re: MinGW "info" program broken?

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: bug-texinfo@gnu.org
> Date: Sun, 15 Oct 2023 16:07:28 +0200
> 
> Eli Zaretskii wrote:
> > The stand-alone Info reader built with MinGW works
> > flawlessly for me.
> > 
> > > I had understood that "info" was running well on MinGW so it would be 
> > > worth
> > > understanding any differences between yours and Bruno's setup.
> > 
> > I'm indeed curious why this happens with the MSVC build.
> 
> It happens also with the mingw-w64 version 5.0.3 build. Let me investigate...

Is this build with UCRT or with MSVCRT?



Re: MinGW "info" program broken?

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Date: Sun, 15 Oct 2023 15:23:45 +0200
> 
> Gavin Smith wrote:
> > I had understood that "info" was running well on MinGW so it would be worth
> > understanding any differences between yours and Bruno's setup.
> 
> I'm usually building with mingw-w64 5.0.3.
> 
> Whereas Eli (AFAIK) often builds with the older mingw from the now-defunct
> mingw.org site. Correct me if I'm wrong, Eli.

You are not wrong, but both flavors should produce a working info.exe,
AFAIK.  We took care of that years ago.



Re: Texinfo 7.0.94 on native Windows

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: gavinsmith0...@gmail.com, bug-texinfo@gnu.org
> Date: Sun, 15 Oct 2023 15:11:33 +0200
> 
> Eli Zaretskii wrote:
> > > For 'popen' and 'pclose', one needs the gnulib modules 'popen' and 
> > > 'pclose',
> > > respectively.
> > 
> > Windows has _popen and _pclose, which can be used instead.
> 
> _popen uses text mode, not binary mode, by default, AFAIK. This can be
> problematic.

_popen accepts a MODE argument which can be used to control that, see

  https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/popen-wpopen?view=msvc-170

We use this in the stand-alone Info reader, for example, in this
snippet from info/filesys.c:

  stream = popen (command, FOPEN_RBIN);



Re: MinGW "info" program broken?

2023-10-15 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 15 Oct 2023 13:17:53 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 15, 2023 at 01:24:32PM +0200, Bruno Haible wrote:
> >   - The behaviour of the 'ginfo' program on MSVC is the same as on mingw,
> > albeit not really useful currently: './info -f texinfo.info' spits out
> > the entire manual to stdout at once. It looks like the device gets set
> > to stdout, or there is no knowledge about the terminal window's height,
> > or something like that.
> 
> Is that also true on the MinGW build you did, Eli?

No, of course not.  The stand-alone Info reader built with MinGW works
flawlessly for me.

> I had understood that "info" was running well on MinGW so it would be worth
> understanding any differences between yours and Bruno's setup.

I'm indeed curious why this happens with the MSVC build.



Re: Texinfo 7.0.94 pretest available

2023-10-15 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 15 Oct 2023 12:57:46 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 15, 2023 at 12:00:47PM +0300, Eli Zaretskii wrote:
> > Thanks.
> > 
> > This doesn't compile with MinGW, because some of the dTHX additions I
> > needed for the previous pretest were not installed(?), and are still
> > missing.  The patch I needed is below.
> 
> I hadn't installed those changes because I hadn't understood why they
> were necessary.  Looking at the code, I expect that it was due to the
> use of fprintf in those functions which the Perl headers must be doing
> something funny with.  Presumably you got an error message when compiling
> indicating this?

I show two such error messages below, I hope it will help you
understand the reason.

> The fprintf calls were added since the Texinfo 7.0 branch so will not
> have broken previously.
> 
> Although adding dTHX should be harmless, the paragraph formatting
> functions are very frequently called functions and adding the dTHX in
> there has a potential performance impact, especially in xspara__add_next.
> 
> However, I could not detect any performance difference in the testing I
> did so I have added them anyway.

Here are the error messages I promised.  Error #1:

 libtool: compile:  d:/usr/bin/gcc.exe -DHAVE_CONFIG_H -I. -I./parsetexi 
-I. -I./gnulib/lib -I./gnulib/lib -DDATADIR=\"d:/usr/share\" -Id:/usr/include 
-s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE 
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv 
-fno-strict-aliasing -mms-bitfields -s -O2 -DVERSION=\"0\" -DXS_VERSION=\"0\" 
-ID:/usr/Perl/lib/CORE -MT parsetexi/Parsetexi_la-api.lo -MD -MP -MF 
parsetexi/.deps/Parsetexi_la-api.Tpo -c parsetexi/api.c  -DDLL_EXPORT -DPIC -o 
parsetexi/.libs/Parsetexi_la-api.o
 In file included from parsetexi/api.c:23:
 parsetexi/api.c: In function 'reset_parser':
 D:/usr/Perl/lib/CORE/perl.h:155:16: error: 'my_perl' undeclared (first use 
in this function)
   155 | #  define aTHX my_perl
   |^~~
 D:/usr/Perl/lib/CORE/embedvar.h:38:18: note: in expansion of macro 'aTHX'
38 | #define vTHX aTHX
   |  ^~~~
 D:/usr/Perl/lib/CORE/embedvar.h:65:20: note: in expansion of macro 'vTHX'
65 | #define PL_StdIO  (vTHX->IStdIO)
   |^~~~
 D:/usr/Perl/lib/CORE/iperlsys.h:207:4: note: in expansion of macro 
'PL_StdIO'
   207 |  (*PL_StdIO->pStderr)(PL_StdIO)
   |^~~~
 D:/usr/Perl/lib/CORE/XSUB.h:511:21: note: in expansion of macro 
'PerlSIO_stderr'

   511 | #define stderr  PerlSIO_stderr
   | ^~
 parsetexi/api.c:174:14: note: in expansion of macro 'stderr'
   174 | fprintf (stderr,
   |  ^~
 D:/usr/Perl/lib/CORE/perl.h:155:16: note: each undeclared identifier is 
reported only once for each function it appears in
   155 | #  define aTHX my_perl
   |^~~
 D:/usr/Perl/lib/CORE/embedvar.h:38:18: note: in expansion of macro 'aTHX'
38 | #define vTHX aTHX
   |  ^~~~
 D:/usr/Perl/lib/CORE/embedvar.h:65:20: note: in expansion of macro 'vTHX'
65 | #define PL_StdIO  (vTHX->IStdIO)
   |^~~~
 D:/usr/Perl/lib/CORE/iperlsys.h:207:4: note: in expansion of macro 
'PL_StdIO'
   207 |  (*PL_StdIO->pStderr)(PL_StdIO)
   |^~~~
 D:/usr/Perl/lib/CORE/XSUB.h:511:21: note: in expansion of macro 
'PerlSIO_stderr'

   511 | #define stderr  PerlSIO_stderr
   | ^~
 parsetexi/api.c:174:14: note: in expansion of macro 'stderr'
   174 | fprintf (stderr,
   |  ^~

Error #2:

 libtool: compile:  d:/usr/bin/gcc.exe -DHAVE_CONFIG_H -I. -I. 
-I./gnulib/lib -I./gnulib/lib -DDATADIR=\"d:/usr/share\" -Id:/usr/include -s 
-O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT 
-DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields -s 
-O2 -DVERSION=\"0\" -DXS_VERSION=\"0\" -ID:/usr/Perl/lib/CORE -MT xspara.lo -MD 
-MP -MF .deps/xspara.Tpo -c xspara.c  -DDLL_EXPORT -DPIC -o .libs/xspara.o
 In file included from xspara.c:39:
 xspara.c: In function 'xspara__print_escaped_spaces':
 D:/usr/Perl/lib/CORE/perl.h:155:16: error: 'my_perl' undeclared (first use 
in this function)
   155 | #  define aTHX my_perl
   |^~~
 D:/usr/Perl/lib/CORE/embedvar.h:38:18: note: in expansion of macro 'aTHX'
38 | #define vTHX aTHX
   |  ^~~~
 D:/usr/Perl/lib/CORE/embedvar.h:58:19: note: in expansion of macro 'vTHX'
58 | #define PL_Mem  

Re: Texinfo 7.0.94 on native Windows

2023-10-15 Thread Eli Zaretskii
> From: Bruno Haible 
> Date: Sun, 15 Oct 2023 13:24:32 +0200
> 
> For 'popen' and 'pclose', one needs the gnulib modules 'popen' and 'pclose',
> respectively.

Windows has _popen and _pclose, which can be used instead.  That's
what MinGW does, AFAIK.

But I'm not sure Texinfo should try supporting an MSVC build.  It's
enough to support MinGW, IMO.



Re: Texinfo 7.0.94 pretest available

2023-10-15 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sat, 14 Oct 2023 14:27:36 +0100
> Cc: platform-test...@gnu.org
> 
> A pretest distribution for the next Texinfo release (7.1) has been
> uploaded to
>  
> https://alpha.gnu.org/gnu/texinfo/texinfo-7.0.94.tar.xz
> 
> There have not been many changes since the previous pretest.  We are
> making this pretest mainly to test build fixes for the MinGW platform.
> We hope to release this as Texinfo 7.1 in a few days' time, unless
> further problems are found.
> 
> Changes since 7.0.93:
> 
> * A bug has been fixed where a few document strings would not be
>   translated in texi2any's output.
> * Fix building Perl XS modules on MinGW/MS-Windows.
> * Fix install-info on the same platform.
> * Tests of texi2any changed to avoid differing results for different
>   implementations of the wcwidth function.
> 
> We make these pretests to help find any problems before we make an official
> release to a larger audience, so that the release will be as good as it
> can be.
> 
> Please send any feedback to .

Thanks.

This doesn't compile with MinGW, because some of the dTHX additions I
needed for the previous pretest were not installed(?), and are still
missing.  The patch I needed is below.

--- ./tp/Texinfo/XS/parsetexi/api.c~0   2023-10-07 19:12:05.0 +0300
+++ ./tp/Texinfo/XS/parsetexi/api.c 2023-10-15 11:30:26.924948100 +0300
@@ -158,6 +158,8 @@ reset_parser_except_conf (void)
 void
 reset_parser (int debug_output)
 {
+  dTHX;
+
   /* NOTE: Do not call 'malloc' or 'free' in this function or in any function
  called in this file.  Since this file (api.c) includes the Perl headers,
  we get the Perl redefinitions, which we do not want, as we don't use
--- ./tp/Texinfo/XS/xspara.c~0  2023-10-11 20:08:06.0 +0300
+++ ./tp/Texinfo/XS/xspara.c2023-10-15 11:29:14.853806700 +0300
@@ -565,6 +565,8 @@ xspara_get_pending (void)
 void
 xspara__add_pending_word (TEXT *result, int add_spaces)
 {
+  dTHX;
+
   if (state.word.end == 0 && !state.invisible_pending_word && !add_spaces)
 return;
 
@@ -640,6 +642,9 @@ char *
 xspara_end (void)
 {
   static TEXT ret;
+
+  dTHX;
+
   text_reset ();
   state.end_line_count = 0;
 
@@ -686,6 +691,8 @@ xspara_end (void)
 void
 xspara__add_next (TEXT *result, char *word, int word_len, int transparent)
 {
+  dTHX;
+
   int disinhibit = 0;
   if (!word)
 return;


After fixing the above, most of the tests pass, but 3 texi2any tests
fail:

 FAIL: test_scripts/layout_formatting_info_ascii_punctuation.sh
 FAIL: test_scripts/layout_formatting_info_disable_encoding.sh
 FAIL: test_scripts/layout_formatting_plaintext_ascii_punctuation.sh

I attach below the logs of these failures.



texi2any-tests-mingw.log.gz
Description: Binary data


Re: library for unicode collation in C for texi2any?

2023-10-14 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sat, 14 Oct 2023 19:57:22 +0100
> 
> It's all in the future, but I am slightly concerned about is duplicating
> in Texinfo existing system facilities.  For example, for avoiding use of
> wcwidth, our use of which depends on setting a UTF-8 locale, and using
> the wchar_t type.  Is every program that uses wcwidth supposed to supply
> their own implementation instead, and isn't this wasteful?

What other locale-specific functions do we need in addition to
wcwidth?

If the list of those functions is short enough, we could replace them
all by the corresponding Gnulib/libunistring functions, and then we
could stop setting specific locales and relying on locale-specific
libc functions.  That will give us locale-independent code which will
work on all systems.

> I don't know if libunistring aspires to become a standard system library
> for handling UTF-8 data but if we use it for other UTF-8 processing it
> would make sense to use it for collation.
> 
> I suggest writing to Bruno Haible to ask if he has plans to include
> collation functionality in libunistring in the future.  I am currently
> reading through "Unicode Technical Standard #10" and although I don't
> understand a lot of it yet, it seems feasible that we could implement it
> in C.

It is feasible, but implementing it from scratch is a lot of work, and
needs a large database (which we could take from the CLDR).  But note
that CLDR is AFAIK locale-dependent; the only part of it that doesn't
depend on the locale is collation by Unicode codepoints.



Re: library for unicode collation in C for texi2any?

2023-10-14 Thread Eli Zaretskii
> Date: Sat, 14 Oct 2023 11:57:02 +0200
> From: Patrice Dumas 
> Cc: bug-texinfo@gnu.org
> 
> On Thu, Oct 12, 2023 at 06:13:34PM +0300, Eli Zaretskii wrote:
> > What you say is not detailed enough, but using my crystal ball I think
> > you can have this with glibc-based systems, and also on Windows (but
> > that requires using a special API for comparing strings).  Not sure
> > about the equivalent features on other systems, like *BSD and macOS.
> > You can see that in action in how GNU 'ls' sorts file names.
> 
> Looks like ls ultimately uses strcoll.  The problem is that it selects
> the current locale, we never want to use the current locale in Texinfo.
> We either want to use a 'generic' locale (which does not really exist
> as far as I can tell) or the @documentlanguage locale.

Yes, I know.  However, if the current locale's codeset is UTF-8, AFAIK
glibc uses the full Unicode CLDR, which is what I wanted to point out.

> There seems to be variants of strcoll and of strxfrm, strcoll_l and 
> strxfrm_l that allow to specify a locale, but it is not very well
> documented (these functions seem to be in the glibc, but are not
> documented, strcoll and strxfrm are), there are no gnulib modules, and I
> am not sure whether with "C" locale these functions really use the
> specified locale.

I don't think we want to depend on the locale in Texinfo.  The problem
is how to find or write an implementation that, on the one hand,
doesn't use locale-dependent collation rules, and on the other hand
ignores punctuation and other "unimportant" characters.



Re: Texinfo 7.0.93 pretest available

2023-10-14 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Fri, 13 Oct 2023 22:14:32 +0100
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> 
> Eli, are you able to test this from git or do you need me to make another
> pretest release?

Git is a bit problematic, as some of the file names include non-ASCII
characters.  For this reason, and also for others (e.g., I have
already made too many changes to 7.0.93 sources), I'd prefer another
pretest.  I think it's also a better way in general, as non-trivial
changes were made since 7.0.93, so it would be prudent to let other
pretesters test the result.

Thanks.



Re: library for unicode collation in C for texi2any?

2023-10-13 Thread Eli Zaretskii
> Date: Fri, 13 Oct 2023 11:31:54 + (UTC)
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> From: Werner LEMBERG 
> 
> >> [...] Neither collation corresponds to Unicode codepoints.
> >
> > That's exactly what we should not do.
> 
> I strongly disagree.
> 
> > People who read German don't necessarily live in Germany, and
> > Texinfo is not a general-purpose system for typesetting documents,
> > it is a system for writing software documentation.
> 
> What you describe is certainly valid for a function index, say.
> However, a concept index – which is an essential part of any
> documentation IMHO – that doesn't sort as expected is at the border of
> being useless.

You are exaggerating, and that doesn't help.  In practice, the
problems are minor, and consistency is much more important.

> > Besides, which German are you talking about?  There are several
> > German-based locales, each one with its own local tailoring.
> 
> It doesn't matter.

If this "doesn't matter", then why do you insist on this?

>   There are zillions of German computer books that
> come with an index, and such books *are* read in all German-speaking
> countries and elsewhere, irrespective of a fine-tuned locale used for
> the exact index order.  *This* part can be easily standardized by
> making Texinfo support exactly one German locale ('de').
> 
> > So consistency in Texinfo is IMNSHO more important than fine-tuning
> > the order to a specific locale and language.
> 
> What good for is this consistency if it is extremely user-unfriendly?

It will be "user-unfriendly" anyway, if we use one flavor of German,
because users in a different locale will not expect that.

> What exactly is the problem if, say, an MS compilation produces a
> slightly different sorting order in the index?  Just add a sentence to
> the build instructions and tell the people what to expect.

You are wrong.  Your POV is skewed.  And that is all I can tell you on
this matter, since it looks like continuing this discussion is not
useful.



Re: library for unicode collation in C for texi2any?

2023-10-13 Thread Eli Zaretskii
> Date: Fri, 13 Oct 2023 07:31:29 + (UTC)
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> From: Werner LEMBERG 
> 
> 
> >> OK, no tailoring.  I wasn't aware of those differences, thanks for
> >> pointing me to it.
> >> 
> >> Hopefully, we agree that `@documentlanguage` should set a
> >> language-specific collation for the index.
> > 
> > Without tailoring, this basically means collation according to
> > Unicode codepoints.
> 
> Uh oh, this is not good.  As an example, consider the letter 'ä'.
> There are two possible collations that are considered as correct for
> German:
> 
> * Sort 'ä' right before 'b'.
> 
> * Handle 'ä' similar to 'ae' but sort it after 'ae'.
> 
> Neither collation corresponds to Unicode codepoints.

That's exactly what we should not do.  People who read German don't
necessarily live in Germany, and Texinfo is not a general-purpose
system for typesetting documents, it is a system for writing software
documentation.  Besides, which German are you talking about?  There
are several German-based locales, each one with its own local
tailoring.  So consistency in Texinfo is IMNSHO more important than
fine-tuning the order to a specific locale and language.



Re: library for unicode collation in C for texi2any?

2023-10-13 Thread Eli Zaretskii
> Date: Fri, 13 Oct 2023 07:08:36 + (UTC)
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> From: Werner LEMBERG 
> 
> 
> >> ... there is probably a misunderstanding on my side.  I don't know
> >> what you mean with 'tailoring', please give an example.
> > 
> > This subject is too large and complicated for me to answer this
> > question here.  So I will refer you to the relevant Unicode spec:
> > 
> >   https://unicode.org/reports/tr10/
> > 
> > Section 8 "Tailoring" there will probably answer your question.
> 
> OK, no tailoring.  I wasn't aware of those differences, thanks for
> pointing me to it.
> 
> Hopefully, we agree that `@documentlanguage` should set a
> language-specific collation for the index.

Without tailoring, this basically means collation according to Unicode
codepoints.



Re: library for unicode collation in C for texi2any?

2023-10-12 Thread Eli Zaretskii
> Date: Thu, 12 Oct 2023 20:30:47 + (UTC)
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> From: Werner LEMBERG 
> 
> >> > I don't recommend to tailor index sorting for the language
> >> > indicated by @documentlanguage, either.
> >> 
> >> This surprises me.  Why not?  For some languages, the alphabetical
> >> order differs enormously from English.
> > 
> > Because indices in a Texinfo document should not depend on details
> > of how the manual was produced.
> 
> Well, if I write a book in German, say, I most definitely want an
> index sorted with a German collation (there is more than a single one,
> BTW).  This collation should be used regardless of the input encoding.
> However, ...
> 
> > And note that I said "tailoring", which is minor adjustments to the
> > general collation, which is based on character Unicode codepoints.
> 
> ... there is probably a misunderstanding on my side.  I don't know
> what you mean with 'tailoring', please give an example.

This subject is too large and complicated for me to answer this
question here.  So I will refer you to the relevant Unicode spec:

  https://unicode.org/reports/tr10/

Section 8 "Tailoring" there will probably answer your question.

The main reason why I think we should not use language-specific
tailoring is that it is implemented differently by different system
libraries, and therefore the manuals produced by using that will be
different depending on what platform they were produced.  And that is
undesirable, IMO, from our POV.  As an example, I suggest to compare
the collation of file names in GNU 'ls', as implemented by glibc
(which basically implements the entire Unicode UTS#10 mentioned above
and uses its CLDR data set, http://unicode.org/cldr/), with the
corresponding MS-Windows API documented here:

  
https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-comparestringex

The results of collation using these disparate implementations are
similar, but not identical.  My point here is that Texinfo should IMO
try to avoid these subtle differences as much as possible.  Using code
that is independent of the current locale is a large step in that
direction, but there are additional smaller steps that we should take
after that, and avoiding too strong dependence on language-specific
collation, as implemented by the underlying libraries, is one of them.



Re: library for unicode collation in C for texi2any?

2023-10-12 Thread Eli Zaretskii
> Date: Thu, 12 Oct 2023 17:12:44 + (UTC)
> Cc: pertu...@free.fr, bug-texinfo@gnu.org
> From: Werner LEMBERG 
> 
> 
> > I don't recommend to tailor index sorting for the language indicated
> > by @documentlanguage, either.
> 
> This surprises me.  Why not?  For some languages, the alphabetical
> order differs enormously from English.

Because indices in a Texinfo document should not depend on details of
how the manual was produced.  And note that I said "tailoring", which
is minor adjustments to the general collation, which is based on
character Unicode codepoints.



Re: library for unicode collation in C for texi2any?

2023-10-12 Thread Eli Zaretskii
> Date: Thu, 12 Oct 2023 15:00:57 +0200
> From: Patrice Dumas 
> Cc: bug-texinfo@gnu.org
> 
> On Thu, Oct 12, 2023 at 01:29:27PM +0300, Eli Zaretskii wrote:
> > What is "smart sorting"?  Where is it described/documented?
> 
> It is, in general, any way to sort Unicode that takes into account
> natural-language word order.  In practice, what is used in
> Unicode::Collate is the 'Unicode Technical Standard #10' Unicode
> Collation Algorithm (a.k.a. UCA) described in
> http://www.unicode.org/reports/tr10.  In texi2any, we set an option of
> collation,
>   ( 'variable' => 'Non-Ignorable' )
> such that spaces and punctuation marks sort before letters.  This
> specific option is described in
> http://www.unicode.org/reports/tr10/#Variable_Weighting
> 
> It would be perfect if the same sorting could be obtained, but if
> C code does not follow exactly the same standard, I do not think
> that it is so problematic, as long as the sorting is sensible.  It could
> actually be problematic for tests, but if the output of texi2any is ok
> even if not fully reproducible, it would still be better than sorting
> according to the Unicode codepoint in a full C implementation.

What you say is not detailed enough, but using my crystal ball I think
you can have this with glibc-based systems, and also on Windows (but
that requires using a special API for comparing strings).  Not sure
about the equivalent features on other systems, like *BSD and macOS.
You can see that in action in how GNU 'ls' sorts file names.

> > In general, Unicode collation rules are locale- and
> > language-dependent.  My recommendation for Texinfo is not to use
> > locale-specific collation rules, so that the indices would come out
> > sorted identically no matter in which locale the user runs texi2any.
> 
> That's the plan.  The plan is to use the @documentlanguage information
> with Unicode::Collate::Locale in the future, but never use the locale.

I don't recommend to tailor index sorting for the language indicated
by @documentlanguage, either.

> This is still a TODO item, though, as Unicode::Collate::Locale is a perl
> core module since perl 5.14 only, released in 2011, so my plan was to
> wait for 2031 to use it and be able to assume that it is indeed present
> the same way we assume that Unicode::Collate is present.

We can have this in C today.



Re: library for unicode collation in C for texi2any?

2023-10-12 Thread Eli Zaretskii
> Date: Thu, 12 Oct 2023 11:39:14 +0200
> From: Patrice Dumas 
> 
> One thing I could not find easily in C is something to replace the
> Unicode::Collate perl module for index entries sorting using 'smart'
> rules for sorting, that could be either found in Gnulib, included easily
> in the Texinfo distribution or would be, in general, installed.  Unless
> I missed something, there is no such facility in libunistring, it seems
> to be in libICU, but I do not know how easy it could be
> integrated/shipped with Texinfo and I do not think that it is installed
> in the general case.
> 
> 
> Do you have information, on how to do 'smart' unicode sorting in
> C, including for tests, which could allow shipping of code as we already
> do with libunistring in gnulib in case it is not already installed, such
> that it is used in the general case?  Could also be example of projects
> that have managed to do that.

What is "smart sorting"?  Where is it described/documented?

In general, Unicode collation rules are locale- and
language-dependent.  My recommendation for Texinfo is not to use
locale-specific collation rules, so that the indices would come out
sorted identically no matter in which locale the user runs texi2any.



Re: Texinfo 7.0.93 pretest available

2023-10-12 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 11 Oct 2023 18:15:04 +0100
> Cc: Patrice Dumas 
> 
> On Wed, Oct 11, 2023 at 06:12:51PM +0100, Gavin Smith wrote:
> > I will send you a diff to try to see if it lets the tests pass, or if
> > we need to make any further changes.
> 
> Attached.

Thanks.  This solves some of the diffs, but not all of them.  In
addition, one test that previously passed now fails
(formatting_documentlanguage_cmdline.sh).  I attach below the
redirected output of all the failed tests, which shows the diffs
against the expected results.



tp-tests-patched-mingw.gz
Description: Binary data


Re: Texinfo 7.0.93 pretest available

2023-10-11 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 10 Oct 2023 20:24:47 +0100
> Cc: br...@clisp.org, bug-texinfo@gnu.org
> 
> On Tue, Oct 10, 2023 at 02:55:09PM +0300, Eli Zaretskii wrote:
> > > If this simple stub is preferable to the Gnulib implementation for
> > > MS-Windows, (e.g. it makes the tests pass) we could re-add it again.
> > 
> > We can do that, but I think we should first explore a better
> > alternative: use UTF-8 functions everywhere, without relying on the
> > locale-aware functions of libc, such as wcwidth.  For example, instead
> > of wcwidth, we could use uc_width.
> 
> Changing away from using wcwidth at this stage is a more significant
> change to be making.  I want to fix this issue in an easy and simple way.
> As far as I am aware these tests passed on MS-Windows with previous
> releases of Texinfo, so doing what we did before seems the simplest fix
> to me.

Then we need to understand why the tests are now failing when they
succeeded previously.

> I'm not sure of the easiest way to put in a replacement for wcwidth
> given that the wcwidth module is in use.  I tried the stub implementation
> as before with a different name, but this led to test failures, so may
> not be enough.  It's possible there have also been changes in the tests.
> Do you know the last released version of Texinfo that passed the test
> suite successfully?

Texinfo 7.0.3 succeeded to run the tests.

> I wonder if it is commit b9347e3db9d0 that is responsible (2022-11-11,
> Patrice Dumas), or other changes to tp/tests/coverage_macro.texi that
> change what is occurring in the line.

I doubt that, since the previous versions already included, for
example, the dotless j letter, which is one of those which cause
trouble.

> As I said before, one short-term fix I would be happy with is to split
> the content up so there are shorter lines.  Given that the purpose of
> these tests is not to test line-breaking in itself, and that this is
> a fragile part of texi2any's output, if line breaking is to be tested
> this should be part of a specialised test.  Any difference in the
> line breaking for the coverage_macro.texi tests leads to a mass of
> differences which are hard to interpret.  We could put any problematic
> characters on lines of their own, e.g.

This would be fine by me, if filling is not the issue being tested
there.



Re: Texinfo 7.0.93 pretest available

2023-10-10 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 10 Oct 2023 18:09:15 +0100
> Cc: Eli Zaretskii , bug-texinfo@gnu.org
> 
> On Mon, Oct 09, 2023 at 11:32:49PM +0200, Bruno Haible wrote:
> > Gavin Smith wrote:
> > > It is supposed to attempt to force the locale to a UTF-8 locale.  You
> > > can see the code in xspara_init that attempts to change the locale.  There
> > > is also a comment before xspara_add_text:
> > > 
> > >   "This function relies on there being a UTF-8 locale in LC_CTYPE for
> > >   mbrtowc to work correctly."
> > 
> > That's an inherently unportable thing. You can't just force an UTF-8
> > locale if the system does not have it.
> 
> The module shouldn't load if it can't switch to a UTF-8 locale.  xspara_init
> returns a different value if these attempts fail leading the code loading
> the module (in Texinfo::XSLoader) to fall back to the pure Perl version.

If the inability to load the UTF-8 locale means the modules cannot be
loaded, I consider that a serious problem, because the Perl implementation
is slower.  We need every possible way of speeding up texi2any,
because the speed regression since Texinfo moved to the Perl
implementation is significant, so much so that some refuse to upgrade
from Texinfo 4.13 (and thus hold back usage of new Texinfo features in
the various GNU manuals).  We cannot afford losing speedups due to
such issues, especially since they are solvable using readily
available libraries.

> It would be good to get away from the attempts to switch to a UTF-8 locale
> but I doubt it is urgent to do before the release, as the current approach,
> however flawed, has been in place and worked fairly well for a long time
> (since the XS paragraph module was written).  At the time it seemed to be
> the only way to get the information from wcwidth.

Then what do you propose to do about this in the MinGW port of Texinfo
7.1?  And why is it urgent to release Texinfo 7.1 without fixing this
issue?



Re: Texinfo 7.0.93 pretest available

2023-10-10 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Mon, 9 Oct 2023 20:39:59 +0100
> Cc: Bruno Haible , bug-texinfo@gnu.org
> 
> > IOW, unless the locale's codeset is UTF-8, any character that is not
> > printable _in_the_current_locale_ will return -1 from wcwidth.  I'm
> > guessing that no one has ever tried to run the test suite in a
> > non-UTF-8 locale before?
> 
> It is supposed to attempt to force the locale to a UTF-8 locale.  You
> can see the code in xspara_init that attempts to change the locale.  There
> is also a comment before xspara_add_text:
> 
>   "This function relies on there being a UTF-8 locale in LC_CTYPE for
>   mbrtowc to work correctly."

You cannot force MS-Windows into using the UTF-8 locale (with the
possible exception of very recent Windows versions, which AFAIK still
don't support UTF-8 in full).

You also cannot force an arbitrary Posix system into using UTF-8,
because such a locale might not be installed.

> For MS-Windows there is the w32_setlocale function that may use something
> different:
> 
>   /* Switch to the Windows U.S. English locale with its default
>  codeset.  We will handle the non-ASCII text ourselves, so the
>  codeset is unimportant, and Windows doesn't support UTF-8 as the
>  codeset anyway.  */
>   return setlocale (category, "ENU");
> 
> mbrtowc has its own override which handle UTF-8.
> 
> As far as this relates to wcwidth, there used to be an MS-Windows specific
> stub implementation of this, removed in commit 5a66bc49ac032 (Patrice Dumas,
> 2022-08-19) which added a gnulib implementation of wcwidth:
> 
> diff --git a/tp/Texinfo/XS/xspara.c b/tp/Texinfo/XS/xspara.c
> index 93924a623c..bf4ef91650 100644
> --- a/tp/Texinfo/XS/xspara.c
> +++ b/tp/Texinfo/XS/xspara.c
> @@ -206,13 +206,6 @@ iswspace (wint_t wc)
>return 0;
>  }
>  
> -/* FIXME: Provide a real implementation.  */
> -int
> -wcwidth (const wchar_t wc)
> -{
> -  return wc == 0 ? 0 : 1;
> -}
> -
>  int
>  iswupper (wint_t wi)
>  {
> 
> 
> If this simple stub is preferable to the Gnulib implementation for
> MS-Windows, (e.g. it makes the tests pass) we could re-add it again.

We can do that, but I think we should first explore a better
alternative: use UTF-8 functions everywhere, without relying on the
locale-aware functions of libc, such as wcwidth.  For example, instead
of wcwidth, we could use uc_width.

Is it feasible to use UTF-8 in texi2any disregarding the locale, and
use libunistring or something similar for the few functions we need in
the extensions that are required to deal with non-ASCII characters?
If we can do that, it will work on all systems, including Windows.
(This is basically what Emacs does, but it does that on a much greater
scale, which is unnecessary in texi2any.)



Re: Texinfo 7.0.93 pretest available

2023-10-10 Thread Eli Zaretskii
> Date: Mon, 9 Oct 2023 21:17:28 +0200
> From: Patrice Dumas 
> 
> On Sun, Oct 08, 2023 at 06:29:23PM +0100, Gavin Smith wrote:
> > 
> > I remember that in the past, I broke up some of these lines to avoid
> > test failures on some platform that had different wcwidth results for
> > some characters.
> 
> Maybe an option in the long term here would be not to use wcwidth at all,
> but use libunistring functions like u8_strwidth.  It would probably
> remove the issue of locale.  The only requirement would be to make sure
> that the input string is UTF-8 encoded such that it can be converted to
> uint8_t without risk of error.

Doesn't makeinfo convert all non-ASCII text to UTF-8 anyway?  If so, we
should always use the UTF-8 functions, without relying on the locale
and libc locale-aware functions.



Re: Texinfo 7.0.93 pretest available

2023-10-09 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: gavinsmith0...@gmail.com, bug-texinfo@gnu.org
> Date: Mon, 09 Oct 2023 19:18:25 +0200
> 
> Eli Zaretskii wrote:
> > > I just tried it now: On Linux (Ubuntu 22.04), in a de_DE.UTF-8 locale,
> 
> Oops, typo: What I tested was the de_DE.ISO-8859-1 locale:
> $ export LC_ALL=de_DE.ISO-8859-1

So wcwidth in an ISO-8859-1 locale returns 1 for U+0237?  Even though
U+0237 cannot be encoded in ISO-8859-1?  And iswprint returns non-zero
for it in that locale?

Or does the Texinfo test suite forces the locale to something UTF-8?

> > Since U+0237 is not printable in my locale (it isn't supported by the
> > system codepage), the value -1 is correct.  Am I missing something?
> 
> True. But why don't we see the same test failure on glibc and on FreeBSD
> systems, then, in a locale with ISO-8859-1 encoding?

Good question.  Maybe they interpret the Posix standards differently
(if the locale is not forced by the test suite).

> > > This "simpler approximation" would not return a good result when wc
> > > is a control character (such as CR, LF, TAB, or such). It is important
> > > that the caller of wcwidth() or wcswidth() is able to recognize that
> > > the string as a whole does not have a definite width.
> > 
> > It is still better than returning -1, don't you agree?
> 
> No, I don't agree. Returning -1 tells the caller "watch out, you cannot
> assume anything about printed outline of this string".

I meant "better for Texinfo when it generates Info manuals", not in
general.

> > But for some reason you completely ignored my more general comment
> > about what Texinfo needs from wcwidth.
> 
> That's because I am not familiar with the Texinfo code. I don't know
> whether and where Texinfo calls wcwidth(), and I don't know with which
> expectations it does so.

It calls wcwidth to know how many columns a character will take, in
order to fill lines, when it generates manuals in the Info format.



Re: Texinfo 7.0.93 pretest available

2023-10-09 Thread Eli Zaretskii
> From: Bruno Haible 
> Cc: bug-texinfo@gnu.org
> Date: Mon, 09 Oct 2023 18:15:05 +0200
> 
> Eli Zaretskii wrote:
> > unless the locale's codeset is UTF-8, any character that is not
> > printable _in_the_current_locale_ will return -1 from wcwidth.  I'm
> > guessing that no one has ever tried to run the test suite in a
> > non-UTF-8 locale before?
> 
> I just tried it now: On Linux (Ubuntu 22.04), in a de_DE.UTF-8 locale,
> texinfo 7.0.93 build fine and all tests pass.

de_DE.UTF-8 is a UTF-8 locale.  I asked about non-UTF-8 locales.  An
example would be de_DE.ISO8859-1.  Or what am I missing?

> > Yes, quite a few characters return -1 from wcwidth, in particular the
> > ȷ character above (which explains the above difference).
> 
> This character is U+0237 LATIN SMALL LETTER DOTLESS J. It *should* be
> recognized as having a width of 1 in all implementations of wcwidth.

But if U+0237 cannot be represented in the locale's codeset, its width
can not be 1, because it cannot be printed.  This is my interpretation
of the standard's language (emphasis mine):

  DESCRIPTION

  The wcwidth() function shall determine the number of column
  positions required for the wide character wc. The application
  shall ensure that the value of wc is a character representable
  as a wchar_t, and is a wide-character code corresponding to a
  valid character in the current locale.
  ^
  RETURN VALUE

  The wcwidth() function shall either return 0 (if wc is a null
  wide-character code), or return the number of column positions
  to be occupied by the wide-character code wc, or return -1 (if
  wc does not correspond to a printable wide-character code).
 ^^
Since U+0237 is not printable in my locale (it isn't supported by the
system codepage), the value -1 is correct.  Am I missing something?

> There's no reason for it to have a width of -1, since it's not a control
> character.
> There's no reason for it to have a width of 0, since it's not a combining
> mark or a non-spacing character.
> There's no reason for it to have a width of 2, since it's not a CJK character
> and not in a Unicode range with many CJK characters.

I think you assume that all the Unicode letter characters are always
printable in every locale.  That's not what I understand, and iswprint
agrees with me, because I get -1 for U+0237 due to this code:

> >   return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;


> > I don't think the above logic in Gnulib's wcwidth (which basically
> > replicates the logic in any reasonable wcwidth implementation, so is
> > not specific to Gnulib) fits what Texinfo needs.  Texinfo needs to be
> > able to produce output independently of the locale.  What matters to
> > Texinfo is the encoding of the output document, not the locale's
> > codeset.  So I think we should call uc_width when the output document
> > encoding is UTF-8 (which is the default, including in the above test),
> > regardless of the locale's codeset.  Or we could use a simpler
> > approximation:
> > 
> >   return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;
> 
> This "simpler approximation" would not return a good result when wc
> is a control character (such as CR, LF, TAB, or such). It is important
> that the caller of wcwidth() or wcswidth() is able to recognize that
> the string as a whole does not have a definite width.

It is still better than returning -1, don't you agree?

But for some reason you completely ignored my more general comment
about what Texinfo needs from wcwidth.



Re: Texinfo 7.0.93 pretest available

2023-10-09 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 20:21:44 +0100
> Cc: bug-texinfo@gnu.org
> 
> Just comparing the first line in the hunk:
> 
> -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
> 
> the line you are getting is longer than the reference results.  
> 
> I wonder if for some of the non-ASCII characters wcwidth is returning 0 or
> -1 leading the line to be longer.

Yes, quite a few characters return -1 from wcwidth, in particular the
ȷ character above (which explains the above difference).

> It's also possible that other codepoints have inconsistent wcwidth results,
> especially for combining accents.
> 
> Do you know if it is the gnulib implementation of wcwidth that is being
> used or a MinGW one?

AFAIK, MinGW doesn't have wcwidth, so we are using the one from
Gnulib.  But what Gnulib does in this case is not what Texinfo
expects, I think:

int
wcwidth (wchar_t wc)
#undef wcwidth
{
  /* In UTF-8 locales, use a Unicode aware width function.  */
  if (is_locale_utf8_cached ())
{
  /* We assume that in a UTF-8 locale, a wide character is the same as a
 Unicode character.  */
  return uc_width (wc, "UTF-8");
}
  else
{
  /* Otherwise, fall back to the system's wcwidth function.  */
#if HAVE_WCWIDTH
  return wcwidth (wc);
#else
  return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
#endif
}
}

IOW, unless the locale's codeset is UTF-8, any character that is not
printable _in_the_current_locale_ will return -1 from wcwidth.  I'm
guessing that no one has ever tried to run the test suite in a
non-UTF-8 locale before?

I don't think the above logic in Gnulib's wcwidth (which basically
replicates the logic in any reasonable wcwidth implementation, so is
not specific to Gnulib) fits what Texinfo needs.  Texinfo needs to be
able to produce output independently of the locale.  What matters to
Texinfo is the encoding of the output document, not the locale's
codeset.  So I think we should call uc_width when the output document
encoding is UTF-8 (which is the default, including in the above test),
regardless of the locale's codeset.  Or we could use a simpler
approximation:

  return wc == 0 ? 0 : iswcntrl (wc) ? 0 : 1;

CC'ing Bruno who I think knows much more about this.



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 18:29:23 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 08, 2023 at 07:31:12PM +0300, Eli Zaretskii wrote:
> > I see a very large diff, full of non-ASCII characters.  A typical hunk
> > is below:
> > 
> >   -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
> >   -(ȷ) ‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
> >   -‘@tieaccent{a}’ a͡ ‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ
> >   -(ạ) ‘@v{a}’ ǎ (ǎ) @,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> >   +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
> >   +‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å) ‘@tieaccent{a}’ 
> > a͡
> >   +‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ (ạ) ‘@v{a}’ ǎ (ǎ)
> >   +@,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> > 
> > It looks like a filling problem to me, perhaps because something
> > counts bytes instead of characters?
> 
> It's almost certainly a problem with filling as you say.  In the C (XS)
> code, the return value of wcwidth is used for each character to get
> the width of each line.  The pure Perl code doesn't use the wcwidth
> function as far as I know but keeps a count for each line based on
> regex character classes.  The relevant code is in
> Texinfo/Convert/Unicode.pm, in the 'string_width' function.

So perhaps the wcwidth function is the culprit.  I'm guessing that it
returns 1 for every printable character in my case.

> Do you know whether the XS modules are in use?

Yes, they are.  That's why Perl crashed before the getdelim issue was
fixed, and the crash was inside Parsetexi.dll, which is an XS module.

> You could try "export TEXINFO_XS=omit" or "export TEXINFO_XS=require" to
> check if it makes a difference.  That would narrow it down to which version
> of the code had the problem (or if they both have a problem).

This command succeeds with status 0:

  $ TEXINFO_XS=omit test_scripts/coverage_formatting_info.sh



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> Date: Sun, 08 Oct 2023 19:31:12 +0300
> From: Eli Zaretskii 
> Cc: bug-texinfo@gnu.org
> 
> I see a very large diff, full of non-ASCII characters.  A typical hunk
> is below:
> 
>   -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
>   -(ȷ) ‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
>   -‘@tieaccent{a}’ a͡ ‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ
>   -(ạ) ‘@v{a}’ ǎ (ǎ) @,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
>   +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
>   +‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å) ‘@tieaccent{a}’ a͡
>   +‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ (ạ) ‘@v{a}’ ǎ (ǎ)
>   +@,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
> 
> It looks like a filling problem to me, perhaps because something
> counts bytes instead of characters?

Or maybe the data about character width is incorrect/inconsistent?



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 17:04:36 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 08, 2023 at 04:55:24PM +0300, Eli Zaretskii wrote:
> > > Date: Sun, 08 Oct 2023 16:42:05 +0300
> > > From: Eli Zaretskii 
> > > CC: bug-texinfo@gnu.org
> > > 
> > > The next set of problems is in install-info: the new code in this
> > > version fails to close files, and then Windows doesn't let us
> > > remove/rename them.  The result is that almost all the install-info
> > > tests fail with Permission denied.  The patch below fixes that:
> > 
> > Finally, 8 tests in tp/tests fail:
> > 
> >test_scripts/coverage_formatting_info.sh
> >test_scripts/coverage_formatting_plaintext.sh
> >test_scripts/layout_formatting_info_ascii_punctuation.sh
> >test_scripts/layout_formatting_info_disable_encoding.sh
> >test_scripts/layout_formatting_plaintext_ascii_punctuation.sh
> >test_scripts/layout_formatting_fr.sh
> >test_scripts/layout_formatting_fr_info.sh
> >test_scripts/layout_formatting_fr_icons.sh
> > 
> > I don't think I understand how to debug this.  I tried to look at the
> > output and log files, but either I look at the wrong files or I
> > misunderstand how to interpret them.  Any help and advice will be
> > appreciated.
> 
> First change to the tp/tests subdirectory.  Then run the test script.
> For example:
> 
> test_scripts/coverage_formatting_info.sh

Thanks.

> This prints the texi2any command run, and if there are unexpected results
> these should be printed too.  On my system, here is what is printed for that
> test:
> 
> testdir: coverage/
> driving_file: ./coverage//list-of-tests
> made result dir: ./coverage//res_parser/
> 
> doing test formatting_info, src_file ./coverage//formatting.texi
> format_option: 
> texi2any.pl formatting_info -> coverage//out_parser/formatting_info
>  /usr/bin/perl -w ./..//texi2any.pl  --force --conf-dir ./../t/init/ 
> --conf-dir ./../init --conf-dir ./../ext -I ./coverage/ -I coverage// -I ./ 
> -I . -I built_input --error-limit=1000 -c TEST=1  --output 
> coverage//out_parser/formatting_info/ -D 'needcollationcompat Need collation 
> compatibility' --info ./coverage//formatting.texi > 
> coverage//out_parser/formatting_info/formatting.1 
> 2>coverage//out_parser/formatting_info/formatting.2
> 
> all done, exiting with status 0
> 
> If any of the output files or standard output or error differered from what
> was expected, this would be printed as a diff afterwards.

I see a very large diff, full of non-ASCII characters.  A typical hunk
is below:

  -(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ
  -(ȷ) ‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å)
  -‘@tieaccent{a}’ a͡ ‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ
  -(ạ) ‘@v{a}’ ǎ (ǎ) @,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)
  +(ì) @'{e} é (é) @'{@dotless{i}} í (í) @dotless{i} ı (ı) @dotless{j} ȷ (ȷ)
  +‘@H{a}’ a̋ ‘@dotaccent{a}’ ȧ (ȧ) ‘@ringaccent{a}’ å (å) ‘@tieaccent{a}’ a͡
  +‘@u{a}’ ă (ă) ‘@ubaraccent{a}’ a̲ ‘@udotaccent{a}’ ạ (ạ) ‘@v{a}’ ǎ (ǎ)
  +@,c ç (ç) ‘@,{c}’ ç (ç) ‘@ogonek{a}’ ą (ą)

It looks like a filling problem to me, perhaps because something
counts bytes instead of characters?

The diffs like above are followed by diffs in the Index part, where it
looks like the differences are just line counts:

   * Menu:

  -* truc:  chapter.(line 2236)
  +* truc:  chapter.(line 2234)

Probably due to the same problem of incorrect filling of lines?



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 16:33:22 +0100
> Cc: bug-texinfo@gnu.org
> 
> > > Hence, I propose to initialise n to 0, rather than 120 as in the patch
> > > below.
> > 
> > No, the value must be positive, otherwise it still crashes.  It's a
> > bug in MinGW implementation.
> 
> Can you refer to any discussion of this bug online anywhere?

I don't need any discussions, I simply read the code.  MinGW is Free
Software, so the sources of its additions to the Microsoft runtime are
part of the MinGW distribution.  Once I understood that the build is
using the MinGW getdelim, I simply looked at the sources.

> I see on the POSIX specification:
> 
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/getdelim.html
> 
> the wording is slightly different to the glibc manual:
> 
>If *n is non-zero, the application shall ensure that *lineptr either
>points to an object of size at least *n bytes, or is a null pointer.
>
>If *lineptr is a null pointer or if the object pointed to by *lineptr
>is of insufficient size, an object shall be allocated...
> 
> This implies that it is ok to have null *LINEPTR and positive *N.

Yes, it is OK.  It should also be OK to have *N be any garbage when
*LINEPTR is NULL, but the MinGW implementation fails to support that
case.

> I don't like using the value 120 as this is slightly larger than a
> default line length of 80, which is confusing as you might think it
> was that number for a reason and that we were supporting input line
> lengths up to 120 bytes, when in fact any positive number would have
> done.
> 
> I will change it to be 1 with a comment that it should be any positive
> number.

The value 1 works, I already tested that.

> This bug sounds like something that should be worked around with gnulib.
> Would you be able to send details of the bug to bug-gnu...@gnu.org as
> well as any information on the versions of MinGW affected?

Yes, when I have time.  I'm a bit busy these days; it's sheer luck I
had so much time today to work on the non-trivial problems in this
pretest.  (And it isn't over yet.)



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> Date: Sun, 08 Oct 2023 16:42:05 +0300
> From: Eli Zaretskii 
> CC: bug-texinfo@gnu.org
> 
> The next set of problems is in install-info: the new code in this
> version fails to close files, and then Windows doesn't let us
> remove/rename them.  The result is that almost all the install-info
> tests fail with Permission denied.  The patch below fixes that:

Finally, 8 tests in tp/tests fail:

   test_scripts/coverage_formatting_info.sh
   test_scripts/coverage_formatting_plaintext.sh
   test_scripts/layout_formatting_info_ascii_punctuation.sh
   test_scripts/layout_formatting_info_disable_encoding.sh
   test_scripts/layout_formatting_plaintext_ascii_punctuation.sh
   test_scripts/layout_formatting_fr.sh
   test_scripts/layout_formatting_fr_info.sh
   test_scripts/layout_formatting_fr_icons.sh

I don't think I understand how to debug this.  I tried to look at the
output and log files, but either I look at the wrong files or I
misunderstand how to interpret them.  Any help and advice will be
appreciated.



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> Date: Sun, 08 Oct 2023 14:39:36 +0300
> From: Eli Zaretskii 
> CC: bug-texinfo@gnu.org
> 
> Sorry, I was mistaken: the Gnulib getdelim is not used here.  Instead,
> this build uses the MinGW implementation of getdelim, and that one has
> a subtle bug, which rears its ugly head because the second argument to
> getline, here:
> 
>   status = getline (&line, &n, input_file);
> 
> is not initialized to any value.  The simple fix below avoids the
> crash and allows the build to run to completion:

The next set of problems is in install-info: the new code in this
version fails to close files, and then Windows doesn't let us
remove/rename them.  The result is that almost all the install-info
tests fail with Permission denied.  The patch below fixes that:

--- install-info/install-info.c~2023-09-13 20:17:33.0 +0300
+++ install-info/install-info.c 2023-10-08 16:28:21.51700 +0300
@@ -826,13 +826,15 @@ determine_file_type:
   /* Redirect stdin to the file and fork the decompression process
  reading from stdin.  This allows shell metacharacters in filenames. */
   char *command = concat (*compression_program, " -d", "");
+  FILE *f2;
 
   if (fclose (f) < 0)
 return 0;
-  f = freopen (*opened_filename, FOPEN_RBIN, stdin);
+  f2 = freopen (*opened_filename, FOPEN_RBIN, stdin);
   if (!f)
 return 0;
   f = popen (command, "r");
+  fclose (f2);
   if (!f)
 {
   /* Used for error message in calling code. */
@@ -904,7 +906,7 @@ readfile (char *filename, int *sizep,
   /* We need to close the stream, since on some systems the pipe created
  by popen is simulated by a temporary file which only gets removed
  inside pclose.  */
-  if (compression_program)
+  if (compression_program && *compression_program)
 pclose (f);
   else
 fclose (f);



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 12:50:51 +0100
> Cc: bug-texinfo@gnu.org
> 
> On Sun, Oct 08, 2023 at 02:39:36PM +0300, Eli Zaretskii wrote:
> > Sorry, I was mistaken: the Gnulib getdelim is not used here.  Instead,
> > this build uses the MinGW implementation of getdelim, and that one has
> > a subtle bug, which rears its ugly head because the second argument to
> > getline, here:
> > 
> >   status = getline (&line, &n, input_file);
> > 
> > is not initialized to any value.  The simple fix below avoids the
> > crash and allows the build to run to completion:
> 
> (I'd noticed that and checked the Gnulib implementation didn't need n
> to be defined if the first argument was null.)
> 
> According to the documentation for getline,
> 
>  If you set ‘*LINEPTR’ to a null pointer, and ‘*N’ to zero, before
>  the call, then ‘getline’ allocates the initial buffer for you by
>  calling ‘malloc’.  This buffer remains allocated even if ‘getline’
>  encounters errors and is unable to read any bytes.
> 
> Hence, I propose to initialise n to 0, rather than 120 as in the patch
> below.

No, the value must be positive, otherwise it still crashes.  It's a
bug in MinGW implementation.



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> Date: Sun, 08 Oct 2023 12:41:19 +0300
> From: Eli Zaretskii 
> Cc: bug-texinfo@gnu.org
> 
>   Starting program: d:\usr\Perl\bin\perl.exe ../tp/texi2any.pl info-stnd.texi
> 
>   Program received signal SIGSEGV, Segmentation fault.
>   0x692a6fc6 in getdelim ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   (gdb) bt
>   #0  0x692a6fc6 in getdelim ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #1  0x6928c993 in next_text ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #2  0x6928ba6a in parse_texi ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #3  0x6928bc58 in parse_texi_document ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #4  0x692840d0 in parse_file ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #5  0x6928219c in XS_Texinfo__Parser_parse_file ()
>  from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
>   #6  0x66c8b8bb in perl520!Perl_find_runcv () from 
> d:\usr\Perl\bin\perl520.dll
> 
> Source code information is not available in the debug info, but from
> looking at the disassembly of the code, I see that getdelim (from
> Gnulib) calls realloc, which resolves to the default realloc
> implementation of the MinGW libc.  Isn't that dangerous, given that at
> least some code in the extensions uses the Perl's malloc/free
> implementation?

Sorry, I was mistaken: the Gnulib getdelim is not used here.  Instead,
this build uses the MinGW implementation of getdelim, and that one has
a subtle bug, which rears its ugly head because the second argument to
getline, here:

  status = getline (&line, &n, input_file);

is not initialized to any value.  The simple fix below avoids the
crash and allows the build to run to completion:

--- tp/Texinfo/XS/parsetexi/input.c~2023-08-14 23:12:04.0 +0300
+++ tp/Texinfo/XS/parsetexi/input.c 2023-10-08 14:35:33.14200 +0300
@@ -395,7 +395,7 @@ next_text (ELEMENT *current)
 {
   ssize_t status;
   char *line = 0;
-  size_t n;
+  size_t n = 120;
   FILE *input_file;
 
   if (input_pushback_string)



Re: Texinfo 7.0.93 pretest available

2023-10-08 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sun, 8 Oct 2023 08:50:51 +0100
> Cc: bug-texinfo@gnu.org
> 
> The program appears to crash after the "@include version-stnd.texi" line
> which read a new input file.  This suggests that the problem may be to
> do with reading input, somewhere in 'next_text' in input.c.
> 
> I suggest commenting out the "@include" line:
> 
> diff --git a/doc/info-stnd.texi b/doc/info-stnd.texi
> index 36d884a76a..883408ffcd 100644
> --- a/doc/info-stnd.texi
> +++ b/doc/info-stnd.texi
> @@ -4,7 +4,7 @@
>  @c file is made first, and texi2dvi must include . first in the path.
>  @comment %**start of header
>  @setfilename info-stnd.info
> -@include version-stnd.texi
> +@c @include version-stnd.texi
>  @settitle Stand-alone GNU Info @value{VERSION}
>  @syncodeindex vr cp
>  @syncodeindex fn cp
> 
> and trying the command again.  If it gets further, that would confirm
> there was a problem with included files.

Yes, it gets much further, I think to the very end?  It still crashes,
though, after printing this:

  GET_A_NEW_LINE
  NEW LINE @bye
  BEGIN LINE
  COMMAND @bye
  ABORT EMPTY in @appendix[A1][C4](p:1): empty_line; add || to ||
  FINISHED_TOTALLY
  GATHER AFTER BYE

> gdb /d/usr/Perl/bin/perl
> 
> Then at the gdb prompt, run
> 
> r ../tp/texi2any.pl info-stnd.texi
> 
> Hopefully it shows you where the crash occurs.

I have some info, see below.

> If the "parsetexi" module was compiled with debugging information, I have
> always found on GNU/Linux that it is possible to debug the module just as
> you would debug a standalone program. 

Alas, the default is not to compile parsetexi with debug info, at
least not a sufficient one (or maybe producing the shared library
doesn't keep the symbols).  Here's what I get from GDB's "bt" command
from the crash site:

  Starting program: d:\usr\Perl\bin\perl.exe ../tp/texi2any.pl info-stnd.texi

  Program received signal SIGSEGV, Segmentation fault.
  0x692a6fc6 in getdelim ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  (gdb) bt
  #0  0x692a6fc6 in getdelim ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #1  0x6928c993 in next_text ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #2  0x6928ba6a in parse_texi ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #3  0x6928bc58 in parse_texi_document ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #4  0x692840d0 in parse_file ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #5  0x6928219c in XS_Texinfo__Parser_parse_file ()
 from D:\gnu\texinfo-7.0.93\tp\Texinfo\XS\.libs\Parsetexi.dll
  #6  0x66c8b8bb in perl520!Perl_find_runcv () from d:\usr\Perl\bin\perl520.dll

Source code information is not available in the debug info, but from
looking at the disassembly of the code, I see that getdelim (from
Gnulib) calls realloc, which resolves to the default realloc
implementation of the MinGW libc.  Isn't that dangerous, given that at
least some code in the extensions uses the Perl's malloc/free
implementation?

If the above information is not enough, I will try to build the
extensions with more extensive debug info, and see what GDB will tell
then.  Alternatively, maybe you have ideas to try some code changes
based on the above.

Thanks.



Re: Texinfo 7.0.93 pretest available

2023-10-07 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sat, 7 Oct 2023 17:26:51 +0100
> Cc: bug-texinfo@gnu.org
> 
> I've changed xspara__print_escaped_spaces not to use malloc and free,
> although adding dTHX should be harmless.

Yes, I've seen that.  Applying that change doesn't prevent the
crashes.

> Try going into the doc directory and replicating the command to build
> the manual:
> 
> TEXINFO_DEV_SOURCE=1  top_srcdir=".."  top_builddir=".." /d/usr/Perl/bin/perl 
> ../tp/texi2any -c INFO_SPECIAL_CHARS_WARNING=0  -I . -o texinfo.info  
> texinfo.texi
> 
> and see if the problem replicates.

Crashes.

> More straightforwardly, try
> 
> /d/usr/Perl/bin/perl ../tp/texi2any.pl texinfo.texi

Also crashes.

> (which will output a harmless warning about a node name).
> 
> Then you could try with debugging output:
> 
> /d/usr/Perl/bin/perl ../tp/texi2any.pl texinfo.texi -c DEBUG=1
> 
> or for a smaller file,
> 
> /d/usr/Perl/bin/perl ../tp/texi2any.pl info-stnd.texi -c DEBUG=1
> 
> to get an idea of where the crash is occurring.

The output of the last command is:

  $ /d/usr/Perl/bin/perl ../tp/texi2any.pl info-stnd.texi -c DEBUG=1
   RESETTING THE PARSER !
  NEW LINE @c We must \input texinfo.tex instead of texinfo, otherwise make
  BEGIN LINE
  COMMAND @c
  ABORT EMPTY in (before_node_section)[C2](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @c distcheck in the Texinfo distribution fails, because the texinfo 
Info
  BEGIN LINE
  COMMAND @c
  ABORT EMPTY in (before_node_section)[C3](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @c file is made first, and texi2dvi must include . first in the path.
  BEGIN LINE
  COMMAND @c
  ABORT EMPTY in (before_node_section)[C4](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @comment %**start of header
  BEGIN LINE
  COMMAND @comment
  ABORT EMPTY in (before_node_section)[C5](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @setfilename info-stnd.info
  BEGIN LINE
  COMMAND @setfilename
  ABORT EMPTY in (before_node_section)[C6](p:1): empty_line; add || to ||
  ABORT EMPTY in (line_arg)[C1](p:1): internal_spaces_after_command; add || to 
| |
  NEW TEXT (merge): info-stnd|||
  MERGED TEXT: .||| in [T: info-stnd] last of (line_arg)[C1]
  MERGED TEXT: info||| in [T: info-stnd.] last of (line_arg)[C1]
  END LINE (line_arg)[C1] <- @setfilename
  MERGED TEXT:
  ||| in [T: info-stnd.info] last of (line_arg)[C1]
  ISOLATE SPACE p (line_arg)[C1]; c [T: info-stnd.info\n]
  MISC END setfilename
  GET_A_NEW_LINE
  NEW LINE @include version-stnd.texi
  BEGIN LINE
  COMMAND @include
  ABORT EMPTY in (before_node_section)[C7](p:1): empty_line; add || to ||
  ABORT EMPTY in (line_arg)[C1](p:1): internal_spaces_after_command; add || to 
| |
  NEW TEXT (merge): version-stnd|||
  MERGED TEXT: .||| in [T: version-stnd] last of (line_arg)[C1]
  MERGED TEXT: texi||| in [T: version-stnd.] last of (line_arg)[C1]
  END LINE (line_arg)[C1] <- @include
  MERGED TEXT:
  ||| in [T: version-stnd.texi] last of (line_arg)[C1]
  ISOLATE SPACE p (line_arg)[C1]; c [T: version-stnd.texi\n]
  MISC END include
  Included ./version-stnd.texi
  MARK include c: 1 p: 0 start no-add @setfilename[A1] (before_node_section)[C6]
  GET_A_NEW_LINE
  NEW LINE @set UPDATED 15 August 2023
  BEGIN LINE
  COMMAND @set
  ABORT EMPTY in (before_node_section)[C7](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @set UPDATED-MONTH August 2023
  BEGIN LINE
  COMMAND @set
  ABORT EMPTY in (before_node_section)[C8](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @set EDITION 7.0.93
  BEGIN LINE
  COMMAND @set
  ABORT EMPTY in (before_node_section)[C9](p:1): empty_line; add || to ||
  GET_A_NEW_LINE
  NEW LINE @set VERSION 7.0.93
  BEGIN LINE
  COMMAND @set
  ABORT EMPTY in (before_node_section)[C10](p:1): empty_line; add || to ||
  GET_A_NEW_LINE

and then it crashes.

Does this help?



Re: Texinfo 7.0.93 pretest available

2023-10-07 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Sat, 30 Sep 2023 17:16:57 +0100
> Cc: platform-test...@gnu.org
> 
> A pretest distribution for the next Texinfo release (7.1) has been
> uploaded to
> 
> https://alpha.gnu.org/gnu/texinfo/texinfo-7.0.93.tar.xz

This fails to build on MS-Windows with mingw.org's MinGW.  First, I
needed to add the missing dTHX in several places; patch below.  After
making those changes, the extensions compiled and linked, but Perl
crashed while running this command:

   make[3]: Entering directory `/d/gnu/texinfo-7.0.93/doc'
   restore=: && backupdir=".am$$" && \
   rm -rf $backupdir && mkdir $backupdir && \
   if (TEXINFO_DEV_SOURCE=1  top_srcdir=".."  top_builddir=".." 
/d/usr/Perl/bin/perl ../tp/texi2any --version) >/dev/null 2>&1; then \
 for f in texinfo.info texinfo.info-[0-9] texinfo.info-[0-9][0-9] 
texinfo.i[0-9] texinfo.i[0-9][0-9]; do \
   if test -f $f; then mv $f $backupdir; restore=mv; else :; fi; \
 done; \
   else :; fi && \
   if TEXINFO_DEV_SOURCE=1  top_srcdir=".."  top_builddir=".." 
/d/usr/Perl/bin/perl ../tp/texi2any -c INFO_SPECIAL_CHARS_WARNING=0  -I . \
-o texinfo.info `test -f 'texinfo.texi' || echo './'`texinfo.texi; \
   then \
 rc=0; \
   else \
 rc=$?; \
 $restore $backupdir/* `echo "./texinfo.info" | sed 's|[^/]*$||'`; \
   fi; \
   rm -rf $backupdir; exit $rc
   Makefile:1833: recipe for target `texinfo.info' failed
   make[3]: *** [texinfo.info] Error 5

The crash is inside parsetexi.dll, but I don't know where exactly.

Any ideas how to debug this?

Here's the patch I promised:

--- tp/Texinfo/XS/xspara.c.~1~  2023-08-14 21:47:01.0 +0300
+++ tp/Texinfo/XS/xspara.c  2023-10-07 15:48:18.90300 +0300
@@ -242,6 +242,9 @@ xspara__print_escaped_spaces (char *stri
 {
   static TEXT t;
   char *p = string;
+
+  dTHX;
+
   text_reset ();
   while (*p)
 {
@@ -566,6 +569,8 @@ xspara_get_pending (void)
 void
 xspara__add_pending_word (TEXT *result, int add_spaces)
 {
+  dTHX;
+
   if (state.word.end == 0 && !state.invisible_pending_word && !add_spaces)
 return;
 
@@ -641,6 +646,9 @@ char *
 xspara_end (void)
 {
   static TEXT ret;
+
+  dTHX;
+
   text_reset ();
   state.end_line_count = 0;
 
@@ -687,6 +695,8 @@ xspara_end (void)
 void
 xspara__add_next (TEXT *result, char *word, int word_len, int transparent)
 {
+  dTHX;
+
   int disinhibit = 0;
   if (!word)
 return;


--- tp/Texinfo/XS/parsetexi/api.c.~1~   2023-08-14 23:12:04.0 +0300
+++ tp/Texinfo/XS/parsetexi/api.c   2023-10-07 15:50:23.49675 +0300
@@ -158,6 +158,8 @@ reset_parser_except_conf (void)
 void
 reset_parser (int debug_output)
 {
+  dTHX;
+
   /* NOTE: Do not call 'malloc' or 'free' in this function or in any function
  called in this file.  Since this file (api.c) includes the Perl headers,
  we get the Perl redefinitions, which we do not want, as we don't use



Re: ignoring control characters in character width

2023-09-06 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 6 Sep 2023 02:51:47 +0100
> 
> On Tue, Sep 05, 2023 at 09:16:47PM +0200, Patrice Dumas wrote:
> > I think I understand what you don't understand, actually this is not
> > about displaying the characters, which is not really done by texi2any,
> > it is about situations where we need to count the width of characters
> > in texi2any.  For instance, this is to determine when to put end of
> > lines when formatting Info to compare with line width, or to format
> > multitable cells, or to determine the length of underlining * for a
> > heading string as in 
> > 
> > Some heading
> > 
> > 
> > Hope that it is clearer.
> 
> It would be wrong to output control characters in Info output.  It doesn't
> matter what the program does in this situation as it would mean that
> something wrong is happening somewhere else, e.g. malformed input.  Worrying
> about what we should do for vertical tabs or form feeds is a waste of time
> in my opinion.
> 
> So it doesn't matter what width is used for these characters, so we
> should do whatever is simplest in this part of the code.  Using 0 for
> the width seems as good as any choice.

Well, if you say that we should never-ever display these characters,
then obviously zero is a good value.

But is it really true that we never display any of them?  Not even
TAB?



Re: ignoring control characters in character width

2023-09-05 Thread Eli Zaretskii
> Date: Tue, 5 Sep 2023 21:16:47 +0200
> From: Patrice Dumas 
> Cc: bug-texinfo@gnu.org
> 
> I think I understand what you don't understand, actually this is not
> about displaying the characters, which is not really done by texi2any,
> it is about situations where we need to count the width of characters
> in texi2any.  For instance, this is to determine when to put end of
> lines when formatting Info to compare with line width, or to format
> multitable cells, or to determine the length of underlining * for a
> heading string as in 
> 
> Some heading
> 
> 
> Hope that it is clearer.  Also we need to make this choice without
> knowing precisely how the characters will be displayed.  In general
> the display is done by info readers for Info, but it could also be in a
> pager, a text editor for the diverse possibilities of plain text output.

OK, but in any case the width of control characters is not zero,
except for some of them, like newline.

Perhaps you should describe the problem you are trying to solve in
more detail?



Re: ignoring control characters in character width

2023-09-05 Thread Eli Zaretskii
> Date: Tue, 5 Sep 2023 20:19:40 +0200
> From: Patrice Dumas 
> Cc: bug-texinfo@gnu.org
> 
> On Tue, Sep 05, 2023 at 09:09:18PM +0300, Eli Zaretskii wrote:
> > > Date: Tue, 5 Sep 2023 20:01:53 +0200
> > > From: Patrice Dumas 
> > > 
> > > Currently, when counting the width of a line of character, we count
> > > control characters that are also spaces as having a width of 1.  I think
> > > that it is not good, as control characters either should not have a
> > > width, for end of line, form feed, carriage return, or have a width that
> > > is not well defined for vertical and horizontal tab.  I suggest to
> > > consider all the control characters as having a width of 0.  This will
> > > be consistent with libunistring u8_strwidth, which I intend to use in C
> > > code equivalent to perl code.
> > 
> > Please define "control characters" for this purpose.  Some of them are
> > definitely not zero-width, for example, TAB.
> 
> Characters whose unicode codepoints in decimal are in the range 0 to 31,
> and also 127 (Delete).  This includes the horizontal tab.  It
> corresponds to the [:cntrl:] character class.

Then I guess I still don't understand: how is TAB a zero-width
character?

> > Also, depending on how control characters are displayed, their width
> > could be even 4, for example if they are displayed as \nnn octal
> > escapes.
> 
> It is in a context where they are displayed as encoded bytes.

So what is the context of this discussion, if it is not display of
bytes?  I really don't understand, could you elaborate?

Control characters can also be displayed as ^C, for example, in which
case they take 2 columns.



Re: ignoring control characters in character width

2023-09-05 Thread Eli Zaretskii
> Date: Tue, 5 Sep 2023 20:01:53 +0200
> From: Patrice Dumas 
> 
> Currently, when counting the width of a line of character, we count
> control characters that are also spaces as having a width of 1.  I think
> that it is not good, as control characters either should not have a
> width, for end of line, form feed, carriage return, or have a width that
> is not well defined for vertical and horizontal tab.  I suggest to
> consider all the control characters as having a width of 0.  This will
> be consistent with libunistring u8_strwidth, which I intend to use in C
> code equivalent to perl code.

Please define "control characters" for this purpose.  Some of them are
definitely not zero-width, for example, TAB.

Also, depending on how control characters are displayed, their width
could be even 4, for example if they are displayed as \nnn octal
escapes.

So I think we need more context for this discussion.



Re: Texinfo 7.0.90 pretest on CentOS 8-stream (Unicode::Collate)

2023-08-18 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Fri, 18 Aug 2023 14:47:48 +0100
> Cc: bug-texinfo@gnu.org
> 
> As the log file shows, the Unicode::Collate module is not found.  I don't
> know what the solution to this is.
> 
> It is meant to be included (in the "perl core") with perl 5.26.3 (the perl
> version reported).
> 
> https://perldoc.perl.org/5.26.3/modules
> 
> But evidently we can't rely on this.
> 
> You can also run "corelist -a Unicode::Collate" to confirm it is a core
> module.

This says:

  Unicode::Collate was first released with perl v5.7.3

So I think you can safely rely on that being available.



Re: Texinfo 7.0.90 pretest on mingw

2023-08-17 Thread Eli Zaretskii
> From: Bruno Haible 
> Date: Fri, 18 Aug 2023 01:00:01 +0200
> 
> On mingw 5.0.3, I see 56 test failures:
> 
> FAIL: ii-0001-test
> FAIL: ii-0002-test

Isn't this a problem with different EOL conventions in the expected
results and in what install-info compiled for Windows produces?  If
so, you need to invoke Diff with the --strip-trailing-cr option.  (I
have long ago made 'diff' a shell script that invokes 'diff' the
program with that option, in the MSYS environment, because a lot of
test suites that come from GNU and Unix systems have the same
problem.)



Re: Inconsistency in writing apostrophe in info and html output with version 7.0.3

2023-06-05 Thread Eli Zaretskii
> Date: Mon, 5 Jun 2023 08:11:00 -0700
> From: Raymond Toy 
> 
> It appears not to be consistent. We have this in the texinfo source:
> 
> 
> @fnindex N'th previous output
> 
> with a real apostrophe. The info file has
> 
> 
> * N'th previous output:  Functions and Variables for Command 
> Line.
> 
> and that’s also a real apostrophe. Don’t know what’s different between the 
> two cases.

I'm guessing that @fnindex is the index of function names, in which
case it generates a "code" typeface, where ASCII characters are not
converted to their Unicode typographical equivalents.

So this is consistent, just not the kind of consistency you expected.



Re: Inconsistency in writing apostrophe in info and html output with version 7.0.3

2023-06-05 Thread Eli Zaretskii
> Date: Mon, 5 Jun 2023 07:18:00 -0700
> From: Raymond Toy 
> 
> Maxima grovels over the html file to find appropriate links to use for the 
> html version of the manual.
> This was working fine with 6.8 and earlier because I found appropriate 
> regexps to find the links.
> 
> This stopped working in 7.0.3 (and maybe earlier?). The regexps no longer 
> work. This is fine; there
> was no promise that the format of html links would be consistent.
> 
> The problem I’m seeing is that in the texi source, we have:
> 
> 
> @vrindex Euler's number
> 
> That apostrophe is really an apostrophe character, unicode U+27.
> 
> However, in the generated info file, the index has:
> 
> 
> * Euler’s number:Functions and Variables for 
> Constants.
> 
> In emacs , the apostrophe shows up as \342\200\231, which is 
> Right_Single_Quotation_Mark,
> unicode U+2019.

This is the default, but it is customizable, see the node "Other
Customization Variables" in the Texinfo manual.



Re: Document rendering of man pages on GNU Info manual?

2023-04-21 Thread Eli Zaretskii
[Please keep the list address on the CC.]

> From: Sebastian Carlos 
> Date: Fri, 21 Apr 2023 17:30:20 +0200
> 
> Stand-alone GNU Info for version 7.0.3, where I only found two mentions:
> 
> > --all, -a
> > Find all files matching manual. Three usage patterns are supported, as 
> > follows.
> > First, if --all is used together with --where, info prints the names of all 
> > matching files found on
> standard output (including ‘*manpages*’ if relevant) and exits.
> 
> and
> 
> > M-x man
> > Read the name of a man page to load and display. This uses the man command 
> > on your system to
> retrieve the contents of the requested man page. See also --raw-escapes.

And why is that not enough?



Re: Document rendering of man pages on GNU Info manual?

2023-04-21 Thread Eli Zaretskii
> From: Sebastian Carlos 
> Date: Fri, 21 Apr 2023 16:51:30 +0200
> 
> I noticed that GNU Info does render man pages, but there's no mention of this 
> feature on the GNU Info
> manual. And it's not entirely clear if there's a way to explicitly read man 
> pages with GNU Info, or if it's
> just a fallback during certain conditions.

It is documented.  Which Info manual you are reading, and where did
you get it?



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-06 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Thu, 06 Apr 2023 20:13:32 +0200
> 
> Eli Zaretskii  writes:
> 
> > Btw, I just built Texinfo 7.0.3, and info.exe still displays Unicode
> > quotes as I expect: transliterated to ASCII characters.  So why it
> > doesn't work for your build on your system is still a mystery to me.
> 
> I presume you're using Msys/MinGW and not Msys2/MinGW64?

Yes.

> Maybe that makes a difference?

Maybe, but it's a mystery why it should.  AFAIK, the relevant code
doesn't depend on anything that could be different between those two
flavors.



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-06 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Thu, 06 Apr 2023 14:18:50 +0200
> 
> Eli Zaretskii  writes:
> 
> > Why on Earth is a system header included only in msys2-runtime-devel?
> >
> > Also, is msys2-runtime-devel about building MSYS2 programs or MinGW
> > programs?  If the latter, it shouldn't be needed for your attempts to
> > build a MinGW port.
> 
> Sorry, I can't tell if msys2-runtime-devel is about building MSYS2
> programs or MinGW programs.  The package info is: MSYS2 headers and
> libraries.  I simply did
> 
>   pkgfile -s wait.h
> 
> which returned
> 
>   mingw64/mingw-w64-x86_64-arm-none-eabi-newlib
>   mingw64/mingw-w64-x86_64-postgresql
>   mingw64/mingw-w64-x86_64-python-autopxd2
>   mingw64/mingw-w64-x86_64-riscv64-unknown-elf-newlib
>   msys/msys2-runtime-3.3-devel
>   msys/msys2-runtime-devel
> 
> and took the package which looked sensible to me.

Hmm... MinGW build shouldn't need wait.h, it's a header that is not
present on Windows.  The packages you show above are either not
relevant to building MinGW programs to run them natively on Windows,
or are for MSYS2 development.  If you use such a wait.h, you could get
in trouble.

Texinfo 7.0.3 doesn't have any inclusions of wait.h, only sys/wait.h,
in Gnulib's stdlib.h and in man.c.  The latter is guarded by
"#if defined (HAVE_SYS_WAIT_H)", and the former should not be used
on MinGW.  What problems do you get if wait.h is not available?

Btw, I just built Texinfo 7.0.3, and info.exe still displays Unicode
quotes as I expect: transliterated to ASCII characters.  So why it
doesn't work for your build on your system is still a mystery to me.



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-06 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Thu, 06 Apr 2023 13:22:46 +0200
> 
> Eli Zaretskii  writes:
> 
> > The only Windows-specific issue I'm aware of is that the 'configure'
> > command should point to the native MS-Windows port of Perl, not to the
> > MSYS Perl.
> 
> That's true, as long as you want to build from a tarball.  If you want
> to build from git, you have to install other stuff like:
> 
>   automake-wrapper
>   msys2-runtime-devel
>   libtool
>   man-db
>   help2man

Texinfo doesn't have an INSTALL.REPO file or something to that
effect.  README-hacking might be it, but it seems to be notes by
developers for themselves.  Gavin's call, I'd say.

> Maybe the biggest issue is that the build process requires wait.h which
> is part of msys2-runtime-devel.

Why on Earth is a system header included only in msys2-runtime-devel?

Also, is msys2-runtime-devel about building MSYS2 programs or MinGW
programs?  If the latter, it shouldn't be needed for your attempts to
build a MinGW port.



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-06 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: bug-texinfo@gnu.org
> Date: Thu, 06 Apr 2023 08:55:22 +0200
> 
> I see that Texinfo doesn't have any instructions how to build and install
> Texinfo with MinGW64.  Should I make a proposal?

It depends on what MinGW64-specific nits you think should be there.

The only Windows-specific issue I'm aware of is that the 'configure'
command should point to the native MS-Windows port of Perl, not to the
MSYS Perl.  Everything else "just works", AFAIR.



Re: integer types

2023-04-05 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 5 Apr 2023 14:59:23 +0100
> Cc: pertu...@free.fr, ar...@gnu.org, bug-texinfo@gnu.org
> 
> Might it be better to round-trip through intptr_t rather than through
> a pointer type?

Yes, I think this will be better.  Cleaner, too.

> > I think you are forgetting the endianness.  With at least one of the
> > two possible endianness possibilities, isn't it true that casting to a
> > narrower type can result in catastrophic loss of significant bits, if
> > you cast between integers and pointers, or vice versa?
> 
> I've never heard of that before.  So you are saying if you have a small
> integer (like 5) stored in a narrow integer type, cast this to a wider
> pointer type, and then cast it back to the same integer type, then
> something catastrophic happens?  How does that work?

If the pointer is to a narrower type, then dereferencing it will take
only part of the bits of the integer value.  Depending on the
endianness, that part could be the LSB (good) or MSB (bad).

> > If we don't want to change the type, we can assign the value to a
> > variable of the suitable width:
> > 
> >   void *elptr = value;
> >   add_associated_info_key (e->extra_info, key, elptr, extra_integer);
> 
> How is that different to the following?
> 
> add_associated_info_key (e->extra_info, key, (void*) value, 
> extra_integer);

It avoids the problem with endianness, since all the significant bits
will be copied.



Re: integer types

2023-04-05 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Wed, 5 Apr 2023 12:24:58 +0100
> Cc: Patrice Dumas , ar...@gnu.org, bug-texinfo@gnu.org
> 
> GNU coding standards (Info node (standards)CPU Portability):
> 
>  You need not cater to the possibility that 'long' will be smaller
>   than pointers and 'size_t'.  We know of one such platform: 64-bit
>   programs on Microsoft Windows.  If you care about making your package
>   run on Windows using Mingw64, you would need to deal with 8-byte
>   pointers and 4-byte 'long', which would break this code ...
> 
> We don't need to make changes to stop these warnings if it is going
> to be difficult to do so.

But is it, in fact, difficult?  Are there any problems with using
intptr_t, or (if we fear of it being unsupported on some platform)
with defining a macro that will yield 'long' on all platforms except
Windows, where it will yield 'intptr_t'?  (Since MinGW uses GCC as the
compiler, we can rely on 'intptr_t' to be available there.)

Of course, it is your call as the Texinfo maintainer.  You can
decide that you don't care (but see below), in which case whoever
wants to build a reliable binary on Windows will need to make a simple
change locally.

> As I have said before, the warnings are not about real problems, as
> the integers in question are always small in magnitude in practice
> (e.g. you do not have a @multitable with millions of columns).

I think you are forgetting the endianness.  With at least one of the
two possible endianness possibilities, isn't it true that casting to a
narrower type can result in catastrophic loss of significant bits, if
you cast between integers and pointers, or vice versa?

> I'm concerned that trying to fix this may have the potential to require
> many changes throughout the code, which may not be worth it for the
> sake of silencing a harmless warning.  It may not be as simple as
> changing the types of a few variables or adding casts in a few places.
> 
> For example, for this warning:
> 
> > parsetexi/extra.c: In function 'add_extra_integer':
> > parsetexi/extra.c:124:48: warning: cast to pointer from integer
> >   of different size [-Wint-to-pointer-cast]
> >   124 |   add_associated_info_key (e->extra_info, key, (ELEMENT *) value, 
> > extra_integer);
> 
> 
> The 'value' parameter here has type 'long' which is then cast to a pointer.
> (I don't see how this causes a problem, actually, if 'long' is 32 bits
> and the pointer type is 64 bits.)

See above.

If we don't want to change the type, we can assign the value to a
variable of the suitable width:

  void *elptr = value;
  add_associated_info_key (e->extra_info, key, elptr, extra_integer);

> One option may be to rewrite the code to use a union type.

That can still lose bits, I think.



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-05 Thread Eli Zaretskii
> Date: Wed, 5 Apr 2023 11:31:12 +0200
> From: Patrice Dumas 
> Cc: Arash Esbati , bug-texinfo@gnu.org
> 
> On Wed, Apr 05, 2023 at 11:47:08AM +0300, Eli Zaretskii wrote:
> > Those are real bugs: we should cast to intptr_t instead of long.
> 
> We already do that in some code, but we immediately cast to another type,
> defined in perl, like
>   IV value = (IV) (intptr_t) k->value;
> 
> Is there a integer type we could cast to that represents integers that we
> are sure makes sense to cast from intptr_t?

I'm not sure I understand the question.  Maybe if you tell why
intptr_t doesn't fit this particular bill, I'll be able to give some
meaningful answer.

> For instance, is the
> following correct, or should long be replaced by something else?
>   long max_columns = (long) (intptr_t) k->value;

No, it's incorrect, because on 64-bit Windows 'long' is still 32-bit
wide, whereas a pointer is 64-bit wide.  That's why the compiler
emitted the warning that Arash reported in his environment in the
first place.

We could use 'long long' instead, but:

  . it might be less portable
  . on 32-bit platforms, it's overkill (and will slow the code even if
'long long' does exist)

AFAIU, this kind of problem is exactly the reason for intptr_t and
uintptr_t: they are integer types that are wide enough both for
pointers and for integers.



Re: [PATCH] Silence compiler warnings with MinGW64

2023-04-05 Thread Eli Zaretskii
> From: Arash Esbati 
> Date: Wed, 05 Apr 2023 09:46:28 +0200
> 
> The only other warnings I get are (linebreaks added manually):
> 
> --8<---cut here---start->8---
> parsetexi/handle_commands.c: In function 'handle_other_command':
> parsetexi/handle_commands.c:399:31: warning: cast from pointer to
>   integer of different size [-Wpointer-to-int-cast]
>   399 | max_columns = (long) k->value;
>   |   ^
> parsetexi/handle_commands.c: In function 'handle_line_command':
> parsetexi/handle_commands.c:755:29: warning: cast from pointer to
>  integer of different size [-Wpointer-to-int-cast]
>   755 | level = (long) k->value + 1;
>   | ^
> parsetexi/extra.c: In function 'add_extra_integer':
> parsetexi/extra.c:124:48: warning: cast to pointer from integer
>   of different size [-Wint-to-pointer-cast]
>   124 |   add_associated_info_key (e->extra_info, key, (ELEMENT *) value, 
> extra_integer);
>   |^
> --8<---cut here---end--->8---

Those are real bugs: we should cast to intptr_t instead of long.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Tue, 04 Apr 2023 15:32:47 +0200
> 
> Sure.  And just to confirm: I opened a cmd.exe, adjusted %path% so I
> have the mingw64 directories for its .dll's included, adjusted
> %infopath% and did
> 
>   c:\path\to\my\native\info.exe dir
> 
> and I get `dirï (this is with codepage 850).  Next in cmd.exe, I do
> 
>   chcp 1252
>   c:\path\to\my\native\info.exe dir
> 
> and I get ‘dir’.  I hope I could spell it out clearly.

OK.  So at the very least you have a workaround: use chcp to set the
codepage of the console to be 1252.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Tue, 04 Apr 2023 15:05:01 +0200
> 
> >>   C:\>chcp
> >>   Aktive Codepage: 850.
> >
> > That's likely the problem: this codepage doesn't support Unicode
> > quotes.  What remains to be understood is why doesn't info.exe act
> > accordingly.
> 
> Yes, this seems to be the issue.  In my bash (MinGW64 shell running
> inside Windows Terminal), I did:
> 
>   $ chcp.com 1252
>   $ /c/pathto/my/native/info.exe dir-test-no-coding
> 
> which looks like this:

Which info.exe is that? the one you've built or the one from MSYS2?

> > Is your info.exe built with libiconv, btw?
> 
> I only pass a --prefix to configure and looking at config.log, it has:
> 
>   configure:11566: checking how to link with libiconv
>   configure:11568: result: -liconv
>   configure:11579: checking whether iconv is compatible with its POSIX 
> signature
>   configure:11604: gcc -c -g -O2  conftest.c >&5
>   configure:11604: $? = 0
>   configure:11613: result: yes
> 
> So I'd say yes.

Looks like that.  But here's how to be sure: from the shell prompt
type

  objdump -p /path/to/info.exe | fgrep "DLL Name:"

and you will see all the DLLs that the program was linked against.

> > "In bash" when using what console window?  (Please always state these
> > facts, because otherwise what you tell is ambiguous and can easily
> > mislead.  This issue is complicated and messy enough already, we don't
> > need more complications and confusions.)
> 
> Sure, sorry for being imprecise.  My setup is all the time bash in a
> MinGW64 shell using Windows Terminal.  I start bash like this:
> 
>   c:/msys64/msys2_shell.cmd -defterm -no-start -mingw64
> 
> where -defterm means don't use mintty.

OK.  But as long as we are debugging this problem with ‘dir’ being
displayed as `dirï, please run the info.exe you've built only in the
Command Prompt window and from the cmd.exe prompt, not in mintty and
not from Bash.  If we need to see how info.exe behaves in other
situations, we will mention this explicitly.  OK?



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 4 Apr 2023 13:56:23 +0100
> Cc: ar...@gnu.org, bug-texinfo@gnu.org
> 
> Ok, so assuming that this is all correct, we know that it is possible
> to build Texinfo for MSYS2 because it was done officially, and Arash
> had the official MSYS2 info program installed.
> 
> https://packages.msys2.org/base/texinfo
> 
> There must be some difference in the way it was built.

MSYS2 doesn't provide a native MinGW build of Texinfo.  They provide
only an MSYS2 build, which is basically the Unix code without all the
adaptations we have to the native Windows environment and
idiosyncrasies.

OTOH, I do use MSYS to build my MinGW ports of Texinfo (which can be
downloaded from the ezwinports site), so yes, MSYS _can_ be used to
build the native MinGW port of Texinfo, and that port then works well
for me in my day-to-day work on Windows.

> Or am I getting confused here between an MSYS2 Texinfo and a Texinfo built
> with MSYS2 (which would be a MinGW Texinfo)?  Are these two different
> things?

Yes, they are different.  See above.

MSYS is actually a fork of Cygwin with a few specialized changes
intended to allow invocation of native Windows programs.  Other than
that, MSYS programs are Cygwin programs: they issue Posix syscalls,
which are then converted to Windows by a special DLL on which all MSYS
programs depend.  That DLL is not used in native MinGW programs, which
instead use the stock Windows C runtime library.

> > The problem (at least the problem with Info showing Unicode quotes) is
> > not during the build, it's when Arash runs the Info he produced.  That
> > should work first and foremost on the Command Prompt window, which is
> > the native Windows terminal emulator.  Whether it also works inside
> > mintty, I don't know, but that could be a separate problem, and
> > making it work would be a bonus, because the MinGW build of Info
> > supports the Command Prompt as its main terminal.
> 
> Would this Info be expected to behave the same way as the Info program
> provided by the MSYS2 project?

No, see above: they are different ports.  In particular, MSYS2
programs support UTF-8 whereas native MinGW programs don't, at least
not easily.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> Date: Tue, 04 Apr 2023 15:11:01 +0300
> From: Eli Zaretskii 
> Cc: ar...@gnu.org, bug-texinfo@gnu.org
> 
> There's an easier way:
> 
>   (gdb) ./ginfo.exe

Sorry, this was supposed to be

   $ gdb ./ginfo.exe



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> Date: Tue, 4 Apr 2023 12:48:14 +0200
> From: Patrice Dumas 
> 
> On Tue, Apr 04, 2023 at 10:59:53AM +0100, Gavin Smith wrote:
> > On Tue, Apr 04, 2023 at 09:35:07AM +0200, Arash Esbati wrote:
> > > Eli Zaretskii  writes:
> > > 
> > > > ??? What is your console output codepage set to?
> > > 
> > >   C:\>chcp
> > >   Aktive Codepage: 850.
> > 
> > The use of the codepage 850 instead of what 'locale' reports likely
> > comes from these lines in texi2any.pl:
> 
> I think that I did that for the 7 release, based on some code found on
> internet, but I do not really understand what it does.
> 
> > if (!defined($locale_encoding) and $^O eq 'MSWin32') {
> >   eval 'require Win32::API';
> >   if (!$@) {
> > Win32::API::More->Import("kernel32", "int GetACP()");
> > my $CP = GetACP();
> > if (defined($CP)) {
> >   $locale_encoding = 'cp'.$CP;
> > }
> >   }
> > }

It's the Windows equivalent of nl_langinfo(CODESET).  But the problem
is that, unlike Unix, where you have just one CODESET for an installed
locale, on Windows, we can have 3 different ones:

  . the ANSI codepage
  . the console input codepage
  . the console output codepage

Usually, the last two are identical, but different from the first.
The first one is used by programs for anything except writing to the
console, like encoding of file names.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 4 Apr 2023 10:59:53 +0100
> Cc: Eli Zaretskii , bug-texinfo@gnu.org
> 
> On Tue, Apr 04, 2023 at 09:35:07AM +0200, Arash Esbati wrote:
> > Eli Zaretskii  writes:
> > 
> > > ??? What is your console output codepage set to?
> > 
> >   C:\>chcp
> >   Aktive Codepage: 850.
> 
> The use of the codepage 850 instead of what 'locale' reports likely
> comes from these lines in texi2any.pl:
> 
> 
> if (!defined($locale_encoding) and $^O eq 'MSWin32') {
>   eval 'require Win32::API';
>   if (!$@) {
> Win32::API::More->Import("kernel32", "int GetACP()");
> my $CP = GetACP();
> if (defined($CP)) {
>   $locale_encoding = 'cp'.$CP;
> }
>   }
> }

I think it's the other way around: GetACP is likely to report codepage
1252, the ANSI codepage of Windows localized for Western Europe
systems.  Codepage 850, OTOH, is the _console_ codepage for those
locales.  (Yes, it's a mess.)  Info makes a point to query the system
about the console output codepage:

  char *
  rpl_nl_langinfo (nl_item item)
  {
if (item == CODESET)
  {
static char buf[100];

/* We need all the help we can get from GNU libiconv, so we
   request transliteration as well.  */
sprintf (buf, "CP%u//TRANSLIT", GetConsoleOutputCP ()); <<<<<<<<<<<<<<
return buf;
  }
else
  return nl_langinfo (item);
  }




Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 4 Apr 2023 10:56:28 +0100
> Cc: Eli Zaretskii , bug-texinfo@gnu.org
> 
> The other thing that I am confused about is that you said you were
> building on "Win10, Msys2/MinGW64".   Would Msys2 and MinGW64 not
> be two different architectures?

Yes.

> Is it appropriate to be running (or building) Msys2 programs in an
> MinGW64 shell?

Yes.  It's actually MSYS2's raison d'être.  MSYS2 is a set of tools and
builds of GNU software intended to allow you to build MinGW programs
using Unix shell scripts, Autoconf, and Makefiles that assume Unix
shells and Unix semantics.  The programs you build using MSYS2 will be
MinGW (a.k.a. "native Windows") programs if the compiler and linker you
invoke are MinGW compiler and linker.

> Mixing two similar but distinct systems could have very confusing results.

It _is_ confusing at times (as this discussion clearly shows), but
it's a necessary evil: you cannot build GNU and Unix packages on
Windows without using MSYS2.

> The MSYS2 website tells me they have their own terminal program called
> "mintty"; have you tried building or running in that terminal?

The problem (at least the problem with Info showing Unicode quotes) is
not during the build, it's when Arash runs the Info he produced.  That
should work first and foremost on the Command Prompt window, which is
the native Windows terminal emulator.  Whether it also works inside
mintty, I don't know, but that could be a separate problem, and
making it work would be a bonus, because the MinGW build of Info
supports the Command Prompt as its main terminal.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Gavin Smith 
> Date: Tue, 4 Apr 2023 10:34:23 +0100
> Cc: Eli Zaretskii , bug-texinfo@gnu.org
> 
> On Tue, Apr 04, 2023 at 09:35:07AM +0200, Arash Esbati wrote:
> > Eli Zaretskii  writes:
> > 
> > > What do you get from rpl_nl_langinfo in your case, and what happens in
> > > copy_converting, where degrade_utf8 is supposed to be called when
> > > Unicode quotes aren't supported?
> > 
> > Sorry, I don't follow.  What should I do in order to answer the question
> > above?
> 
> This would require using a debugger such as gdb or inserting debugging
> print statements into the source code of the program.

Yes.

> Because info uses the terminal for display, it is usually best to debug it
> from a separate terminal window.  I use a bash function for this:
> 
> function attach () {
> gdb $1 `pgrep $1`
> }
> 
> and run "attach ginfo" to attach to a running ginfo instance.

There's an easier way:

  (gdb) ./ginfo.exe
  (gdb) set new-console 1
  (gdb) run 

Then Info gets its own separate console, and you can use GDB
conveniently from its original terminal.  This works on Windows.

> This is harder than it used to be on many GNU/Linux distributions.
> Here's a Stack Exchange post I found about it:
> 
> https://askubuntu.com/questions/41629/after-upgrade-gdb-wont-attach-to-process

No such madness on Windows, thank goodness.  At least not yet.



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-04 Thread Eli Zaretskii
> From: Arash Esbati 
> Cc: gavinsmith0...@gmail.com,  bug-texinfo@gnu.org
> Date: Tue, 04 Apr 2023 09:35:07 +0200
> 
> > ??? What is your console output codepage set to?
> 
>   C:\>chcp
>   Aktive Codepage: 850.

That's likely the problem: this codepage doesn't support Unicode
quotes.  What remains to be understood is why doesn't info.exe act
accordingly.

> > What do you get from rpl_nl_langinfo in your case, and what happens in
> > copy_converting, where degrade_utf8 is supposed to be called when
> > Unicode quotes aren't supported?
> 
> Sorry, I don't follow.  What should I do in order to answer the question
> above?

Run info.exe under a debugger and step into the functions I mentioned
to see what's going on there.  Is your info.exe built with libiconv,
btw?

> > Also, what is the font you are using on the console? does it support
> > Unicode quotes?
> 
> In cmd.exe, it is Consolas, in Terminal, it is SourceCodePro.  They both
> support Unicode quotes.  But cmd.exe doesn't show them.  This small
> text file (dir.txt):
> 
>   10.2 ‘dir’: Briefly list directory contents
>   ===
> 
>   ‘dir’ is equivalent to ‘ls -C -b’; that is, by default files are listed
>   in columns, sorted vertically, and special characters are represented by
>   backslash escape sequences.
> 
>  *Note ‘ls’: ls invocation.
> 
> looks like this in cmd.exe with 'type dir.txt' or 'more dir.txt':

This is because the text is encoded in UTF-8, and cmd.exe assumes it's
encoded in codepage 850.  This is not relevant.

> cat dir.txt in bash works as expected.

"In bash" when using what console window?  (Please always state these
facts, because otherwise what you tell is ambiguous and can easily
mislead.  This issue is complicated and messy enough already, we don't
need more complications and confusions.)



Re: info '(latex2e)\indent & \noindent' doesn't work with Msys2

2023-04-03 Thread Eli Zaretskii
> Date: Mon, 03 Apr 2023 19:43:54 +0300
> From: Eli Zaretskii 
> Cc: gavinsmith0...@gmail.com, bug-texinfo@gnu.org
> 
> > and Msys2 info looks like this:
> 
> Your MSYS2 Info was built without libiconv, right?

Actually, that's not it: MSYS2 build is not a MinGW build at all, so
all the machinery used by the MinGW build to detect the console
encoding and convert to it is not used there.  MSYS2 probably uses a
UTF-8 locale or somesuch.

> > So the problem persists.  The only change I see is that Msys2 info shows
> > only ' for ‘ and ’.

But previously you have shown MSYS2 result (not an image) where the
quotes appeared literally?  That sounds like yet another confusing
mess.


