Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-14 Thread Gavin Smith
On Wed, Jan 14, 2015 at 5:57 PM, Eli Zaretskii  wrote:
>> Date: Wed, 14 Jan 2015 10:42:08 +
>> From: Gavin Smith 
>> Cc: Texinfo 
>>
>> int
>> wcwidth (wchar_t wc)
>> #undef wcwidth
>> {
>>   /* In UTF-8 locales, use a Unicode aware width function.  */
>>   const char *encoding = locale_charset ();
>>   if (STREQ_OPT (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
>>     {
>>       /* We assume that in a UTF-8 locale, a wide character is the same as a
>>          Unicode character.  */
>>       return uc_width (wc, encoding);
>>     }
>>   else
>>     {
>>       /* Otherwise, fall back to the system's wcwidth function.  */
>> #if HAVE_WCWIDTH
>>       return wcwidth (wc);
>> #else
>>       return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
>> #endif
>>     }
>> }
>>
>> locale_charset is called on every call to wcwidth.
>
> Yes, I know.  But only if gnulib's wcwidth is used.  Is it used on
> your platform?  AFAIK, glibc provides wcwidth, so I'd expect the
> gnulib version not to be used on your platform.

Right, I was forgetting that the gnulib version won't be used if the
function exists.

>
> In any case, I don't see why we need to call locale_charset for each
> and every character, over and over and over again.  We should call it
> once and then reuse the result, since it depends on the environment
> outside the reader, and will not change during the session, right?

I raised this issue on the gnulib list here:
http://lists.gnu.org/archive/html/bug-gnulib/2015-01/msg00040.html.

One idea is to add another function to gnulib where the encoding is
passed as a parameter.
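
A minimal sketch of what such a function might look like, assuming the
caller looks up the charset once and passes it in. The names
wcwidth_for_encoding and toy_uc_width are hypothetical and not part of
gnulib; toy_uc_width is only a crude stand-in for uc_width so that the
example is self-contained.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Crude stand-in for gnulib's uc_width, for illustration only: it knows
   about a single double-width range (CJK Unified Ideographs).  */
static int
toy_uc_width (wchar_t wc)
{
  if (wc == 0)
    return 0;
  return (wc >= 0x4E00 && wc <= 0x9FFF) ? 2 : 1;
}

/* Hypothetical variant of wcwidth: the caller passes the charset name,
   so no locale lookup happens per character.  */
static int
wcwidth_for_encoding (wchar_t wc, const char *encoding)
{
  if (encoding != NULL && strcmp (encoding, "UTF-8") == 0)
    return toy_uc_width (wc);

  /* Non-UTF-8 charset: same fallback as the #else branch quoted above.  */
  return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
}
```

The reader would then fetch the charset once per session and pass the
same pointer for every character it measures.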



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-14 Thread Eli Zaretskii
> Date: Wed, 14 Jan 2015 10:42:08 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> int
> wcwidth (wchar_t wc)
> #undef wcwidth
> {
>   /* In UTF-8 locales, use a Unicode aware width function.  */
>   const char *encoding = locale_charset ();
>   if (STREQ_OPT (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
>     {
>       /* We assume that in a UTF-8 locale, a wide character is the same as a
>          Unicode character.  */
>       return uc_width (wc, encoding);
>     }
>   else
>     {
>       /* Otherwise, fall back to the system's wcwidth function.  */
> #if HAVE_WCWIDTH
>       return wcwidth (wc);
> #else
>       return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
> #endif
>     }
> }
> 
> locale_charset is called on every call to wcwidth.

Yes, I know.  But only if gnulib's wcwidth is used.  Is it used on
your platform?  AFAIK, glibc provides wcwidth, so I'd expect the
gnulib version not to be used on your platform.

> It must be slower under a Windows system. The implementation of
> locale_charset is in the localcharset.c file from gnulib, although I
> haven't looked at it in detail, and don't know why it would be slow
> under Windows.

If I comment out the call to locale_charset in gnulib's wcwidth, and
show that the slow-down goes away, will that convince you?

In any case, I don't see why we need to call locale_charset for each
and every character, over and over and over again.  We should call it
once and then reuse the result, since it depends on the environment
outside the reader, and will not change during the session, right?
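
A minimal sketch of the "call it once and reuse the result" idea. It
assumes a POSIX system and uses nl_langinfo (CODESET) as a stand-in for
gnulib's locale_charset; the names cached_charset and charset_lookups
are hypothetical, and the counter exists only to make the caching
visible.

```c
#include <langinfo.h>
#include <string.h>

/* Number of real lookups performed so far (illustration only).  */
static int charset_lookups = 0;

/* Hypothetical helper: look up the locale's charset on the first call
   and return the saved copy on every later call.  */
static const char *
cached_charset (void)
{
  static char cached[64];
  static int have_cached = 0;

  if (!have_cached)
    {
      const char *cs = nl_langinfo (CODESET);
      strncpy (cached, cs != NULL ? cs : "", sizeof cached - 1);
      cached[sizeof cached - 1] = '\0';
      have_cached = 1;
      charset_lookups++;
    }
  return cached;
}
```

This is only valid under the assumption stated above, namely that the
locale does not change during the session.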



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-14 Thread Gavin Smith
On Sat, Jan 10, 2015 at 3:31 PM, Eli Zaretskii  wrote:
>> Date: Thu, 8 Jan 2015 11:00:40 +
>> From: Gavin Smith 
>> Cc: Texinfo 
>>
>> On Sat, Jan 3, 2015 at 3:29 PM, Eli Zaretskii  wrote:
>> > As you see, wcwidth and locale_charset, both from gnulib in my build,
>> > take 75% of the time.
>> >
>> > I thought that perhaps the reason was that your locale is UTF-8, so
>> > Info doesn't need to convert text using libiconv in your locale.  But
>> > removing the UTF-8 encoding tag from the ELisp Info file didn't have
>> > any visible effect on the delay, so that's not it.
>> >
>> > Suggestions for further digging into this are welcome.
>>
>> One way to check would be to comment out the call to wcwidth, and
>> replace with the value "1", and see if it is still slow.
>
> Doing this completely solves the problem.
>
> Moreover, if I replace the line that calls wcwidth:
>
>   *pchars = wcwidth ((*iter).cur.wc);
>
> with what constitutes the body of the gnulib implementation, i.e.:
>
>   *pchars = (*iter).cur.wc ? (iswprint ((*iter).cur.wc) ? 1 : -1) : 0;
>
> I don't see any slowdown, either.
>
> My conclusion is that the call to locale_charset inside gnulib's
> wcwidth is the reason for the slowdown, because my locale's charset
> is not UTF-8.
>
> Thanks.

Here's the body of the gnulib wcwidth:

int
wcwidth (wchar_t wc)
#undef wcwidth
{
  /* In UTF-8 locales, use a Unicode aware width function.  */
  const char *encoding = locale_charset ();
  if (STREQ_OPT (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
    {
      /* We assume that in a UTF-8 locale, a wide character is the same as a
         Unicode character.  */
      return uc_width (wc, encoding);
    }
  else
    {
      /* Otherwise, fall back to the system's wcwidth function.  */
#if HAVE_WCWIDTH
      return wcwidth (wc);
#else
      return wc == 0 ? 0 : iswprint (wc) ? 1 : -1;
#endif
    }
}

locale_charset is called on every call to wcwidth. It must be slower under a
Windows system. The implementation of locale_charset is in the
localcharset.c file from gnulib, although I haven't looked at it in
detail, and don't know why it would be slow under Windows.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-10 Thread Eli Zaretskii
> Date: Thu, 8 Jan 2015 11:00:40 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> On Sat, Jan 3, 2015 at 3:29 PM, Eli Zaretskii  wrote:
> > As you see, wcwidth and locale_charset, both from gnulib in my build,
> > take 75% of the time.
> >
> > I thought that perhaps the reason was that your locale is UTF-8, so
> > Info doesn't need to convert text using libiconv in your locale.  But
> > removing the UTF-8 encoding tag from the ELisp Info file didn't have
> > any visible effect on the delay, so that's not it.
> >
> > Suggestions for further digging into this are welcome.
> 
> One way to check would be to comment out the call to wcwidth, and
> replace with the value "1", and see if it is still slow.

Doing this completely solves the problem.

Moreover, if I replace the line that calls wcwidth:

  *pchars = wcwidth ((*iter).cur.wc);

with what constitutes the body of the gnulib implementation, i.e.:
 
  *pchars = (*iter).cur.wc ? (iswprint ((*iter).cur.wc) ? 1 : -1) : 0;

I don't see any slowdown, either.

My conclusion is that the call to locale_charset inside gnulib's
wcwidth is the reason for the slowdown, because my locale's charset
is not UTF-8.

Thanks.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-08 Thread Gavin Smith
On Sat, Jan 3, 2015 at 3:29 PM, Eli Zaretskii  wrote:
> As you see, wcwidth and locale_charset, both from gnulib in my build,
> take 75% of the time.
>
> I thought that perhaps the reason was that your locale is UTF-8, so
> Info doesn't need to convert text using libiconv in your locale.  But
> removing the UTF-8 encoding tag from the ELisp Info file didn't have
> any visible effect on the delay, so that's not it.
>
> Suggestions for further digging into this are welcome.

One way to check would be to comment out the call to wcwidth, and
replace with the value "1", and see if it is still slow. The calls to
mb_isprint (from gnulib) could be slow as well, although they were
there in the last release, I believe.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-03 Thread Gavin Smith
On Sat, Jan 3, 2015 at 3:29 PM, Eli Zaretskii  wrote:
> The bottleneck is clearly process_node_text: it takes more than 1 sec
> when the node is "Index" in the ELisp manual.
>
> I timed the loop in process_node_text, and it takes about 0.22 msec
> per line on the average, and there are 5700 lines in that node.
>
> I tried to find the culprit in that loop, but it's hard to time such
> small intervals reliably.  My gut feeling is that the call to
> printed_representation is the reason: we call that function once for
> each character on the line.  But I cannot prove that, and I cannot
> explain why you don't see the same delay.  Perhaps the reason is that
> some functions called by printed_representation, which in my build are
> supplied by gnulib, are much faster in glibc.  This is based on the
> following profile I get from gprof:
>
>   Each sample counts as 0.01 seconds.
>     %   cumulative   self              self     total
>    time   seconds   seconds    calls  Ts/call  Ts/call  name
>    50.00      0.02     0.02                             locale_charset
>    25.00      0.03     0.01                             wcwidth
>    12.50      0.04     0.01                             add_file_directory_to_path
>    12.50      0.04     0.01                             main
>     0.00      0.04     0.00   415064     0.00     0.00  printed_representation
>     0.00      0.04     0.00   229065     0.00     0.00  reset_conversion
>     0.00      0.04     0.00    61093     0.00     0.00  text_buffer_alloc
>     0.00      0.04     0.00    51681     0.00     0.00  text_buffer_iconv
>     0.00      0.04     0.00    51681     0.00     0.00  text_buffer_space_left
>     0.00      0.04     0.00    31362     0.00     0.00  skip_whitespace
>     0.00      0.04     0.00    19793     0.00     0.00  read_quoted_string
>
> As you see, wcwidth and locale_charset, both from gnulib in my build,
> take 75% of the time.

The calls to wcwidth are new; previously there was no handling of
characters that span two display columns.

I am going to simplify the code in process_node_text to only deal with
calculating the line starts: it is generic code that was previously
used for the screen update as well. If that doesn't produce a speed-up
then it could be slow because it is calling wcwidth on every single
character.
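
A minimal sketch of a pass that only records line starts, to show how
little work that step needs on its own. The name find_line_starts is
hypothetical and this is not the code in process_node_text: it ignores
screen-width wrapping and character widths, which the real code has to
deal with.

```c
#include <stddef.h>

/* Hypothetical helper: record the byte offset at which each line of
   TEXT begins.  Stores at most MAX offsets in STARTS and returns the
   total number of lines found.  */
static size_t
find_line_starts (const char *text, size_t len, size_t *starts, size_t max)
{
  size_t count = 0;
  size_t i;

  if (len == 0)
    return 0;

  /* The first line always starts at offset 0.  */
  if (count < max)
    starts[count] = 0;
  count++;

  /* Every newline that is not the last byte starts another line.  */
  for (i = 0; i + 1 < len; i++)
    if (text[i] == '\n')
      {
        if (count < max)
          starts[count] = i + 1;
        count++;
      }
  return count;
}
```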



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-03 Thread Eli Zaretskii
> Date: Fri, 02 Jan 2015 18:05:40 +0200
> From: Eli Zaretskii 
> Cc: bug-texinfo@gnu.org
> 
> > Date: Fri, 2 Jan 2015 15:47:50 +
> > From: Gavin Smith 
> > Cc: Texinfo 
> > 
> > On Fri, Jan 2, 2015 at 11:26 AM, Eli Zaretskii  wrote:
> > > However, I see an annoying delay when going to a different node.  For
> > > example, in the Emacs Lisp manual, go to the "Index" node, move to the
> > > end of the index, type RET on one of the last entries, then type 'l'
> > > to go back.  Both going to Index and going back causes a visible delay,
> > > which is about 1.5 sec here (this is a Core-i7 box).
> > 
> > I couldn't reproduce this. There is a slight delay for me when loading
> > the Index node for the first time, but not when going back. I would
> > guess it is because the Index node is a long node and it is
> > doing something when displaying it that takes a long time (like
> > calculating the positions of the starts of lines).
> 
> I don't think so: the Info viewer from Texinfo 4.13 doesn't have this
> problem, I tried with the same files.
> 
> > I can't see how this would be related to CR removal. Do other nodes
> > in the file also have this problem?
> 
> OK, I will take a closer look.

The bottleneck is clearly process_node_text: it takes more than 1 sec
when the node is "Index" in the ELisp manual.

I timed the loop in process_node_text, and it takes about 0.22 msec
per line on the average, and there are 5700 lines in that node.

I tried to find the culprit in that loop, but it's hard to time such
small intervals reliably.  My gut feeling is that the call to
printed_representation is the reason: we call that function once for
each character on the line.  But I cannot prove that, and I cannot
explain why you don't see the same delay.  Perhaps the reason is that
some functions called by printed_representation, which in my build are
supplied by gnulib, are much faster in glibc.  This is based on the
following profile I get from gprof:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total
   time   seconds   seconds    calls  Ts/call  Ts/call  name
   50.00      0.02     0.02                             locale_charset
   25.00      0.03     0.01                             wcwidth
   12.50      0.04     0.01                             add_file_directory_to_path
   12.50      0.04     0.01                             main
    0.00      0.04     0.00   415064     0.00     0.00  printed_representation
    0.00      0.04     0.00   229065     0.00     0.00  reset_conversion
    0.00      0.04     0.00    61093     0.00     0.00  text_buffer_alloc
    0.00      0.04     0.00    51681     0.00     0.00  text_buffer_iconv
    0.00      0.04     0.00    51681     0.00     0.00  text_buffer_space_left
    0.00      0.04     0.00    31362     0.00     0.00  skip_whitespace
    0.00      0.04     0.00    19793     0.00     0.00  read_quoted_string

As you see, wcwidth and locale_charset, both from gnulib in my build,
take 75% of the time.

I thought that perhaps the reason was that your locale is UTF-8, so
Info doesn't need to convert text using libiconv in your locale.  But
removing the UTF-8 encoding tag from the ELisp Info file didn't have
any visible effect on the delay, so that's not it.

Suggestions for further digging into this are welcome.
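
One way to time intervals that are too small to measure individually
is to repeat the suspect call many times and divide. A minimal sketch,
assuming the call can be repeated without side effects; average_msec
and sample_work are hypothetical names, and sample_work is only a
placeholder for the code under test.

```c
#include <time.h>

/* Placeholder workload; a real test would call the suspect function.  */
static volatile unsigned long sink = 0;

static void
sample_work (void)
{
  sink += 1;
}

/* Hypothetical helper: run FN REPS times and return the average CPU
   time per call in milliseconds.  */
static double
average_msec (void (*fn) (void), long reps)
{
  clock_t start = clock ();
  long i;

  for (i = 0; i < reps; i++)
    fn ();

  return (double) (clock () - start) * 1000.0
         / (double) CLOCKS_PER_SEC / (double) reps;
}
```

The resolution of clock () is coarse, so the repetition count has to
be large enough for the total to span many clock ticks.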



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-02 Thread Eli Zaretskii
> Date: Fri, 2 Jan 2015 15:47:50 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> On Fri, Jan 2, 2015 at 11:26 AM, Eli Zaretskii  wrote:
> > However, I see an annoying delay when going to a different node.  For
> > example, in the Emacs Lisp manual, go to the "Index" node, move to the
> > end of the index, type RET on one of the last entries, then type 'l'
> > to go back.  Both going to Index and going back causes a visible delay,
> > which is about 1.5 sec here (this is a Core-i7 box).
> 
> I couldn't reproduce this. There is a slight delay for me when loading
> the Index node for the first time, but not when going back. I would
> guess it is because the Index node is a long node and it is
> doing something when displaying it that takes a long time (like
> calculating the positions of the starts of lines).

I don't think so: the Info viewer from Texinfo 4.13 doesn't have this
problem, I tried with the same files.

> I can't see how this would be related to CR removal. Do other nodes
> in the file also have this problem?

OK, I will take a closer look.

> Have you tried any Info files without CR-LF line endings?

Yes, it doesn't matter.  The delay is there regardless.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-02 Thread Gavin Smith
On Fri, Jan 2, 2015 at 11:26 AM, Eli Zaretskii  wrote:
> However, I see an annoying delay when going to a different node.  For
> example, in the Emacs Lisp manual, go to the "Index" node, move to the
> end of the index, type RET on one of the last entries, then type 'l'
> to go back.  Both going to Index and going back causes a visible delay,
> which is about 1.5 sec here (this is a Core-i7 box).

I couldn't reproduce this. There is a slight delay for me when loading
the Index node for the first time, but not when going back. I would
guess it is because the Index node is a long node and it is
doing something when displaying it that takes a long time (like
calculating the positions of the starts of lines). I can't see how
this would be related to CR removal. Do other nodes in the file also
have this problem? Have you tried any Info files without CR-LF line
endings?



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-02 Thread Eli Zaretskii
> Date: Thu, 1 Jan 2015 18:38:07 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> On Sun, Dec 28, 2014 at 8:57 PM, Eli Zaretskii  wrote:
> >
> > There's still a problem of Info files produced by makeinfo 4.x on
> > Windows -- these will not be reliably readable with the new
> > stand-alone Info.  I think we need to have a solution for that as
> > well, at least as a user option, if not automatically.  (One way of
> > doing that automatically would be to detect the 4.x version from the
> > file's preamble.)
> 
> I've committed some changes that will likely work for these files as
> well as the other files. It worked for me on the files I've tried.
> Please have a try.

Thanks, it seems to work.

However, I see an annoying delay when going to a different node.  For
example, in the Emacs Lisp manual, go to the "Index" node, move to the
end of the index, type RET on one of the last entries, then type 'l'
to go back.  Both going to Index and going back causes a visible delay,
which is about 1.5 sec here (this is a Core-i7 box).

Does this have something to do with how you remove the CR characters
in the current trunk?

> (While testing these changes I noticed that there doesn't seem to be
> any support in the standalone Info reader for varying lengths of file
> preambles in a split file (some mention of that here:
> http://lists.gnu.org/archive/html/bug-texinfo/2013-08/msg00040.html),
> so there could be some more changes coming to do with tag table
> processing.)

Yes, different-size preambles are a PITA.

Thanks.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2015-01-01 Thread Gavin Smith
On Sun, Dec 28, 2014 at 8:57 PM, Eli Zaretskii  wrote:
>
> There's still a problem of Info files produced by makeinfo 4.x on
> Windows -- these will not be reliably readable with the new
> stand-alone Info.  I think we need to have a solution for that as
> well, at least as a user option, if not automatically.  (One way of
> doing that automatically would be to detect the 4.x version from the
> file's preamble.)

I've committed some changes that will likely work for these files as
well as the other files. It worked for me on the files I've tried.
Please have a try.

(While testing these changes I noticed that there doesn't seem to be
any support in the standalone Info reader for varying lengths of file
preambles in a split file (some mention of that here:
http://lists.gnu.org/archive/html/bug-texinfo/2013-08/msg00040.html),
so there could be some more changes coming to do with tag table
processing.)



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-28 Thread Eli Zaretskii
> Date: Sun, 28 Dec 2014 20:06:54 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> On Fri, Dec 26, 2014 at 9:52 PM, Eli Zaretskii  wrote:
> > It's a broken file.  I have no idea how they produced it, but it
> > wasn't by stock makeinfo 4.8 on Windows, because that version already
> > did both count byte offsets in makeinfo disregarding the CR
> > characters, and had the EOL conversion function in the Info reader.  I
> > just checked its code, which I still have on my disk.
> >
> 
> I couldn't quickly find the code in C makeinfo for this - is it
> something to do with file modes under Windows?

Yes.  Makeinfo simply counted the bytes in memory, and the CR
characters were added by C library functions as result of text-mode
writes.
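
A minimal sketch of the resulting mismatch, assuming every LF written
in text mode becomes CR-LF on disk. The name text_mode_file_offset is
hypothetical; it maps an offset counted in memory to the offset the
same byte has in the file.

```c
#include <stddef.h>

/* Hypothetical helper: given the text as held in memory (LF line
   endings) and an offset into it, return the offset of the same byte
   in a file written in text mode, where each LF is preceded by an
   added CR.  */
static size_t
text_mode_file_offset (const char *mem, size_t mem_offset)
{
  size_t newlines = 0;
  size_t i;

  for (i = 0; i < mem_offset; i++)
    if (mem[i] == '\n')
      newlines++;

  /* One extra byte on disk for every newline before the offset.  */
  return mem_offset + newlines;
}
```

The tag table holds the in-memory count, so it matches the file only
after the reader has stripped the CRs again.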

> You are probably right that it wasn't produced by makeinfo under
> Windows, but I did reproduce something similar when running makeinfo
> 4.13 under GNU/Linux with a Texinfo source file with CR-LF line
> endings. See the attached input and output files. The whitespace in
> the output Info file doesn't make a lot of sense, but the point is
> that the preamble of the info file does contain a line with a CR-LF
> ending, but the tag table doesn't take this into account - the node
> separator is at byte 113 of the file exactly. It's possible that this
> file was produced in a similar way.

Maybe.  There's of course any number of ways to produce a broken Info
file.

> There may be similar results if a file has mixed kinds of line endings
> (or if it includes other files with different line endings). We can't
> exactly say that the tag tables in files like these are "incorrect".
> Same goes for files produced under Windows where the CR bytes aren't
> counted. We're just left with the problem of loading the files that
> are out there properly.

The most important requirement is to be able to read and display files
that were produced from valid Texinfo sources, either on Unix or on
Windows, and do it in a way that will work in at least the stand-alone
Info and in Emacs.  The situation we have right now with texi2any
doesn't fulfill this requirement, which is not good, I think.

> > Its tag table accounts for the CR characters, which is wrong.  That's
> > why the Info reader from 4.13 cannot read it correctly.  And that's
> > exactly what will happen with Info files created by makeinfo 5.2 when
> > someone tries to read them with Info from 4.13.
> >
> > Moreover, the same problem will happen with the Emacs Info reader.
> > Emacs removes the CR characters when it reads files into buffers (any
> > files, not just Info files), so it must have the tag table with
> > offsets that disregard the CRs.
> 
> If it turns out there are files out there where the 1000-byte slack in
> looking for a node isn't enough, we could tweak it, maybe by
> increasing the slack as we get later on in the file. Maybe something
> similar could be done in Emacs Info. If we could stop makeinfo
> producing files with CR bytes it would stop this problem for newly
> produced files.

There's still a problem of Info files produced by makeinfo 4.x on
Windows -- these will not be reliably readable with the new
stand-alone Info.  I think we need to have a solution for that as
well, at least as a user option, if not automatically.  (One way of
doing that automatically would be to detect the 4.x version from the
file's preamble.)
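
A minimal sketch of such a check, assuming the preamble contains the
phrase "produced by makeinfo version N.M" on a single line, as in the
gnucobpg.info preamble quoted elsewhere in this thread. The name
makeinfo_major_version is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the major version of the makeinfo that
   produced the file, or -1 if the preamble does not say.  */
static int
makeinfo_major_version (const char *preamble)
{
  static const char marker[] = "produced by makeinfo version ";
  const char *p = strstr (preamble, marker);
  int major;

  if (p == NULL)
    return -1;
  /* sizeof marker counts the terminating NUL, hence the - 1.  */
  if (sscanf (p + sizeof marker - 1, "%d", &major) != 1)
    return -1;
  return major;
}
```

A result of 4 would select the old interpretation of the tag table. A
preamble that wraps in the middle of the marker phrase would not be
recognized by this sketch.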



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-28 Thread Gavin Smith
On Fri, Dec 26, 2014 at 9:52 PM, Eli Zaretskii  wrote:
> It's a broken file.  I have no idea how they produced it, but it
> wasn't by stock makeinfo 4.8 on Windows, because that version already
> did both count byte offsets in makeinfo disregarding the CR
> characters, and had the EOL conversion function in the Info reader.  I
> just checked its code, which I still have on my disk.
>

I couldn't quickly find the code in C makeinfo for this - is it
something to do with file modes under Windows?

You are probably right that it wasn't produced by makeinfo under
Windows, but I did reproduce something similar when running makeinfo
4.13 under GNU/Linux with a Texinfo source file with CR-LF line
endings. See the attached input and output files. The whitespace in
the output Info file doesn't make a lot of sense, but the point is
that the preamble of the info file does contain a line with a CR-LF
ending, but the tag table doesn't take this into account - the node
separator is at byte 113 of the file exactly. It's possible that this
file was produced in a similar way.

There may be similar results if a file has mixed kinds of line endings
(or if it includes other files with different line endings). We can't
exactly say that the tag tables in files like these are "incorrect".
Same goes for files produced under Windows where the CR bytes aren't
counted. We're just left with the problem of loading the files that
are out there properly.

> Its tag table accounts for the CR characters, which is wrong.  That's
> why the Info reader from 4.13 cannot read it correctly.  And that's
> exactly what will happen with Info files created by makeinfo 5.2 when
> someone tries to read them with Info from 4.13.
>
> Moreover, the same problem will happen with the Emacs Info reader.
> Emacs removes the CR characters when it reads files into buffers (any
> files, not just Info files), so it must have the tag table with
> offsets that disregard the CRs.

If it turns out there are files out there where the 1000-byte slack in
looking for a node isn't enough, we could tweak it, maybe by
increasing the slack as we get later on in the file. Maybe something
similar could be done in Emacs Info. If we could stop makeinfo
producing files with CR bytes it would stop this problem for newly
produced files.
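
A minimal sketch of a search whose slack grows with the offset. The
growth rule (one extra byte of slack per 50 bytes into the file) and
the name find_node_near are hypothetical, and the sketch looks for the
node separator byte rather than matching the node name as a real
reader would.

```c
#include <stddef.h>

#define NODE_SEPARATOR '\037'   /* the ^_ byte that starts a node */
#define BASE_SLACK 1000         /* the existing 1000-byte slack */
#define SLACK_GROWTH 50         /* hypothetical: +1 byte per 50 bytes */

/* Hypothetical helper: return the offset of the node separator nearest
   to TAG_OFFSET, searching outwards in both directions, or -1 if none
   lies within the allowed slack.  */
static long
find_node_near (const char *buf, size_t len, size_t tag_offset)
{
  size_t slack = BASE_SLACK + tag_offset / SLACK_GROWTH;
  size_t dist;

  for (dist = 0; dist <= slack; dist++)
    {
      if (tag_offset + dist < len
          && buf[tag_offset + dist] == NODE_SEPARATOR)
        return (long) (tag_offset + dist);
      if (dist <= tag_offset && tag_offset - dist < len
          && buf[tag_offset - dist] == NODE_SEPARATOR)
        return (long) (tag_offset - dist);
    }
  return -1;
}
```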


cr-lf-endings-4.texi
Description: TeXInfo document


cr-lf-endings-4.info
Description: Binary data


Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-26 Thread Eli Zaretskii
> Date: Fri, 26 Dec 2014 16:48:00 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> I discovered this problem with the "gnucobpg.info" file that is part
> of GNU Cobol (downloadable at
> http://opencobol.add1tocobol.com/guides/), which has many CR-LF line
> endings (but not consistently). I don't know exactly how this file was
> generated - the file preamble says
> 
> This is gnucobpg.info, produced by makeinfo version 4.8 from
> gnucobpg.texi.

It's a broken file.  I have no idea how they produced it, but it
wasn't by stock makeinfo 4.8 on Windows, because that version already
did both count byte offsets in makeinfo disregarding the CR
characters, and had the EOL conversion function in the Info reader.  I
just checked its code, which I still have on my disk.

> - anyway, I had the problem mentioned that I found I couldn't access
> later nodes in the file. I tested just now with info 4.13 and wasn't
> able to access the "Alphabet-Name-Clause" or anything later in the
> file. That's the only Info file I remember encountering containing
> many CR bytes.

Its tag table accounts for the CR characters, which is wrong.  That's
why the Info reader from 4.13 cannot read it correctly.  And that's
exactly what will happen with Info files created by makeinfo 5.2 when
someone tries to read them with Info from 4.13.

Moreover, the same problem will happen with the Emacs Info reader.
Emacs removes the CR characters when it reads files into buffers (any
files, not just Info files), so it must have the tag table with
offsets that disregard the CRs.
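
A minimal sketch of the kind of CR removal meant here, done in place on
a buffer that has already been read. The name strip_cr_before_lf is
hypothetical; this is not the code from Emacs or from the Info reader.

```c
#include <stddef.h>

/* Hypothetical helper: remove every CR that is immediately followed by
   LF, in place.  Returns the new length of the text.  */
static size_t
strip_cr_before_lf (char *buf, size_t len)
{
  size_t in;
  size_t out = 0;

  for (in = 0; in < len; in++)
    {
      /* Drop the CR of a CR-LF pair; keep a lone CR untouched.  */
      if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
        continue;
      buf[out++] = buf[in];
    }
  return out;
}
```

After this pass every byte following the first CR-LF pair sits at a
smaller offset than it had in the file, which is why the tag table
must not count the CRs.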

> Since this claims to be produced by the 4.8 version (not 5.x), whether
> the CR characters are counted in the tag table must depend on other,
> unknown factors.

I don't think we can or should try fixing broken Info files.  We
certainly shouldn't introduce new breakage into valid files because of
that.

> It could be helpful to make the GNU Cobol developers aware of this.

Agreed.

> >  . fix texi2any to produce tag tables that assume the CR characters
> >are stripped from the Info file (my reading of the code is that it
> >should not count CR characters before LF for the purposes of
> >count_context value; or maybe it should simply open the Info output
> >in 'unix' mode)
> 
> The tag table containing the exact byte offsets is a lot simpler than
> having to remove all of the CR characters (or just CR characters
> before LF), and therefore less prone to incorrect implementation by
> any other Info-reading or -writing programs that might be written.

See above: we are breaking the Emacs Info reader, which is the other
reader important to the GNU project.  And we are creating an
interoperability problem vis-a-vis older versions of Texinfo.  I think
this is too high a price to pay.

> It enables accessing the correct place in the file without
> processing the entire file first. This could enable faster access of
> nodes by memory-mapping a file. Most of the time speed isn't an
> issue, but it's an idea I've had for speeding up searching the
> indices of all installed Info files at once. It could also be used
> to access a node of an Info file over a slow or expensive network
> connection without having to download the entire file.

I'm okay with these goals, but I don't think they are worth the
breakage mentioned above.  Some of the goals can be met even without
removing CRs, e.g., by using a larger slack when using the offsets
from the tag tables.

> I hope it's possible to make changes to the standalone Info reader to
> make it possible to access files with CR-LF line endings without
> having to interpret the tag table this way. At the same time, if it's
> easy to avoid outputting files with CR-LF line endings under Windows,
> then I think we should do so.

Changing makeinfo to output a Unix-style file will solve some of the
problems, yes.  I hope Patrice is reading this, and will comment.  But
the interoperability problem with files created by older makeinfo
versions will stay.  Maybe we should add an optional switch to Info to
give the user control of this.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-26 Thread Gavin Smith
On Thu, Dec 25, 2014 at 3:51 PM, Eli Zaretskii  wrote:
> Today I discovered that the Info reader built from the current trunk
> cannot display any Info file that was produced natively on Windows (as
> opposed to Info files that come from distribution tarballs, which were
> produced on Unix).  The reader says it cannot find the Top node in any
> such Info file.
>
> It turned out this is because the code which stripped CR characters
> from CR-LF pairs, once the file was read, was #ifdef'ed away (in
> revision 5888), evidently due to a failure of a test that checks node
> accessibility through tag tables without the 1000-character slack.
>
> (I didn't find in bug-texinfo any discussion of the original problem
> or the change that was made to solve it.  Neither do I see anything
> pertinent in the bug database.  Did I miss something?  What or who
> triggered that change?)

I discovered this problem with the "gnucobpg.info" file that is part
of GNU Cobol (downloadable at
http://opencobol.add1tocobol.com/guides/), which has many CR-LF line
endings (but not consistently). I don't know exactly how this file was
generated - the file preamble says

This is gnucobpg.info, produced by makeinfo version 4.8 from
gnucobpg.texi.

- anyway, I had the problem mentioned that I found I couldn't access
later nodes in the file. I tested just now with info 4.13 and wasn't
able to access the "Alphabet-Name-Clause" or anything later in the
file. That's the only Info file I remember encountering containing
many CR bytes.

Since this claims to be produced by the 4.8 version (not 5.x), whether
the CR characters are counted in the tag table must depend on other,
unknown factors.

(It could be helpful to make the GNU Cobol developers aware of this. I
haven't been able to quickly find an email address for them - if
anyone knows could they let them know?)

>  . fix texi2any to produce tag tables that assume the CR characters
>are stripped from the Info file (my reading of the code is that it
>should not count CR characters before LF for the purposes of
>count_context value; or maybe it should simply open the Info output
>in 'unix' mode)

The tag table containing the exact byte offsets is a lot simpler than
having to remove all of the CR characters (or just CR characters
before LF), and therefore less prone to incorrect implementation by
any other Info-reading or -writing programs that might be written. It
enables accessing the correct place in the file without processing the
entire file first. This could enable faster access of nodes by
memory-mapping a file. Most of the time speed isn't an issue, but it's
an idea I've had for speeding up searching the indices of all
installed Info files at once. It could also be used to access a node
of an Info file over a slow or expensive network connection without
having to download the entire file.
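
A minimal sketch of that kind of direct access, assuming the tag table
offset is an exact byte offset into the file. It uses fseek as a
simple stand-in for memory-mapping, and the name read_at_offset is
hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read up to MAX bytes starting at byte OFFSET of
   FP, without reading anything that comes before it.  Returns the
   number of bytes read, or 0 on error.  */
static size_t
read_at_offset (FILE *fp, long offset, char *out, size_t max)
{
  if (fseek (fp, offset, SEEK_SET) != 0)
    return 0;
  return fread (out, 1, max, fp);
}
```

The file would have to be opened in binary mode for the byte offsets
to be meaningful on systems that translate line endings.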

I hope it's possible to make changes to the standalone Info reader to
make it possible to access files with CR-LF line endings without
having to interpret the tag table this way. At the same time, if it's
easy to avoid outputting files with CR-LF line endings under Windows,
then I think we should do so.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-26 Thread Eli Zaretskii
> Date: Thu, 25 Dec 2014 18:05:25 +
> From: Gavin Smith 
> Cc: Texinfo 
> 
> On Thu, Dec 25, 2014 at 3:51 PM, Eli Zaretskii  wrote:
> > Today I discovered that the Info reader built from the current trunk
> > cannot display any Info file that was produced natively on Windows (as
> > opposed to Info files that come from distribution tarballs, which were
> > produced on Unix).  The reader says it cannot find the Top node in any
> > such Info file.
> 
> I've committed a change that seemed to fix the problem on a small test
> file I made.

Thanks, it mostly did work for me too, although I did only limited
testing with --strict-node-location.  I say "mostly" because I did
find a few glitches:

 . Info still cannot find the Top node in a split manual (a manual
   produced with --no-split does work).

 . A cross-reference whose target node crosses a newline cannot be
   followed.  The example I have is in the Emacs Lisp manual, node
   "Lisp Data Types".  There's a cross-reference there to "Editing
   Types" which has a newline between the two words.  When I try
   following it, Info shows me this:

  File: *manpages*,  Node: Editing
   Types,  Up: (dir)

  No manual entry for Editing.
  No manual entry for Types.

 . The file's encoding is not recognized when the file has CR-LF EOLs

I looked through the sources, and saw a few additional places where \r
before \n might need to be handled (some of them might be the reason
for the above-mentioned problems):

  'avoid_see_see', where it skips whitespace
  'scan_reference_label', where it turns off underlining at EOL
  'get_file_character_encoding', in the call to 'search_forward'

There might be more places, I'm not sure.
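
A minimal sketch of the kind of handling meant here, applied to the
"Editing Types" case above: treat \r like any other whitespace when
collapsing a node name that spans a line break. The names is_ref_space
and canonicalize_node_name are hypothetical and are not functions from
the Info sources.

```c
#include <stddef.h>

/* Whitespace inside a reference label, including the CR of CR-LF.  */
static int
is_ref_space (char c)
{
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Hypothetical helper: copy SRC to DST, turning each run of whitespace
   into a single space and dropping leading and trailing whitespace.
   DST must have room for strlen (SRC) + 1 bytes.  */
static void
canonicalize_node_name (const char *src, char *dst)
{
  char *out = dst;
  int pending_space = 0;

  for (; *src != '\0'; src++)
    {
      if (is_ref_space (*src))
        pending_space = 1;
      else
        {
          if (pending_space && out != dst)
            *out++ = ' ';
          pending_space = 0;
          *out++ = *src;
        }
    }
  *out = '\0';
}
```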

> I'll comment more in coming days.

Thanks, I hope we will be able to discuss the broader picture then.
As things stand, I don't see how fixing Info to skip/ignore CR will
solve the fundamental issue of incompatibility between the Emacs and
the stand-alone Info readers wrt byte offsets in tag tables.



Re: Standalone Info reader cannot read Info files with CR-LF EOLs

2014-12-25 Thread Gavin Smith
On Thu, Dec 25, 2014 at 3:51 PM, Eli Zaretskii  wrote:
> Today I discovered that the Info reader built from the current trunk
> cannot display any Info file that was produced natively on Windows (as
> opposed to Info files that come from distribution tarballs, which were
> produced on Unix).  The reader says it cannot find the Top node in any
> such Info file.

I've committed a change that seemed to fix the problem on a small test
file I made. I'll comment more in coming days.