Re: Encoding error for HTML output (Solaris 11)

Gavin Smith Sat, 22 Nov 2025 08:12:50 -0800

On Sat, Nov 22, 2025 at 02:48:30PM +0100, Patrice Dumas wrote:
> > The difference is in the "_0028_0029" section of the header 'id' attribute.
> > This gives the ASCII values for "()".   On GNU/Linux it is _0028_003f_0029
> > which corresponds to "(?)" - here, "?" is evidently used as a replacement
> > character for the right arrow character.
> > 
> > Neither is good output.
> 
> The sectioning commands id are not specified.  They are only supposed to
> be consistent "internally", ie it should be the right id which is used
> in generated HTML (section commands id are not often used) or available
> in user-defined code using the HTML customization API.
> 
> Therefore, in normal runs, right now speed is favored over consistency
> and the iconv "us-ascii//TRANSLIT" output is used.


OK, got it: there is no promise of stability for section anchors.

How does running UTF-8 through the "us-ascii//TRANSLIT" iconv conversion
increase speed?  Could we not just skip that step?

For example, commenting out the call to unicode_to_transliterate in
normalize_transliterate_texinfo changes the output for @expansion{}
in the 'id' from _003f to _21a6:


diff --git a/tta/C/main/node_name_normalization.c b/tta/C/main/node_name_normalization.c
index d8f922edea..762edf2734 100644
--- a/tta/C/main/node_name_normalization.c
+++ b/tta/C/main/node_name_normalization.c
@@ -372,13 +372,14 @@ normalize_transliterate_texinfo (const ELEMENT *e, int external_translit,
 {
   char *converted_name = convert_to_normalized (e);
   char *normalized_name = normalize_NFC (converted_name);
-  char *transliterated = unicode_to_transliterate (normalized_name,
-                          external_translit, in_test, no_unidecode);
-  char *result = unicode_to_protected (transliterated);
+  // char *transliterated = unicode_to_transliterate (normalized_name,
+  //                         external_translit, in_test, no_unidecode);
+  //char *result = unicode_to_protected (transliterated);
+  char *result = unicode_to_protected (normalized_name);
 
   free (converted_name);
   free (normalized_name);
-  free (transliterated);
+  // free (transliterated);
   return result;
 }

(Obviously this function is called elsewhere as well, so a new function
would probably have to be added, named something like 'normalize_texinfo'.)
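
To make the 'id' values above concrete, here is a minimal sketch of the
protection step alone, with no transliteration before it.  It is not the
code in node_name_normalization.c: 'sketch_protect' and 'decode_utf8' are
hypothetical names, the input is assumed to be valid UTF-8, the special
handling of spaces and hyphens in real node names is ignored, and the
two-underscore form for code points above U+FFFF is an assumption about
the convention.

```c
#include <stdio.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at S, store the
   code point in *CP and return the number of bytes consumed.
   Assumes S points at valid UTF-8. */
static int
decode_utf8 (const unsigned char *s, unsigned long *cp)
{
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  if ((s[0] & 0xE0) == 0xC0)
    {
      *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
      return 2;
    }
  if ((s[0] & 0xF0) == 0xE0)
    {
      *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
            | (s[2] & 0x3FUL);
      return 3;
    }
  *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
        | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
  return 4;
}

/* Hypothetical helper: copy ASCII letters and digits unchanged and write
   every other code point as _xxxx (lowercase hex).  Output is truncated
   if OUT is too small; OUT_SIZE must be at least 1. */
static void
sketch_protect (const char *in, char *out, size_t out_size)
{
  const unsigned char *p = (const unsigned char *) in;
  size_t used = 0;

  out[0] = '\0';
  /* The longest piece written per code point is 8 characters plus NUL. */
  while (*p && used + 10 < out_size)
    {
      unsigned long cp;

      p += decode_utf8 (p, &cp);
      if ((cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z')
          || (cp >= 'a' && cp <= 'z'))
        used += (size_t) sprintf (out + used, "%c", (int) cp);
      else if (cp <= 0xFFFF)
        used += (size_t) sprintf (out + used, "_%04lx", cp);
      else
        used += (size_t) sprintf (out + used, "__%06lx", cp);
    }
}
```

With this sketch, the input "(" U+21A6 ")" gives _0028_21a6_0029, while
the same text after a lossy conversion to "(?)" gives _0028_003f_0029,
which matches the difference seen in the generated 'id' attributes.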
 

> In tests, the Perl code is called.
> 
> > If I run with TEXINFO_XS=omit, the output is different: _0028_21a6_0029.
> > Here _21a6 refers to the correct character.  This is the same on both
> > Solaris 11 and GNU/Linux.
> 
> And with TEST set.
> 

I've checked and the error message isn't output with '-c TEST=1', as you
say.  I see from reading the code that with TEST, a call is made into
Perl to do the transliteration.

> At that time it was simply used as a C replacement for Text::Unidecode.
> Later on, I added the possibility to call Perl to do the transliteration
> reproducibly.  But as I said above, reproducibility is not offered for
> sectioning commands identifiers, so it remained as is in that case.

It seems that we don't need transliteration for sectioning commands,
regardless of whether it is in the C code or the Perl code.
