Re: Encoding error for HTML output (Solaris 11)

Gavin Smith Sat, 22 Nov 2025 02:05:26 -0800

On Sat, Nov 22, 2025 at 10:29:24AM +0100, Patrice Dumas wrote:
> On Fri, Nov 21, 2025 at 11:04:36PM +0000, Gavin Smith wrote:
> > Is there some internal conversion done on section titles that doesn't show
> > up in the output?
> 
> Indeed, there is, as shown by the trace in your other email, the
> normalization of 'HTML Cross-references' is used for section arguments
> to get a string that can be used as target.  This would also happen with
> @expansion on @node line.


Indeed, there is a difference between Solaris 11 output and GNU/Linux.
On Solaris 11:

<h2 class="chapter subsection-level-set-chapter" 
id="g_t_0040expansion_007b_007d-_0028_0029_003a-Indicating-an-Expansion">

On GNU/Linux:

<h2 class="chapter subsection-level-set-chapter" 
id="g_t_0040expansion_007b_007d-_0028_003f_0029_003a-Indicating-an-Expansion">

The difference is in the "_0028_0029" part of the header 'id' attribute.
These are the hexadecimal code points for "()", so on Solaris 11 the
character between the parentheses is dropped entirely.  On GNU/Linux it
is _0028_003f_0029, which corresponds to "(?)" - here, "?" is evidently
used as a replacement character for the right arrow character.

Neither is good output.

If I run with TEXINFO_XS=omit, the output is different: _0028_21a6_0029.
Here _21a6 refers to the correct character.  This is the same on both
Solaris 11 and GNU/Linux.
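
To make the three 'id' fragments above easier to compare, here is a small
sketch that turns the "_xxxx" escapes back into characters.  It is not taken
from Texinfo: the function names 'put_utf8' and 'decode_id_escapes' are made
up for illustration, and it only handles the four-hex-digit form seen in
this thread (code points up to U+FFFF).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Write code point CP (at most 0xFFFF) to OUT as UTF-8.
   Returns the number of bytes written.  Hypothetical helper. */
static size_t
put_utf8 (unsigned long cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  out[0] = (char) (0xE0 | (cp >> 12));
  out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[2] = (char) (0x80 | (cp & 0x3F));
  return 3;
}

/* Decode every "_xxxx" escape (underscore plus four hex digits) in ID
   into UTF-8; all other bytes are copied unchanged.  OUT must have room
   for strlen (ID) + 1 bytes, which is enough because each five-byte
   escape yields at most three bytes.  Hypothetical helper. */
void
decode_id_escapes (const char *id, char *out)
{
  while (*id)
    {
      if (id[0] == '_'
          && isxdigit ((unsigned char) id[1])
          && isxdigit ((unsigned char) id[2])
          && isxdigit ((unsigned char) id[3])
          && isxdigit ((unsigned char) id[4]))
        {
          char hex[5];
          memcpy (hex, id + 1, 4);
          hex[4] = '\0';
          out += put_utf8 (strtoul (hex, NULL, 16), out);
          id += 5;
        }
      else
        *out++ = *id++;
    }
  *out = '\0';
}
```

Calling decode_id_escapes on "_0028_0029", "_0028_003f_0029" and
"_0028_21a6_0029" gives "()", "(?)" and "(" U+21A6 ")" respectively, which
matches the reading of the three outputs given above.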

Hence there is a clear bug here with inconsistent output between XS and
pure Perl code, with the pure Perl output being superior.

> Expansion to UTF-8 does not happen in the remaining of the output
> presumably because textual entities are used.  
> 
> You could set OUTPUT_CHARACTERS customization variable to have
> characters output instead of textual entities.  It could help determine
> if the issue is only with cross references normalization, or more
> general.  There are tests that test OUTPUT_CHARACTERS in the test suite.
> 
> I had a look at the opencsw test results and there is no messages/errors
> like that.  Maybe the iconv library used is different?

The error appears to come from the use of the "us-ascii//TRANSLIT"
encoding in 'unicode_to_transliterate' in main/node_name_normalization.c.
My guess is that this system either doesn't have such an encoding or
doesn't support transliteration for some characters.
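
One way to check this guess outside of texi2any would be a standalone iconv
probe along the following lines.  This is only a sketch using the POSIX
iconv interface, not the code from node_name_normalization.c; the function
name 'probe_translit' is invented here, and UTF-8 input is an assumption.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Convert the UTF-8 string INPUT to the encoding TOCODE, for example
   "us-ascii//TRANSLIT".  On success the NUL-terminated result is stored
   in OUT (of size OUTSIZE) and the number of bytes written is returned.
   Returns -1, with errno set, if iconv_open or iconv fails.
   Hypothetical helper, not part of Texinfo. */
long
probe_translit (const char *tocode, const char *input,
                char *out, size_t outsize)
{
  iconv_t cd = iconv_open (tocode, "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;              /* the encoding name is not supported */

  /* Some older systems declare the input pointer as const char **;
     the cast below targets the current POSIX prototype. */
  char *inptr = (char *) input;
  size_t inleft = strlen (input);
  char *outptr = out;
  size_t outleft = outsize - 1;

  size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  int saved_errno = errno;
  iconv_close (cd);

  if (res == (size_t) -1)
    {
      errno = saved_errno;  /* e.g. EILSEQ for an unconvertible character */
      return -1;
    }
  *outptr = '\0';
  return (long) (outptr - out);
}
```

Calling it with "us-ascii//TRANSLIT" and the UTF-8 string for "(" U+21A6 ")"
on each system, and printing either the result or strerror (errno), should
show whether the encoding name is rejected, the character is rejected, or
the character is silently replaced or dropped.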

I tried with '-c OUTPUT_CHARACTERS=1' and it made no difference to the
error messages.

I found the use of this encoding was introduced in commit 1c9a5f283:

Author: Patrice Dumas <[email protected]>
Date:   2023-10-11 15:11:11 +0200

    * tp/Texinfo/Convert/HTML.pm (_set_root_commands_targets_node_files):
    remove unused $output_units argument.  Remove unused $no_unidecode.
    Put $extension in if.
    
    * tp/Texinfo/XS/main/errors.c (reallocate_error_messages)
    (message_list_line_error_internal)
    (message_list_document_error_internal, message_list_document_error)
    (message_list_document_warn), tp/Texinfo/XS/main/get_perl_info.c
    (html_converter_initialize): add message_list_document_warn and
    message_list_document_error and add error messages in converter.
    
    * tp/Texinfo/XS/main/convert_utils.c, tp/Texinfo/XS/main/utils.c
    (output_conversions, input_conversions, decode_string, encode_string):
    move output_conversions, input_conversions, decode_string, encode_string
    to utils.c.
    
    * tp/Texinfo/XS/parsetexi/input.c (parser_input_conversions): rename
    input_conversions as parser_input_conversions.
    
    * tp/Texinfo/XS/convert/convert_html.c (normalized_to_id)
    (normalized_label_id_file, unique_target)
    (new_sectioning_command_target, set_root_commands_targets_node_files)
    (html_prepare_conversion_units_targets),
    tp/Texinfo/XS/convert/converter.c (id_to_filename)
    (normalized_sectioning_command_filename, node_information_filename),
    tp/Texinfo/XS/main/call_perl_function.c
    (call_file_id_setting_label_target_name)
    (call_file_id_setting_node_file_name)
    (call_file_id_setting_sectioning_command_target_name),
    tp/Texinfo/XS/main/node_name_normalization.c
    (unicode_to_transliterate, normalize_transliterate_texinfo)
    (normalize_transliterate_texinfo_contents): implement
    set_root_commands_targets_node_files.

Looking at that commit, I can't tell why "us-ascii//TRANSLIT" was used.
It seems likely that such an encoding wouldn't be supported, or wouldn't
behave identically, on different systems.
