On Wed, Oct 26, 2022 at 05:27:53AM +0300, Eli Zaretskii wrote:
> > Date: Tue, 25 Oct 2022 21:49:04 +0200
> > From: [email protected]
> > Cc: [email protected], [email protected]
> >
> > > The only part that is I think different on Windows is the encoding of
> > > file names, because Windows doesn't treat file names as opaque
> > > bytestreams. But anything that comes from a Texinfo source, even the
> > > name of an included file, should be interpreted according to
> > > @documentencoding. When accessing included files on Windows, we
> > > should re-encode the file names to the locale's encoding, because
> > > nothing else will work reliably. Is that what we do?
> >
> > Yes, but it does not work reliably either, as shown by the tests
> > results. The test which uses the locale's encoding fails (formatting
> > manual_include_accented_file_name_latin1), while the test in which the
> > document encoding is used, (formatting
> > manual_include_accented_file_name_latin1_explicit_encoding) does not
> > fail. As analysed just before, it works because both Windows and Perl
> > are consistently wrong, but still it seems to work better.
>
> Perhaps the logic of these tests fails on Windows? Can you perhaps
> describe the logic of each of these tests? In general, I see no
> reason why encoding file names using the locale's encoding should fail
> on Windows if done correctly. The idea of maintaining file names in
> UTF-8 internally and encoding them to the locale's encoding before
> using in file I/O calls is correct, and should work on Windows.

Here is what happens for formatting
manual_include_accented_file_name_latin1, which is the test that fails.

Let's call your locale LOC. The setup is a manual encoded in Latin1 and
an include file named included_latîn1.texi. On your computer, the î in
the include file name is stored as 0x05DE, which is what 0xEE converts
to in the LOC codepage. This is not î (which is 0x00EE), and the file
name shows this character instead of î when viewed in the Explorer.
However, the character is presented as 0xEE to Perl when accessing the
file, which is what Perl expects for î in Latin1.

On Windows, DOC_ENCODING_FOR_INPUT_FILE_NAME is set to 0 (it is set to 1
in other cases). In the XS parser, the î in the @include line is
converted from the Latin1 encoding of the Texinfo file to UTF-8, so 0xEE
gets converted to 0x00EE (UTF-8 encoded). Then, when the time comes to
include the file, encode_file_name from input.c is called. Neither
input_file_name_encoding nor doc_encoding_for_input_file_name is set, so
the locale encoding, LOC here, is used to recode the file name from
UTF-8 to LOC. The 0x00EE character (UTF-8 encoded) cannot be converted
to LOC, so either the conversion fails or a replacement character is
used. In any case, 0x00EE (UTF-8 encoded) never ends up recoded to 0xEE,
which is what would allow the file to be found. So in that case,
encoding to the locale leads to not finding the file.

If DOC_ENCODING_FOR_INPUT_FILE_NAME is set to 1, then the document
encoding, Latin1, is used to convert the 0x00EE character (UTF-8
encoded), which leads to 0xEE, and the file is found. Since
DOC_ENCODING_FOR_INPUT_FILE_NAME is set to 1 by default on platforms
other than Windows, the file is found on those platforms.

Note that my point is that the same happens on GNU/Linux.
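
To make the comparison concrete, here is a minimal Python sketch of the
selection and recoding steps described above. It is only an
illustration, not the code from input.c: pick_encoding and
encode_file_name are hypothetical helpers, and cp1255 is an assumed
stand-in for LOC (chosen because it maps byte 0xEE to 0x05DE and has no
î). It covers the Windows default, the document-encoding case, and a
UTF-8 locale with the flag set to 0.

```python
def pick_encoding(input_file_name_encoding, doc_encoding_for_input_file_name,
                  document_encoding, locale_encoding):
    # Hypothetical helper following the order described above: an explicit
    # input file name encoding wins, then the document encoding if the
    # flag is set, otherwise the locale encoding.
    if input_file_name_encoding:
        return input_file_name_encoding
    if doc_encoding_for_input_file_name:
        return document_encoding
    return locale_encoding


def encode_file_name(name, encoding):
    # Return the bytes used to look up the file, or None when the name
    # cannot be represented in the chosen encoding.
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None


name = "included_lat\u00een1.texi"   # name as held internally by the parser
on_disk = name.encode("latin-1")     # bytes seen on file access: ... 0xEE ...

# Windows default: flag is 0, locale assumed to be cp1255 (assumption for LOC).
win = encode_file_name(name, pick_encoding(None, 0, "latin-1", "cp1255"))
# Flag set to 1: the document encoding (Latin1) is used.
doc = encode_file_name(name, pick_encoding(None, 1, "latin-1", "cp1255"))
# UTF-8 locale with the flag forced to 0.
utf = encode_file_name(name, pick_encoding(None, 0, "latin-1", "utf-8"))

# Only the document-encoding case reproduces the bytes on disk.
print(win == on_disk, doc == on_disk, utf == on_disk)
```

Under these assumptions, only the document-encoding case yields bytes
that match the name on disk; the other two either fail to convert or
produce a different byte sequence.
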
In my UTF-8 locale, if I explicitly set
-c DOC_ENCODING_FOR_INPUT_FILE_NAME=0, the same as the default on
Windows, I get the same result as on Windows: the include file is not
found, because the file name remains UTF-8 encoded in the parser while
the file name on the filesystem is Latin1 encoded.

In the manual_include_accented_file_name_latin1_explicit_encoding test,
INPUT_FILE_NAME_ENCODING is set to ISO-8859-1, which leads to Latin1
being used to encode 0x00EE (UTF-8 encoded) to 0xEE. On Windows, it
emulates setting DOC_ENCODING_FOR_INPUT_FILE_NAME to 1.

In the manual_include_accented_file_name_latin1_use_locale_encoding
test, INPUT_FILE_NAME_ENCODING is set to UTF-8, which leads 0x00EE
(UTF-8 encoded) to remain UTF-8 encoded, so the include file is not
found. It emulates setting DOC_ENCODING_FOR_INPUT_FILE_NAME to 0 in a
UTF-8 encoded locale.

--
Pat
