Re: Trouble with output character sets from Win32 applications running under mksh

Thomas Wolff Tue, 04 Aug 2020 19:11:10 -0700


On 04.08.2020 at 23:19, Michael Shay via Cygwin wrote:

Michael

The content of your mail responses is unrecognizable due to utterly broken formatting.

[This is not top-posting as there's nothing to respond to]




From:   "Brian Inglis" <brian.ing...@systematicsw.ab.ca>
To:     cygwin@cygwin.com
Date:   08/04/2020 08:32 AM
Subject:        Re: Trouble with output character sets from Win32 applications running under mksh
Sent by:        "Cygwin" <cygwin-boun...@cygwin.com>



On 2020-08-03 16:05, Michael Shay via Cygwin wrote:

On 2020-08-03 11:42, Andrey Repin wrote:

Doesn't help. I tried 65001 (UTF-8):

Because you're confusing things.
chcp has nothing to do with LANG or LC_*.
Et vice versa.

chcp sets console code page for native console applications.

Only for those supporting it. Many do not.
LANG sets output parameters for Cygwin applications (and other programs
that look for it, but these are few).

You cut the significant statement at the top of the OP:

I'm having a problem with Cygwin 3.1.4, changing the character set on
the fly. It seems to work with Cygwin applications, but not with Win32
applications.

He has problems with invalid characters only when running Win32 console
applications; I changed the subject to hopefully better reflect the
issue.

I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have
to use the Windows codepage conversion routines.

You can only change input character sets on the fly; output character
sets will depend on mintty's support for xterm-compatible character sets
and switching escape sequences; if you set up UTF16LE console output,
Windows and mintty should handle it.

Perhaps a better description of your environment, build tools, what you
are trying to do, what you expect as output, and what you are getting as
output, could help us better understand and help with the issue you see.

The script I sent changes the locale information: LANG and LC_ALL are
set to en_US.CP1252, i.e.

export LANG="en_US.CP1252"
export LC_ALL=en_US.CP1252

FYI the normal sequence and order to check is LANG, LC_CTYPE, LC_ALL,
where the last var set wins, or the reverse where the first var set
wins; the default locale may be POSIX C.ASCII or the effective Windows
locale, depending on your startup.
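
For reference, POSIX resolves the character-type locale by checking
LC_ALL first, then LC_CTYPE, then LANG. A minimal sketch of that lookup
follows, assuming only POSIX sh features. The helper names
effective_ctype and locale_charset are hypothetical and are not part of
Cygwin or mksh.

```shell
# Hypothetical helper: print the locale name that governs character
# handling, using the POSIX precedence LC_ALL > LC_CTYPE > LANG.
# Falls back to "C" when none of the three is set to a non-empty value.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Hypothetical helper: print the charset part of a locale name
# (the text after the first dot, without any @modifier).
# Prints an empty line when the name carries no charset.
locale_charset() {
    name=$1
    case $name in
        *.*) cs=${name#*.}; printf '%s\n' "${cs%%@*}" ;;
        *)   printf '\n' ;;
    esac
}
```

With LANG and LC_ALL both set to en_US.CP1252, as in the script
described above, locale_charset "$(effective_ctype)" prints CP1252.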

Thanks, that's good to know.

Then, it runs a simple Win32 program that takes a single input argument,
ZÇ, the second character being C-cedilla, an 8-bit character, hex value
0xc7.

The Win32 program transcodes the input Unicode argument using the Cygwin
character set to determine the codepage, 1252.

Do you mean using the environment variables to determine the codepage?

Yes. Our code does try to fetch the character set information from the
environment.


FYI the default character set if none is specified is the Unix
equivalent of the default Windows "ANSI"/OEM code page, in English or
many European locales that will be ISO-8859-1.

You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars...
to convert a string to the required character set for console or GUI
programs.

Our production code uses the console to display error information in the
appropriate character set, but our command-line utilities expect to be
able to take input strings encoded in the character set in use, which
may be an 8-bit SBCS like ISO-8859-1 or Windows 1252, or an MBCS like
UTF-8 or e.g. Windows 932. Using 'cygpath' isn't an option.

Please specify what you mean by "Unicode" in each context; that term
means a standard for representing scripts in many writing systems with a
large character glyph repertoire and a number of encodings,
representations, and handling rules: in each use case, do you mean a
char/wchar representation, and/or an encoding UTF16LE or UTF-8?
Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.

Unicode == UTF-16 in all cases. This is the wide-character set used by
Microsoft, as far as I can tell, in the wide-char version of their Win32
API functions, e.g. CreateProcessW() vs. CreateProcessA().

To check what is available and what is in effect in Cygwin, try e.g.:

$ for o in system user no-unicode input format; do echo `locale --$o` $o; done
en_US system
en_GB user
en_CA no-unicode
en_CA input
en_CA format
$ locale

on both Cygwin versions.

1.7.28 output
$ for o in system user no-unicode input format; do echo `locale --$o` $o; done

en_US system
en_US user
en_US no-unicode
locale: unknown option -- input
Try `locale --help' for more information.
input
en_US format

3.1.4 output
$ for o in system user no-unicode input format; do echo `locale --$o` $o; done

en_US system
en_US user
en_US no-unicode
en_US input
en_US format

FYI see:

                  https://cygwin.com/cygwin-ug-net/setup-locale.html

It then prints the transcoded characters to stdout, and the result
should be ZÇ, identical to the input argument.
This works fine using Cygwin 1.7.28.

Which Windows version are you running Cygwin 1.7.28 on?
Please show output from cmd /c ver.

$cmd /c ver
Microsoft Windows [Version 10.0.18363.959]

That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for
years. That version may not have completely supported international
character sets and may just assume that everything is in
ISO-8859-1/Latin-1, which is similar to CP1252, so that may work, or
your system default OEM codepage e.g. 437 or 850, and pass it along.

Our code supports dozens of character sets, for international sales, and
that includes many SBCS and MBCS, as well as UTF-8. I can use any of the
codepages supported by Windows and Cygwin, and 1.7.28 handles them just
fine.

Cygwin 3.1.4 is launching the Win32 application, and is responsible for
transcoding the arguments passed to it by mksh, in this case CP1252
characters ZÇ, into Unicode.

Do you mean you believe Cygwin should recode argument strings, and what
do you mean by Unicode in this context?

When I launch a Win32 application that is using a character set other
than 7-bit ASCII in a Cygwin shell, the shell passes the command and
arguments in the input character set. So, for example, using CP 1252 as
the character set, and passing 8-bit single-byte characters like e.g.
ZÇ, the shell doesn't change the characters, it passes them through to
Cygwin to launch the process.

In my test, using gdb ($ gdb --version: GNU gdb (GDB) (Cygwin 8.2.1-1)
8.2.1), i.e. "gdb ksh.exe", then "(gdb) start -c 'cygtest.exe ZÇ'", I
can step into spawnve() in spawn.cc. At this point, examining the input
arguments confirms that the input argument 'ZÇ' is still in the correct
encoding, i.e. 0x5a 0xc7. The real work of launching the process is done
in child_info_spawn::worker(). Eventually, the code invokes
CreateProcessW(). The executable path is already in UTF-16 format, so
the only transcoding left to be done is the argument string. This is
done in the linebuf::wcs() function (winf.h). This small method invokes
sys_mbstowcs(), in strfuncs.cc.

So yes, I do believe Cygwin should transcode the argument strings from
whatever their current character set is to UTF-16. This is what the
ancient 1.7.28 did.

That means Cygwin has to use the mb-to-uc function for transcoding
codepage 1252 to Unicode.

I am unsure if Cygwin does any recoding internally except for input
typed on the terminal console interface.
CP1252 is an SBCS not an MBCS so MB functions are not required.
What do you expect when you use Unicode here?

If Cygwin no longer does this internal transcoding, that's a significant
change from previous versions. I only know 1.7.28 did the transcoding
correctly, and it's certainly possible that at some point between that
version and 3.1.4, the behavior changed. Yes, CP1252 is an SBCS, but it
supports 8-bit characters, unlike 7-bit ASCII, so it requires a
different mapping to UTF-16. Using either CP 1252 or 7-bit ASCII,
though, would require a different transcoding routine than the UTF-8 ->
UTF-16 one that gets used.

It does not. It uses the UTF-8 to Unicode function (I've seen this using
gdb). That function flags the Ç as an invalid UTF-8 sequence, not
surprisingly since it's not a UTF-8 character.
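
The rejection can be reproduced outside Cygwin's internals. The sketch
below is only an illustration: it assumes an iconv(1) that accepts the
encoding names UTF-8 and UTF-16LE, and the helper name is_valid_utf8 is
hypothetical. It feeds the two bytes 0x5a 0xc7 to a UTF-8 decoder, which
is the same kind of check the UTF-8 conversion function performs.

```shell
# Hypothetical helper: succeed if stdin decodes as well-formed UTF-8.
# Assumes iconv(1) is installed and knows UTF-8 and UTF-16LE.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1
}

# 'Z' followed by the single CP1252 byte 0xC7 (octal 307).
# 0xC7 is a UTF-8 lead byte with no continuation byte after it.
if printf 'Z\307' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```

A lone 0xc7 announces a two-byte UTF-8 sequence, so a decoder that
finds no continuation byte after it has to reject the input.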

What Windows, Cygwin, gdb versions are you seeing this on and what is
the name of the function you are seeing?

Windows - Microsoft Windows [Version 10.0.18363.959]
Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin

gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1)
As described above, spawnve() calls child_info_spawn::worker() to do the
real work of launching a process, a Win32 or a Cygwin process. The
conversion of the process arguments into UTF-16 is done through
linebuf::wcs(), into sys_mbstowcs(). In the latter function the only
work done is to check if the pointer to the MBCS-to-WCS function is
'__ascii_mbtowc' and, if so, to instead set it to '__utf8_mbtowc'. It
then invokes sys_cp_mbstowcs() to do the work.

However, the problem, if there is one, must be occurring very early on.
dll_crt0_1(), which according to the comments "Take over from libc's
crt0.o and start the application.", fetches the locale from the
environment:

  /* Set internal locale to the environment settings. */
  initial_setlocale ();

I suspect that it's here where either there's a problem, or Cygwin
behavior has changed from 1.7.28. I haven't tried to use gdb to step
into that initialization code.

No matter what character set I use in 'export LANG...' and 'export
LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function
in sys

... what should be there and what is the name of the function used?

1.7.28 Uses the correct function.

What is the name of that function?

The function is sys_cp_mbstowcs(), which is invoked by sys_mbstowcs() as
it is in 3.1.4. But the older version doesn't get the pointer to the
mb-to-wc transcoding function passed to it; it fetches the pointer and
the character set from cygheap->locale and passes those to
sys_cp_mbstowcs().

I'm not using mintty, I'm using mksh, a requirement since our software
uses lots of shell scripts, and for legacy support, that means using a
Korn shell.

So that means that the mksh is running on the Windows console, and you
are not running mintty.

Correct.

I could understand it if 1.7.28 didn't do the proper transcoding, but it
does.

You may just be seeing Cygwin 1.7.28 passing the character codes along
verbatim.

I don't think so. child_info_spawn::worker() has to translate the CP1252
characters into UTF-16. And it does, as I've seen using Windbg on the
Windows side of this.

I used:

         gdb mksh

to load mksh into the debugger, then started it with

         start -c 'cygtest.exe ZÇ'

Windows, Cygwin, and gdb versions?

Windows - Microsoft Windows [Version 10.0.18363.959]
Cygwin - CYGWIN_NT-10.0 engr-cygwin-10vm 3.1.4(0.340/5/3) 2020-02-19 08:49 x86_64 Cygwin

gdb - GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1)

That allowed me to step into child_info_spawn::worker() and stop at the
call to CreateProcess(), where the command line (cygtest.exe) and
argument (ZÇ) are translated into Unicode.

In this case you mean into a UTF16LE string?

Yes.

This is the code to which I'm referring, in strfuncs.cc, which is
supposed to translate the command line and arguments from CP 1252 into
Unicode.

   size_t __reg3
   sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
   {
     mbtowc_p f_mbtowc = __MBTOWC;
     if (f_mbtowc == __ascii_mbtowc)
       {
         /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc'
            EVERY TIME, REGARDLESS OF THE CODEPAGE. */
         f_mbtowc = __utf8_mbtowc;
       }
     return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
   }

So 'f_mbtowc' is set to '__ascii_mbtowc', the default. You said:

UTF-8 contains ASCII as the first 128 code points, so that is valid,
unless the "ASCII" used isn't really, and has character codes > 127!

CP1252 supports 8-bit single-byte characters such as C-cedilla. The
UTF-8 representation is a 2-byte sequence (0xc3 0x87) that is not
correct if the character set in use is CP1252.
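
To make the byte-level difference concrete, here is a small sketch that
transcodes the test argument from CP1252 and prints the resulting bytes.
It assumes iconv(1) with the encoding names CP1252, UTF-8 and UTF-16LE,
plus od(1) and tr(1). The helper name hex_bytes is hypothetical.

```shell
# Hypothetical helper: print stdin as lowercase hex bytes, no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# 'Z' plus the CP1252 byte 0xC7 (octal 307), converted from CP1252.
printf 'Z\307' | iconv -f CP1252 -t UTF-8    | hex_bytes; echo
printf 'Z\307' | iconv -f CP1252 -t UTF-16LE | hex_bytes; echo
```

The first line of output is 5ac387 (UTF-8) and the second is 5a00c700
(UTF-16LE). The second is the form a wide-character Win32 API such as
CreateProcessW() expects for U+005A U+00C7.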

You can only change input character sets on the fly;

The input character set to Cygwin should have been changed to CP 1252,
as it was in 1.7.28. At least, that's what I would expect to happen. If
it does not, or if mintty is required, then that's a regression from
1.7.28.

As Cygwin packages are rolling releases, old releases are unsupported,
and you must upgrade to the latest release, reproduce the problem with a
simple test case, and other examples if you wish, and post that with a
copy of the output from:

                  $ cygcheck -hrsv > cygcheck.out

as a plain text attachment to your post.

I understand. We do not ship a stock Cygwin installation. I happen to
have an unmodified 3.1.4 on a development machine and was able to
reproduce the problem with it. But we cannot take frequent Cygwin
updates, as it takes far too long to find and fix problems between
Cygwin and our code. The version has to be stable for months before we
can use it.
Thanks for the helpful suggestions and information. I'll send updates,
in case anyone else sees a similar problem.
Michael Shay


--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple
