On Monday 11 November 2024 22:10:01 Lasse Collin wrote:
> On 2024-11-11 Pali Rohár wrote:
> > If application do not want to fail then double quote in _acmdln must
> > not come from other (non double quote) character. Otherwise argv[]
> > would be wrongly constructed. And I think that argv[] splitting must
> > be done correctly.
> 
> I agree it makes sense to do splitting correctly even if parsing is
> otherwise permissive. It's unlikely enough that someone has used
> fullwidth double quotes in a .bat file to quote filenames and expecting
> them to work like ASCII double quotes. :-)
> 
> > If application wants to fail when conversion is not lossless then it
> > does not matter what would be filled in _acmdln at the time of
> > application abort / exit call.
> 
> Yes.
> 
> > So I think that as a first step overwriting _acmdln can be useful.
> > Second step could be to add an option to fail on non-lossless
> > conversion.
> 
> I'm strongly in favor of making the _exit(255) behavior the default
> and requiring opt-in to get permissive mode. Most apps will use
> whatever is the default. It's easier to fix a few apps that need the
> permissive mode than to teach all other apps to enable the strict mode.

Now when Martin posted that there are more issue reports, I'm not very
happy to have it by the default. At least not for now.

> If permissive mode can be enabled the same way as _dowildcard works,
> then argv[] could be constructed with the same code as in strict mode.
> The current crtexe.c in master has
> 
>   for (int i = 0; i < argc; ++i) {
>     BOOL conv_was_lossy = TRUE;
>     int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
>         wargv[i], -1,
>         NULL, 0,
>         NULL, &conv_was_lossy);
> 
> but it could become something like this:
> 
>   BOOL conv_was_lossy = FALSE;
>   BOOL *conv_was_lossy_ptr = _strict_argv ? &conv_was_lossy : NULL;
> 
>   for (int i = 0; i < argc; ++i) {
>     int size = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
>         wargv[i], -1,
>         NULL, 0,
>         NULL, conv_was_lossy_ptr);

This does not fix _acmdln, which as I pointed is used in other places of
mingw-w64. That code would be needed to move out its current place to
first fix _acmdln.

> It would allow apps to set _strict_argv = 0 to disable the _exit(255)
> usage.
> 
> I don't know if permissive mode should use WC_NO_BEST_FIT_CHARS.
> Keeping it can avoid some security risks. But maybe it could create new
> issues if best-fit conversion happened to be preferred (for example,
> command line argument is a message to show to a user).
> 
> If WC_NO_BEST_FIT_CHARS has compatibility concerns with very old Windows
> versions but one doesn't want best-fit mapping, I wonder if the
> lpDefaultChar argument is still usable. One could set it to, for
> example, "?" or "_".

At least every character which is by best-fit mapped to double quote has
to be replaced.

> The conversion in crtexe.c happens after possible wildcard expasion in
> the CRT. Thus a ? from charset conversion won't be a wildcard before
> main() is called. But if the app does wildcard expansion on its own,
> then ? might be a problematic replacement character. It would be
> possible to let app to customize which character to use but at some
> point the customization options get complicated. :-)
>
> If _acmdln was overridden and then CRT would parse the command line in
> narrow mode, best-fit mapping in wildcard expansion cannot be avoided
> or customized. I feel the above idea of using the same argv[] code in
> both strict and permissive mode is easier.

But it does not solve the problem in crtexewin.c file.

> One could still override _acmdln even if the startup code doesn't need
> it (in case some app reads it still). If best-fit mapping isn't needed,
> the simplest method could be using WideCharToMultiByte() to convert
> _wcmdln to _acmdln. One would use WC_NO_BEST_FIT_CHARS or set
> lpDefaultChar (perhaps avoiding "?" in case app treats it as a
> wildcard).
> 
> You pointed out that it's possible that Microsoft will fix something
> around this issue. I understand it might make sense to wait what they
> will do so that we don't create new problems by rushing a fix with
> possibly-incompatible behavior into MinGW-w64. :-| I don't have a clear
> opinion here, I hope MinGW-w64 maintainers do. :-)

Well, we still do not know how to fix this problem. And before doing any
future change I would rather think about it, to ensure that it handles
all cases, and also all source files in mingw-w64. Are those only
crtexe.c and crtexewin.c? Or are there also other files which parses
cmdline?

> > Does it makes sense to fix this problem in argv[] if we have exactly
> > same problem in FindFirstFileA()? argv[] would be just an partial and
> > incomplete fix of rather larger issue at all.
> 
> Hmm, I guess it still makes sense. Not all apps call FindFirstFileA().
> 
> I wonder how FindFirstFileA() could be fixed. Perhaps it could skip
> problematic filenames and remember that such a name was seen. Then
> after listing all files successfully, fail with some error code. This
> could create new issues though. :-/

The only idea for FindFirstFileA/FindNextFindA which I have is to return
"problematic" file names only in 8.3 notation. 8.3 notation is supported
in all WinAPI calls, it is always in ASCII and has 1:1 mapping with the
original UNICODE file name. The only downside is that it may have
totally different name as expected and can be confusing. But I think it
is still better than returning the "wrong" name which would alias with
other file.

> > > FindFirstFileA() and FindFirstFileExA() use best-fit conversion.
> > > With UTF-8 code page, only unpaired surrogates are a problem in
> > > terms of charset conversion. With UTF-8 one can run into MAX_PATH
> > > limitation of WIN32_FIND_DATAA.cFileName though.
> > 
> > I see, so MS chosen to translate all unpaired surrogates to to UNICODE
> > replacement character, and therefore made wchar_t[] to utf8_t[]
> > mapping non-bijective.
> 
> Yes. Documentation of WideCharToMultiByte() says that in WinXP the
> conversion was bijective but it was changed in Vista to not produce
> invalid UTF-8 (the code points of surrogates are invalid UTF-8).
> 
> I suppose that in practice it should often be good enough that sensible
> filenames are accessible via *A() APIs with UTF-8 code page. The lossy
> conversions are more troublesome as they can result in access of wrong
> files.

I see, I quite understand this requirement that conversion function to
UTF-8 has to return valid UTF-8 string. (Unpaired) surrogate is by
UNICODE definition (in UNICODE standard) invalid in UTF-8.

Anyway, gcc / GNU LD still does not have support for marking PE binary
as UTF-8 capable. So using UTF-8 code page for application linked by
GNU LD is not an option for now. The reason is that GNU LD is not able
to work with manifests (where are put all new PE flags / options).

In past I reported this missing UTF-8 support into GNU LD bugzilla:
https://sourceware.org/bugzilla/show_bug.cgi?id=30382


_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public

Reply via email to