On Sunday 10 November 2024 20:56:05 Lasse Collin wrote:
> On 2024-11-10 Pali Rohár wrote:
> > On Sunday 10 November 2024 00:14:42 Pali Rohár wrote:
> > > > On 2024-11-09 LIU Hao wrote:  
> > > > > I have a crazy idea now. Does it make sense to overwrite
> > > > > `_acmdln` (for MSVCRT) or `*__p__acmdln()` (for UCRT) with a
> > > > > sanitized string, so existent argument parsing may be reused?  
> > > 
> > > It looks like that both _acmdln and _wcmdln are initialized in CRT
> > > DLL entry point. And these variables are used by all other calls,
> > > GetCommandLineA() or GetCommandLineW() are not used later.
> > > 
> > > So from this quick look, it should be enough to change _acmdln in
> > > mingw-w64 startup code as early as possible and then __getmainargs()
> > > should work fine (it also uses _acmdln and not GetCommandLineA(), at
> > > least in msvcrt.dll).  
> > 
> > I looked also on UCRT source code and seems that this should work.
> > UCRT's _configure_narrow_argv() also takes command line string from
> > _acmdln. And _acmdln is initialized in UCRT DLL entry point via
> > GetCommandLineA().
> > 
> > So in my opinion overwriting content of _acmdln in EXE entry point
> > should be enough and __getmainargs() then would work correctly.
> 
> If conversion cannot be done in a lossless way, it depends on the
> application what is a safe fallback, or if there are any useful
> fallbacks at all. Maybe I'm missing something but I don't see a useful
> generic method to overwrite _acmdln.

If the application does not want to fail, then a double quote in _acmdln
must not come from some other (non-double-quote) character. Otherwise
argv[] would be constructed incorrectly, and I think that argv[]
splitting must be done correctly.

If the application wants to fail when the conversion is not lossless,
then it does not matter what is stored in _acmdln at the time of the
application's abort / exit call.

So I think that, as a first step, overwriting _acmdln can be useful.
A second step could be to add an option to fail on non-lossless
conversion.
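
A minimal sketch of what such a "fail on non-lossless conversion" check
could look like, assuming a toy single-byte code page. The names
toy_to_narrow(), toy_to_wide() and cmdline_is_lossless() are hypothetical
and exist only for this illustration; on Windows the real conversion
would go through WideCharToMultiByte() (for example with
WC_NO_BEST_FIT_CHARS and lpUsedDefaultChar), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy "code page" (an assumption for this sketch): ASCII is kept,
 * U+FF02 is best-fitted to '"', everything else becomes '?'. */
unsigned char toy_to_narrow(uint16_t wc)
{
    if (wc < 0x80)
        return (unsigned char)wc;
    if (wc == 0xFF02)
        return '"';
    return '?';
}

/* Toy reverse mapping: every byte maps to the same code unit value. */
uint16_t toy_to_wide(unsigned char c)
{
    return c;
}

/* Returns 1 if every UTF-16 code unit survives the round trip
 * wide -> narrow -> wide unchanged, otherwise 0. */
int cmdline_is_lossless(const uint16_t *wide)
{
    assert(wide != NULL);
    for (size_t i = 0; wide[i] != 0; i++) {
        if (toy_to_wide(toy_to_narrow(wide[i])) != wide[i])
            return 0;
    }
    return 1;
}
```

With such a check, the startup code could abort before argv[] is ever
built, so the content of _acmdln would not matter in that case.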

> > Note that both _acmdln and _wcmdln are always initialized, for both
> > ANSI and UNICODE builds. So as a simple sanitization, something like
> > should be enough?
> > 
> > Iterate over (multibyte) string _acmdln and for every found double
> > quote check that it exists also at _wcmdln[iter] (where iter is
> > iteration in multibyte _acmdln string). If double quote is in
> > *_acmdln_iter but not in _wcmdln[iter] then change *_acmdln_iter to
> > some other character.
> > 
> > This could ensure that any code which parses _acmdln will see only the
> > original double quotes and not best fits of double quotes.
> 
> There could be other characters that can become ASCII double quote.
> Also, indexing of _wcmdln[iter] is too simple because, on Windows,
> wchar_t is a variable-length encoding.

Right, I forgot that on Windows wchar_t[] is a variable-length UTF-16
encoding. So yes, the code iterating over _wcmdln would have to be
different.
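
A rough sketch of the double quote sanitization with UTF-16 aware
iteration. It is not the mingw-w64 implementation. Instead of indexing
both strings in parallel, it walks the wide string one code point at a
time and rebuilds the narrow string, so no index alignment is needed.
Assumptions: a single-byte code page (one output byte per code point),
and a hypothetical per-code-point converter best_fit_fn standing in for
WideCharToMultiByte(). toy_best_fit(), utf16_next() and
sanitize_cmdline() are made-up names, and the U+FF02 mapping is only a
toy example of a best fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-code-point best-fit conversion (stand-in only). */
typedef unsigned char (*best_fit_fn)(uint32_t cp);

/* Toy mapping: ASCII kept, U+FF02 best-fitted to '"', rest '?'. */
unsigned char toy_best_fit(uint32_t cp)
{
    if (cp < 0x80)
        return (unsigned char)cp;
    if (cp == 0xFF02)
        return '"';
    return '?';
}

/* Decode one code point from NUL-terminated UTF-16 and advance *i by
 * one or two code units. An unpaired surrogate is returned as-is. */
uint32_t utf16_next(const uint16_t *s, size_t *i)
{
    uint32_t u = s[(*i)++];
    if (u >= 0xD800 && u <= 0xDBFF) {
        uint32_t lo = s[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*i)++;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return u;
}

/* Convert wide -> narrow and replace every '"' that did not come from
 * U+0022 with subst. Returns the number of bytes written (without the
 * terminating NUL). Output is truncated if dst is too small. */
size_t sanitize_cmdline(const uint16_t *wide, char *dst, size_t dst_cap,
                        best_fit_fn conv, char subst)
{
    size_t i = 0, n = 0;
    assert(dst_cap > 0);
    while (wide[i] != 0 && n + 1 < dst_cap) {
        uint32_t cp = utf16_next(wide, &i);
        unsigned char c = conv(cp);
        if (c == '"' && cp != 0x22)
            c = (unsigned char)subst;
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return n;
}
```

Any parser that later splits the resulting string into argv[] then sees
only the original double quotes. Other best-fit issues stay unfixed.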

> Double quote sanitization would still leave other best-fit issues
> unfixed.

That is true. But we need to start somewhere, and I think that this is
something which would be needed for any solution.

> > Maybe stupid question, but what happens when you try to list folder
> > which contains files which names in active code page are all same?
> > Imagine that you have an application which does not use argv[] at all,
> > it list files in the current directory and from every file prints for
> > example first byte. What would happen in this case? Is not here same
> > problem as with wildcard expansion?
> 
> It's a very good question. :-) It's indeed the same problem. Wrong
> files might be accessed in your example.

Ahhh. So it means that FindFirstFileA() (and all functions which use it;
IIRC there are more of them in msvcrt.dll) is broken, and applications
using it would have the same issue.

Does it make sense to fix this problem in argv[] if we have exactly the
same problem in FindFirstFileA()? Fixing argv[] would be just a partial
and incomplete fix for a much larger issue.

> FindFirstFileA() and FindFirstFileExA() use best-fit conversion. With
> UTF-8 code page, only unpaired surrogates are a problem in terms of
> charset conversion. With UTF-8 one can run into MAX_PATH limitation
> of WIN32_FIND_DATAA.cFileName though.
> 
> -- 
> Lasse Collin

I see, so MS chose to translate all unpaired surrogates to the Unicode
replacement character, and therefore made the wchar_t[] to utf8_t[]
mapping non-bijective.
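
To illustrate the non-bijective mapping, here is a small portable
sketch of a lossy UTF-16 to UTF-8 conversion that replaces every
unpaired surrogate with U+FFFD. It only models the behaviour described
above. It is not the Windows implementation, and utf8_put() and
utf16_to_utf8_lossy() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8; returns the number of bytes (1..4). */
size_t utf8_put(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert NUL-terminated UTF-16 to NUL-terminated UTF-8. Unpaired
 * surrogates become U+FFFD. The caller must provide at least
 * 3 bytes per input code unit plus 1. Returns the UTF-8 length. */
size_t utf16_to_utf8_lossy(const uint16_t *s, unsigned char *out)
{
    size_t i = 0, n = 0;
    assert(s != NULL && out != NULL);
    while (s[i] != 0) {
        uint32_t cp = s[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            uint32_t lo = s[i++];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate */
        }
        n += utf8_put(cp, out + n);
    }
    out[n] = 0;
    return n;
}
```

Under this model, the three distinct wide names { 'x', 0xD800 },
{ 'x', 0xDC00 } and { 'x', 0xFFFD } all produce the same UTF-8 bytes,
so the original name cannot be recovered from the converted one.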

Note that the only difference between Linux filenames and Windows
filenames is that Linux uses uint8_t[] for them and Windows uses
uint16_t[]. That is why a filename on Windows can contain unpaired
surrogates. A full-featured filesystem on Windows (not sure if somebody
ever implemented one, but it is theoretically possible in the NT kernel)
could be case-sensitive, without Unicode normalization, and allow all
characters except backslash (yes, including the null character, or the
filename "."). On Linux only slash and null are disallowed. Neither FAT
nor NTFS is that full-featured. PS: FAT allows "." or ".." as filenames,
and funny things happen if FindFirstFile()/FindNextFile() returns "."
two times, once as a directory (the current directory) and a second time
as a file (in the current directory).


_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
