Re: [cmake-developers] [PATCH] Improve encoding handling on Windows

clinton Fri, 01 Jul 2016 15:55:33 -0700


----- On Jun 30, 2016, at 8:18 PM, Dāvis Mosāns davis...@gmail.com wrote:


> On Windows getenv uses ANSI codepage so it needs to be encoded to
> internally used encoding (eg. UTF-8). Here we use _wgetenv instead
> and encode that.
> 
> Also typically Windows applications (eg. MSVC compiler) use current
> console's codepage for output to pipes so we need to encode that
> to internally used encoding (KWSYS_ENCODING_DEFAULT_CODEPAGE).
> 
> Next, when we're outputing to console need to use wide functions.
> 
> This change allows that compilers such as MSVC on Windows can be
> installed in non-ASCII path and will work correctly for all
> console's codepages which supports that path's characters.

...

> +#if defined(_WIN32)
> +  // Kinda hack, with MSVC (and MinGW) for some reason std::wcout
> +  // (and all other std::w*) fails once it encounters non-ASCII
> +  // string unless locale is set.
> +  // Note that even with this, seems it can't output characters
> +  // which aren't present in ANSI codepage no matter what encoding
> +  // is used for console.
> +  // Also once any character outside of ANSI codepage is tried to
> +  // be outputed then after there anymore won't be output from
> +  // any of std::w* functions.
> +  _wsetlocale(LC_ALL, L"");
> +#endif

The _wsetlocale() may be a problem.

See:
https://gitlab.kitware.com/cmake/cmake/commit/87be2e142
https://cmake.org/Bug/view.php?id=16099


> @@ -30,8 +30,23 @@ namespace @KWSYS_NAMESPACE@
>       typedef std::basic_filebuf<CharType,Traits> my_base_type;
>       basic_filebuf *open(char const *s,std::ios_base::openmode mode)
>       {
> +        std::wstring wstr = Encoding::ToWide(s);
> +        const wchar_t *ws = wstr.c_str();
> +#if defined(_MSC_VER) && _MSC_VER >= 1400
> +        const wchar_t *ss = ws;
> +#else
> +        const char *ss = 0;
> +        size_t length = std::wcstombs(0, ws, 0);
> +        if (length != size_t(-1)) {
> +          std::vector<char> str(length + 1);
> +          ss = str.data();
> +          std::wcstombs((char *)ss, ws, str.size());
> +        } else {
> +          ss = s;
> +        }
> +#endif
>         return static_cast<basic_filebuf*>(
> -          my_base_type::open(Encoding::ToWide(s).c_str(), mode)
> +          my_base_type::open(ss, mode)
>           );

Yes.  It makes sense to convert from utf-8 to code page when we are not using 
the wide apis.
This would at least give you support for the current code page, beyond ascii.   
Beyond that, the wide functions should be used.
Which compiler are you trying to support here, which doesn't give a wide open()?


>       }
>   };
> diff --git a/Source/kwsys/ProcessWin32.c b/Source/kwsys/ProcessWin32.c
> index 2b93e69..208e725 100644
> --- a/Source/kwsys/ProcessWin32.c
> +++ b/Source/kwsys/ProcessWin32.c
> @@ -181,7 +181,7 @@ struct kwsysProcessPipeData_s
>   /* ------------- Data managed per call to Execute ------------- */
> 
>   /* Buffer for data read in this pipe's thread.  */
> -  char DataBuffer[KWSYSPE_PIPE_BUFFER_SIZE];
> +  char DataBuffer[KWSYSPE_PIPE_BUFFER_SIZE*2];

This "*2" assumes all characters cmake sees will fit in the Basic Multilingual 
Plane.
Are we OK assuming that?
If the conversion from a code page to utf-16 results in surrogate pairs, there 
may not be enough space.


> 
>   /* The length of the data stored in the buffer.  */
>   DWORD DataLength;
> @@ -1626,6 +1626,25 @@ void kwsysProcessPipeThreadReadPipe(kwsysProcess* cp,
> kwsysProcessPipeData* td)
>       KWSYSPE_DEBUG((stderr, "read closed %d\n", td->Index));
>       }
> 
> +    if (td->DataLength > 0) {
> +        UINT codepage = GetConsoleCP();
> +        if (!codepage) {
> +            codepage = GetACP();


Can this be stored in kwsysProcessCreateInformation, and not computed ever time 
we read data?



> +        }
> +        if (codepage != KWSYS_ENCODING_DEFAULT_CODEPAGE) {
> +            int wlength = MultiByteToWideChar(codepage, 0, td->DataBuffer,
> td->DataLength, NULL, 0);
> +            wchar_t* wdata = malloc(wlength * sizeof(wchar_t));
> +            int r = MultiByteToWideChar(codepage, 0, td->DataBuffer,
> td->DataLength, wdata, wlength);
> +            if (r > 0) {
> +                r = WideCharToMultiByte(KWSYS_ENCODING_DEFAULT_CODEPAGE, 0,
> wdata, wlength, td->DataBuffer, KWSYSPE_PIPE_BUFFER_SIZE * 2, NULL, NULL);
> +                if (r > 0) {
> +                    td->DataLength = r;
> +                }
> +            }
> +            free(wdata);
> +        }
> +    }

How about avoiding a malloc()/free() each time we read an array?

ProcessWin32.c already uses Encoding.h.  How about using the Encoding instead 
of 
WideCharToMultiByte(...)?  I'm fine either way.


Thanks for looking into this.
I had experimented with some of this as a next step, but didn't finish it.
Your general approach seems correct.

Clint
-- 

Powered by www.kitware.com

Please keep messages on-topic and check the CMake FAQ at: 
http://www.cmake.org/Wiki/CMake_FAQ

Kitware offers various services to support the CMake community. For more 
information on each offering, please visit:

CMake Support: http://cmake.org/cmake/help/support.html
CMake Consulting: http://cmake.org/cmake/help/consulting.html
CMake Training Courses: http://cmake.org/cmake/help/training.html

Visit other Kitware open-source projects at 
http://www.kitware.com/opensource/opensource.html

Follow this link to subscribe/unsubscribe:
http://public.kitware.com/mailman/listinfo/cmake-developers

Re: [cmake-developers] [PATCH] Improve encoding handling on Windows

Reply via email to