Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-04-18 Thread Edward Welbourne via Development

Lars Knoll (18 April 2023 09:46) replied
>> I think this should be the goal, but I’d vote for a slightly faster
>> schedule.
>>
>> (a) and (b) are things we should be able to do right now.

I (18 April 2023 14:05) commented:
> Sounds sensible to me.

... so I have opened QTBUG-112954 and QTBUG-112955 for the
opening move of making it possible for the user to opt in.

Eddy.
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-04-18 Thread Thiago Macieira

On Tuesday, 18 April 2023 00:46:26 PDT Lars Knoll wrote:
> > But anything that goes through QIODevice::read or write (QProcess, QFile,
> > Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that
> > encoding is. Usually for sockets, the protocol is binary and obviates the
> > problem. For files, some file formats help. But in particular for
> > communicating with another process, there's no reliable way.
> 
> Communicating through a socket will always require that both sides agree on
> the encoding. That’s not really anything new.
> 
> The question is how they encode the data when writing to the socket. If they
> use QTextStream, the data will by default get written in utf8 already today
> (since Qt 6.0). If they explicitly convert the QString to and from a
> specific encoding using QStringConverter/QTextCodec nothing bad will happen
> either.
> 
> So the remaining problem comes when they use QString::to/fromLocal8Bit(), as
> that might change from some windows locale to utf8. Not a problem when
> communicating with a socket between two Qt apps, but might be an issue when
> storing data in a file or communicating with an app that doesn’t use Qt.
> 
> But we could consider that a user error, as you really shouldn’t use
> local8bit for anything else than stdin/out and interfacing with 8bit system
> APIs.

Please don't focus on sockets, as we all agree the protocol will usually 
inform what the encoding is. Instead, let's focus on QProcess.

Here's a test: write an application that displays in GUI the output of:

  QProcess proc;
  proc.start("cmd.exe", { "/c", "dir" });

This is an uncommon scenario, but it is representative of any application that 
is or simulates a terminal. If you want to have a more realistic version of 
the above, replace "dir" with "nmake" or "ninja": all three will print the 
names of files.

Conversely, write the application that keeps its output unmodified so it can be 
consumed by its current consumers.
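
To make the difficulty concrete: the bytes read from the child process carry no label saying which encoding produced them. A minimal sketch follows, in plain C++ without Qt, of the best a consumer can do without outside knowledge, namely checking whether the bytes happen to be well-formed UTF-8. The function name looksLikeUtf8 is hypothetical and this is only a heuristic, not a reliable detector.

```cpp
#include <cstddef>
#include <string>

// Heuristic only: returns true if `bytes` is well-formed UTF-8.
// Rejects overlong forms, surrogates and values above U+10FFFF.
// Hypothetical helper, not part of Qt or any other library.
bool looksLikeUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {            // plain ASCII
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false;          // stray continuation or invalid lead byte
        if (i + len > n)
            return false;           // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;      // overlong
        if (len == 4 && cp < 0x10000) return false;    // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;
        i += len;
    }
    return true;
}
```

The check can rule UTF-8 out (a lone 0xE9 byte, as Windows-1252 would write "é", fails it), but it cannot rule a legacy code page in or out: the bytes 0xC3 0xA9 pass as UTF-8 and are equally valid Windows-1252 text. That is why the encoding has to be agreed on by other means.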

> We did enforce it on Unix systems though with Qt 6. I do believe we can over
> time enforce it on windows as well, or at least make it the default.

In time, I agree. But we are right now where Unix was in 2003-2005, and with 
differences. For Unix systems, there's no UTF-16 API, so the equivalent 
commands of the above could afford to be encoding-agnostic, so they were a 
pass-through of what the filesystem offered. In fact, it was only Qt 
applications that had problems because we converted to UTF-16 back in 3.0 
(since 2.0) -- that is STILL a complaint we've often heard about our FS API.

> > But I think we should:
> > a) do it for our own applications, since we do know our own code
> > b) advise users somehow that they should opt-in to this
> > c) decide if we want to change from opt-in to opt-out in the medium term
> > (7.0 for example)
> > 
> > d) decide if we want to enforce it in the long-term
> > 
> > Option (d) depends on (c). Option (c) informs whether we need a Qt CMake
> > API or whether we can simply say upstream CMake should handle it.
> 
> I think this should be the goal, but I’d vote for a slightly faster
> schedule.
>
> (a) and (b) are things we should be able to do right now. All our apps work
> fine on Unix systems with a utf8 locale, so there should be relatively few
> problems doing the switch on Windows. The only thing this requires is a bit
> of CMake infrastructure work (that I believe has been mostly done already),
> and some documentation for our users.
> 
> (c) is something we should also announce with a time schedule right now. I
> would go and do this either for 6.8 or 6.9 (ie with the next LTS release or
> directly afterwards). If we announce it now, it gives our users 1.5 to 2
> years to adopt (and they can always opt out afterwards).

I don't think that's realistic because I think we'll find issues. I think we 
need to do the conversion of our own applications and tools first, figure out 
what the issues are for ourselves, before we make time promises.

I expect we'll need more than 1.5 years of advance notice that the opt-in will 
change to opt-out.

> (d) is something I would do for Qt 7, as that’s the correct time to do those
> changes and clean up our code base

I also think it's unrealistic for the same reason. That's a 4-6 year leniency, 
for something that Unix took 17 and had a single system-wide encoding (Windows 
has three).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-04-18 Thread Edward Welbourne via Development

On 17 Apr 2023, at 18:16, Thiago Macieira  wrote:
>> But anything that goes through QIODevice::read or write (QProcess,
>> QFile, Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on
>> what that encoding is.

And that's a cross-platform problem for anyone who has to consume data
produced by a (presumably non-Qt) source that's using legacy codecs.
At present our answer is to use Qt-with-ICU or some separate codec-converter.

>> [snip] What has changed is that the Windows API has matured to the
>> point that this is now a viable choice (previously, it was
>> experimental and known to cause issues). But it's still an
>> application choice; we can't enforce it.

But we *can* document how to do it as part of our "how to package your
application" instructions, thereby encouraging users of Qt to do so.

>> But I think we should:
>> a) do it for our own applications, since we do know our own code
>> b) advise users somehow that they should opt-in to this
>> c) decide if we want to change from opt-in to opt-out in the medium
>> term (7.0 for example)
>> d) decide if we want to enforce it in the long-term
>>
>> Option (d) depends on (c). Option (c) informs whether we need a Qt
>> CMake API or whether we can simply say upstream CMake should handle
>> it.

Lars Knoll (18 April 2023 09:46) replied
> I think this should be the goal, but I’d vote for a slightly faster
> schedule.
>
> (a) and (b) are things we should be able to do right now.

Sounds sensible to me.

Eddy.

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-04-18 Thread Lars Knoll

> On 17 Apr 2023, at 18:16, Thiago Macieira  wrote:
> 
> On Monday, 20 March 2023 08:44:30 CDT Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>> 
>> [0]
>> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>> 
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>> 
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>> 
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
> 
> That only works for the file names, not the file contents and other channels. 
> For QProcess, we're slightly fortunate that we have UTF-16 API, so the 
> encoding that the other application uses for its command-line is irrelevant 
> for us.
> 
> But anything that goes through QIODevice::read or write (QProcess, QFile,
> Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that
> encoding is. Usually for sockets, the protocol is binary and obviates the
> problem. For files, some file formats help. But in particular for
> communicating with another process, there's no reliable way.

Communicating through a socket will always require that both sides agree on the 
encoding. That’s not really anything new. 

The question is how they encode the data when writing to the socket. If they 
use QTextStream, the data will by default get written in utf8 already today 
(since Qt 6.0). If they explicitly convert the QString to and from a specific 
encoding using QStringConverter/QTextCodec nothing bad will happen either.

So the remaining problem comes when they use QString::to/fromLocal8Bit(), as 
that might change from some windows locale to utf8. Not a problem when 
communicating with a socket between two Qt apps, but might be an issue when 
storing data in a file or communicating with an app that doesn’t use Qt.

But we could consider that a user error, as you really shouldn’t use local8bit 
for anything else than stdin/out and interfacing with 8bit system APIs.

> 
>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>> 
>> [2]
>> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
> 
> That was already known at the time, in 2019. What has changed is that the 
> Windows API has matured to the point that this is now a viable choice 
> (previously, it was experimental and known to cause issues). But it's still
> an application choice; we can't enforce it.

We did enforce it on Unix systems though with Qt 6. I do believe we can over 
time enforce it on windows as well, or at least make it the default.
> 
>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
> 
> Sorry, no, we can't force users to do it because we don't know if their code 
> is safe.
> 
> But I think we should:
> a) do it for our own applications, since we do know our own code
> b) advise users somehow that they should opt-in to this
> c) decide if we want to change from opt-in to opt-out in the medium term (7.0 
>  for example)
> d) decide if we want to enforce it in the long-term
> 
> Option (d) depends on (c). Option (c) informs whether we need a Qt CMake API 
> or whether we can simply say upstream CMake should handle it.

I think this should be the goal, but I’d vote for a slightly faster schedule. 

(a) and (b) are things we should be able to do right now. All our apps work 
fine on Unix systems with a utf8 locale, so there should be relatively few 
problems doing the switch on Windows. The only thing this requires is a bit of 
CMake infrastructure work (that I believe has been mostly done already), and 
some documentation for our users.

(c) is something we should also announce with a time schedule right now. I 
would go and do this either for 6.8 or 6.9 (ie with the next LTS release or 
directly afterwards). If we announce it now, it gives our users 1.5 to 2 years 
to adopt (and they can always opt out afterwards).

(d) is something I would do for Qt 7, as that’s the correct time to do those 
changes and clean up our code base

Cheers,
Lars


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-04-17 Thread Thiago Macieira

On Monday, 20 March 2023 08:44:30 CDT Edward Welbourne wrote:
> Thiago Macieira (31 October 2019 22:11) wrote [0]:
> > This RFC (...) is meant to discuss how we'll deal with locales on Unix
> > systems on Qt 6. This does not apply to Windows because on Windows we
> > cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
> 
> [0]
> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
> 
> The GNU make mailing list currently has a thread (starts at [1]) about
> handling of encodings on Windows.
> 
> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
> 
> The discussion there seems to indicate that setting the system code-page
> to UTF-8 can be done in a way that interoperates gracefully with other
> processes and the file system, presumably thanks to the system being
> substantially UTF-16-based, so all 8-bit encodings go via that anyway.

That only works for the file names, not the file contents and other channels. 
For QProcess, we're slightly fortunate that we have UTF-16 API, so the 
encoding that the other application uses for its command-line is irrelevant 
for us.

But anything that goes through QIODevice::read or write (QProcess, QFile, 
Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that 
encoding is. Usually for sockets, the protocol is binary and obviates the 
problem. For files, some file formats help. But in particular for communicating 
with another process, there's no reliable way.

> The means to achieve this appear [2] to hinge on setting the active
> codepage for the application in a manifest file, that it gets combined
> with after it is linked.
> 
> [2]
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

That was already known at the time, in 2019. What has changed is that the 
Windows API has matured to the point that this is now a viable choice 
(previously, it was experimental and known to cause issues). But it's still an 
application choice; we can't enforce it.

> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

Sorry, no, we can't force users to do it because we don't know if their code 
is safe.

But I think we should:
a) do it for our own applications, since we do know our own code
b) advise users somehow that they should opt-in to this
c) decide if we want to change from opt-in to opt-out in the medium term (7.0 
  for example)
d) decide if we want to enforce it in the long-term

Option (d) depends on (c). Option (c) informs whether we need a Qt CMake API 
or whether we can simply say upstream CMake should handle it.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Thiago Macieira

On Wednesday, 22 March 2023 09:48:05 HST Volker Hilsheimer via Development 
wrote:
> Even if one Qt 5 application and one Qt 6 application exchange data over a
> local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5
> application continues to run with the system code page, then the Qt 6
> application starting to send UTF-8 encoded data will break this.

QLocalSocket is very rare on Windows. And any decent socket code that is 
prepared to work over networks has either used proper 8-bit tagging to 
indicate the encoding (since 2001) or plain UTF-8 (since 2003).

The console is already a mess on Windows because it's not just the ACP for 
Win32 "A" API, but also the legacy DOS encoding (the mess that renders my 
middle name JosÚ or JosΘ). Since that is already a mess, I don't particularly 
find it problematic to see JosÃ© now... wouldn't be the first time. Most
Windows applications aren't console applications so this is a limited
issue. It's also time-limited: those issues should smooth out easily with
proper terminal applications, which is how we solved it in the Unix world
too.
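
The three garbled spellings above can be reproduced mechanically. A minimal sketch, in plain C++ without Qt: the byte 0xE9 (how Windows-1252 writes "é") reads as "Ú" in CP850 and "Θ" in CP437, while the UTF-8 bytes 0xC3 0xA9 read as "Ã©" in Windows-1252. The names CodePage, appendUtf8 and decodeLegacy are hypothetical, and the CP850/CP437 tables are deliberately reduced to the single byte needed for this illustration.

```cpp
#include <string>

// Hypothetical names for illustration only.
enum class CodePage { Windows1252, Cp850, Cp437 };

// Append one code point (BMP only, which is enough here) as UTF-8.
static void appendUtf8(std::string &out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Interpret single-byte text under the given code page and return UTF-8.
// Bytes this tiny sketch has no mapping for become U+FFFD.
std::string decodeLegacy(const std::string &bytes, CodePage page)
{
    std::string out;
    for (unsigned char b : bytes) {
        char32_t cp = 0xFFFD;
        if (b < 0x80)
            cp = b;                       // ASCII is shared by all three
        else if (page == CodePage::Windows1252 && b >= 0xA0)
            cp = b;                       // 0xA0..0xFF match Latin-1
        else if (page == CodePage::Cp850 && b == 0xE9)
            cp = 0x00DA;                  // Ú
        else if (page == CodePage::Cp437 && b == 0xE9)
            cp = 0x0398;                  // Θ
        appendUtf8(out, cp);
    }
    return out;
}
```

Calling decodeLegacy("Jos\xE9", CodePage::Cp850) gives the first spelling, CodePage::Cp437 the second, and decodeLegacy("Jos\xC3\xA9", CodePage::Windows1252) the third.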

No, the far more likely scenario is interchange via files and via pipes to 
child processes. So yes, finding out what the legacy ACP is might be a useful 
piece of information. It shouldn't be the toLocal8Bit encoding, but it should 
be available should the need arise.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher


Am 22.03.2023 um 20:48 schrieb Volker Hilsheimer:

Indeed, the many hits in the sql code are mostly from warning output, thanks 
for checking.

But that Postgres supports UTF-8 doesn’t mean that an existing server is also 
configured to use it. If a server is configured to work with e.g. ISO_8859_5 
encoding, because all Qt clients (which are likely middleware servers, so fully 
controlled) run on Windows machines with a corresponding code page, then Qt 
deciding to encode in UTF-8 instead will break things, won’t it? And SQL is 
just one example.


No, the client encoding is completely unrelated to the encoding on the
server and the database. All three can differ. Even informix supported
this already 15 years ago iirc. The conversion happens between the
client and server.


Christian


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Volker Hilsheimer via Development

> On 22 Mar 2023, at 18:58, Christian Ehrlicher  wrote:
> 
> Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:
>>  But we use toLocal8Bit in plenty of cases as well. For instance in our Qt 
>> SQL APIs.
> 
> The only plugin which really uses toLocal8Bit() is the IBase - Plugin.
> Postgres is using it as fallback but according to the docs the utf-8
> encoding is supported since at least PostgreSQL 7.3 so the non utf-8 part
> should be removed.
> 
> The other usages are for qWarning() output.
> 
> 
> Will take a look at the IBase stuff to see if we can replace it.

Indeed, the many hits in the sql code are mostly from warning output, thanks 
for checking.

But that Postgres supports UTF-8 doesn’t mean that an existing server is also 
configured to use it. If a server is configured to work with e.g. ISO_8859_5 
encoding, because all Qt clients (which are likely middleware servers, so fully 
controlled) run on Windows machines with a corresponding code page, then Qt 
deciding to encode in UTF-8 instead will break things, won’t it? And SQL is 
just one example.

Even if one Qt 5 application and one Qt 6 application exchange data over a 
local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5 
application continues to run with the system code page, then the Qt 6 
application starting to send UTF-8 encoded data will break this.

Volker


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher



Am 22.03.2023 um 18:58 schrieb Christian Ehrlicher:

Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:

  But we use toLocal8Bit in plenty of cases as well. For instance in
our Qt SQL APIs.


The only plugin which really uses toLocal8Bit() is the IBase - Plugin.


Correction: it's only used during open() and for the event notification.


Cheers,

Christian


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher


Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:

  But we use toLocal8Bit in plenty of cases as well. For instance in our Qt SQL 
APIs.


The only plugin which really uses toLocal8Bit() is the IBase - Plugin.
Postgres is using it as fallback but according to the docs the utf-8
encoding is supported since at least PostgreSQL 7.3 so the non utf-8 part
should be removed.

The other usages are for qWarning() output.


Will take a look at the IBase stuff to see if we can replace it.


Cheers,

Christian


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Thiago Macieira

On Wednesday, 22 March 2023 01:07:12 HST Alvin Wong via Development wrote:
> In reality, most of the debug messages are ASCII, so this issue rarely
> affects anything and I consider it just "a mild annoyance".

And also a Not Our Bug issue.

First, the debuggers should opt in to UTF-16 support, if they can. If they 
can't, they should be updated to understand CP_UTF8 manifest executables, if 
they are real debuggers.

That leaves debugview.exe which is not a debugger and therefore doesn't know 
where the messages are coming from. This should reduce the annoyance level.

Question: which category does Qt Creator fall into?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Volker Hilsheimer via Development

> On 22 Mar 2023, at 12:07, Alvin Wong via Development 
>  wrote:
> On 22/3/2023 17:58, Lars Knoll wrote:
>> Hi,
>> 
>> 
>>> On 21 Mar 2023, at 17:46, Alvin Wong via Development 
>>>  wrote:
>>> 
>>> Hi,
>>> 
>>> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only 
>>> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
>>> 
>>> Qt itself should work fine after the bug in QStringConverter had been fixed 
>>> [1] a while back. (You can also refer to the linked mail thread. [2]) 
>>> However, as this bug has shown, any code that uses `MultiByteToWideChar` 
>>> incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in 
>>> which each character is formed by no more than two bytes will break. 
>>> Therefore, before switching to UTF-8 as the ACP, application developers 
>>> have to check their code and other libraries to make sure everything will 
>>> still work properly after the switch.
>>> 
>>> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
>>> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
>>> 
>>> About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
>>> compiling with MSVC, you are almost always using UCRT so it should be fine.
>>> 
>>> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, 
>>> the whole toolchain is already configured for a specific CRT. Usually it 
>>> will be the system MSVCRT. (If it's configured for UCRT, the toolchain 
>>> author will usually make it clear, because compiled programs will not run 
>>> out-of-the-box on Windows 8.1 or earlier.) I did not run tests myself, but 
>>> I would not trust MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and 
>>> llvm-mingw [4] are some examples of mingw-w64 toolchains that ship UCRT 
>>> versions.
>>> 
>>> [3]: https://github.com/niXman/mingw-builds-binaries/releases
>>> [4]: https://github.com/mstorsjo/llvm-mingw
>>> 
>>> There are two more problems with enabling UTF-8 ACP using the manifest that 
>>> I have encountered so far. When a process is running with UTF-8 ACP, there 
>>> seems to be no API available to get the native system ACP. This can be an 
>>> issue if, for example some external tools write files using the system ACP 
>>> and your program wants to read those files. The other problem (a mild 
>>> annoyance) is that some debuggers which aren't using updated APIs (gdb for 
>>> example) do not capture `OutputDebugString` messages in the correct 
>>> encoding, which affects QDebug output.
>>> 
>>> 
>> I’ve looked into that one when we did the work for Qt 6. The console has its 
>> own code page that can be set independently from the app, and I believe also 
>> independently from the system code page. qDebug() should be mostly fine, as 
>> we’re using OutputDebugStringW() internally and let Windows handle this mess.
>> 
>> What it does affect is writing to stdout/err and OutputDebugStringA(). 
>> 
> It is unfortunately a bit more messy. OutputDebugString communicates with the 
> debugger via a debug event which contains an address, then the debugger reads 
> the debug message from the memory space of the debuggee process.
> The documentation of OutputDebugStringW [1] states:
> "In the past, the operating system did not return Unicode strings through 
> OutputDebugStringW (ASCII strings were returned instead). To force 
> OutputDebugStringW to return Unicode strings, debuggers are required to call 
> the WaitForDebugEventEx function to opt into the new behavior. In this way, 
> the operating system knows that the debugger supports Unicode and is 
> specifically opting into receiving Unicode strings."
> "OutputDebugStringW converts the specified string based on the current system 
> locale information and passes it to OutputDebugStringA to be displayed. As a 
> result, some Unicode characters may not be displayed correctly."
> What happens with a debugger that does not call `WaitForDebugEventEx` (e.g. 
> gdb) is this: The debuggee calls OutputDebugStringW, which converts the debug 
> string to ACP (UTF-8 in this case) to be passed to OutputDebugStringA. Then 
> the debugger receives the event and tries to read the debug string from the 
> debuggee as ACP, but the debugger thinks ACP is the system ACP (Windows-1252, 
> CP950 or whatever) so it ends up displaying mojibake. The same also happens 
> with Sysinternals DebugView.
> In reality, most of the debug messages are ASCII, so this issue rarely 
> affects anything and I consider it just "a mild annoyance".
> [1]: 
> https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw
>> 
>>> (Console output encoding is separate from the ACP, so one might also need 
>>> to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)
>>> 
>> Setting the code page for console output should help when writing to 
>> stdout/err. It’ll require a bit of testing again (it’s been a while since I 
>> looked into i

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Alvin Wong via Development

Hi,

I’ve looked into that one when we did the work for Qt 6. The console has its
own code page that can be set independently from the app, and I believe also
independently from the system code page. qDebug() should be mostly fine, as
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA().

It is unfortunately a bit more messy. OutputDebugString communicates
with the debugger via a debug event which contains an address, then the
debugger reads the debug message from the memory space of the debuggee
process.

The documentation of OutputDebugStringW [1] states:

"In the past, the operating system did not return Unicode strings
through OutputDebugStringW (ASCII strings were returned instead). To
force OutputDebugStringW to return Unicode strings, debuggers are
required to call the WaitForDebugEventEx function to opt into the
new behavior. In this way, the operating system knows that the
debugger supports Unicode and is specifically opting into receiving
Unicode strings."

"OutputDebugStringW converts the specified string based on the
current system locale information and passes it to
OutputDebugStringA to be displayed. As a result, some Unicode
characters may not be displayed correctly."

What happens with a debugger that does not call `WaitForDebugEventEx`
(e.g. gdb) is this: The debuggee calls OutputDebugStringW, which
converts the debug string to ACP (UTF-8 in this case) to be passed to
OutputDebugStringA. Then the debugger receives the event and tries to
read the debug string from the debuggee as ACP, but the debugger thinks
ACP is the system ACP (Windows-1252, CP950 or whatever) so it ends up
displaying mojibake. The same also happens with Sysinternals DebugView.

In reality, most of the debug messages are ASCII, so this issue rarely
affects anything and I consider it just "a mild annoyance".

[1]:
https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw
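
The reason ASCII-only messages are unaffected is that bytes below 0x80 mean the same thing in UTF-8 and in the legacy ANSI code pages mentioned above, so a debugger that guesses the wrong ACP still shows them correctly. A minimal sketch of that check, in plain C++ (the function name isPureAscii is hypothetical):

```cpp
#include <algorithm>
#include <string>

// True if every byte is below 0x80, i.e. the message reads the same
// whether the receiver decodes it as UTF-8 or as Windows-1252/CP950.
bool isPureAscii(const std::string &message)
{
    return std::all_of(message.begin(), message.end(),
                       [](unsigned char c) { return c < 0x80; });
}
```

Only messages for which this returns false, such as ones containing file names or user-visible text, can end up as mojibake in such a debugger.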

Cheers,
Alvin

On 22/3/2023 17:58, Lars Knoll wrote:

Hi,

On 21 Mar 2023, at 17:46, Alvin Wong via
Development wrote:

Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only thing
needed to enable UTF-8 as the ANSI code page (ACP) for the process.

Qt itself should work fine after the bug in QStringConverter had been fixed [1]
a while back. (You can also refer to the linked mail thread. [2]) However, as
this bug has shown, any code that uses `MultiByteToWideChar` incorrectly or
wrongly assumes that `CP_ACP` always refers to a charset in which each
character is formed by no more than two bytes will break. Therefore, before
switching to UTF-8 as the ACP, application developers have to check their code
and other libraries to make sure everything will still work properly after the
switch.

[1]:https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]:https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
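
As an illustration of the two-byte assumption: code that converts data chunk by chunk and holds back at most one trailing byte is fine for double-byte code pages, but a UTF-8 character can be up to four bytes long, so up to three bytes may need to be held back. A minimal sketch in plain C++ follows; the function names are hypothetical and this is not the code from the fix in [1].

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence announced by a lead byte,
// or 0 if the byte cannot start a sequence.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Number of bytes at the end of `chunk` that begin a UTF-8 sequence
// which is not yet complete. A chunked converter should keep these
// bytes and prepend them to the next chunk. Result is 0..3.
std::size_t incompleteTailLength(const std::string &chunk)
{
    const std::size_t n = chunk.size();
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char b = static_cast<unsigned char>(chunk[n - back]);
        if ((b & 0xC0) == 0x80)
            continue;               // continuation byte, look further back
        const std::size_t need = utf8SequenceLength(b);
        return need > back ? back : 0;
    }
    return 0;                       // no lead byte found in the last 3 bytes
}
```

For a chunk that ends after the first three bytes of a four-byte character, incompleteTailLength returns 3, which a converter written for at-most-two-byte characters would never expect.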

About the CRT, it is true that only UCRT fully supports UTF-8 locale. When
compiling with MSVC, you are almost always using UCRT so it should be fine.

MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, the
whole toolchain is already configured for a specific CRT. Usually it will be
the system MSVCRT. (If it's configured for UCRT, the toolchain author will
usually make it clear, because compiled programs will not run out-of-the-box on
Windows 8.1 or earlier.) I did not run tests myself, but I would not trust
MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are some
examples of mingw-w64 toolchains that ship UCRT versions.

[3]:https://github.com/niXman/mingw-builds-binaries/releases
[4]:https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest that I
have encountered so far. When a process is running with UTF-8 ACP, there seems
to be no API available to get the native system ACP. This can be an issue if,
for example some external tools write files using the system ACP and your
program wants to read those files. The other problem (a mild annoyance) is
that some debuggers which aren't using updated APIs (gdb for example) do not
capture `OutputDebugString` messages in the correct encoding, which affects
QDebug output.

What it does affect is writing to stdout/err and OutputDebugStringA().

(Console output encoding is separate from the ACP, so one might also need to
call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)
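
A minimal sketch of what such a start-up check might look like, assuming the manifest has already been embedded: it asks Win32 for the process ACP via GetACP() and, if that is UTF-8, also switches the console output code page. The helper names are hypothetical, the non-Windows branches are placeholders so the snippet compiles elsewhere, and whether the console call is needed at all is exactly the fuzzy detail mentioned above.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// 65001 is the numeric value of the Win32 constant CP_UTF8.
constexpr unsigned kUtf8CodePage = 65001;

bool isUtf8CodePage(unsigned codePage)
{
    return codePage == kUtf8CodePage;
}

// Process ANSI code page as reported by Win32; 0 where there is no
// such concept (placeholder value for non-Windows builds).
unsigned activeAnsiCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    return 0;
#endif
}

// If the process ACP is UTF-8, also set the console output code page
// to UTF-8. Returns true only when both conditions hold.
bool prepareUtf8Console()
{
#ifdef _WIN32
    return isUtf8CodePage(activeAnsiCodePage())
        && SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return false;   // placeholder: no Win32 console here
#endif
}
```

Note that this only reports the process ACP; as described above, it does not reveal what the native system ACP would have been without the manifest.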

Setting the code page for console output should help when writing to

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Lars Knoll

Hi,

> On 21 Mar 2023, at 17:46, Alvin Wong via Development 
>  wrote:
> 
> Hi,
> 
> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only 
> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
> 
> Qt itself should work fine after the bug in QStringConverter had been fixed 
> [1] a while back. (You can also refer to the linked mail thread. [2]) 
> However, as this bug has shown, any code that uses `MultiByteToWideChar`
> incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in
> which each character is formed by no more than two bytes will break.
> Therefore, before switching to UTF-8 as the ACP, application developers have 
> to check their code and other libraries to make sure everything will still 
> work properly after the switch.
> 
> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
> 
> About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
> compiling with MSVC, you are almost always using UCRT so it should be fine.
> 
> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, 
> the whole toolchain is already configured for a specific CRT. Usually it will 
> be the system MSVCRT. (If it's configured for UCRT, the toolchain author will 
> usually make it clear, because compiled programs will not run out-of-the-box 
> on Windows 8.1 or earlier.) I did not run tests myself, but I would not trust 
> MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are 
> some examples of mingw-w64 toolchains that ship UCRT versions.
> 
> [3]: https://github.com/niXman/mingw-builds-binaries/releases
> [4]: https://github.com/mstorsjo/llvm-mingw
> 
> There are two more problems with enabling UTF-8 ACP using the manifest that I 
> have encountered so far. When a process is running with UTF-8 ACP, there 
> seems to be no API available to get the native system ACP. This can be an 
> issue if, for example, some external tools write files using the system ACP
> and your program wants to read those files. The other problem (a mild
> annoyance) is that some debuggers that do not use updated APIs (gdb, for
> example) do not capture `OutputDebugString` messages in the correct
> encoding, which affects QDebug output.
> 
I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA(). 

> (Console output encoding is separate from the ACP, so one might also need to 
> call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Setting the code page for console output should help when writing to
stdout/err. It’ll require a bit of testing again (it’s been a while since I
looked into it), but I believe the console mostly handled this fine,
independent of the code page it uses internally (i.e. Windows would
recode the string).
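
A minimal sketch of the console side of this, assuming the program writes
UTF-8 bytes to stdout itself. `SetConsoleOutputCP` and `CP_UTF8` are the
Win32 names mentioned in this thread; the helper names
`requestUtf8ConsoleOutput` and `writeUtf8Line` are made up for illustration,
and the non-Windows branch is only a placeholder so the sketch builds
everywhere. It has not been tested against the recoding behaviour described
above.

```cpp
#include <cstdio>
#ifdef _WIN32
#  include <windows.h>
#endif

// Ask the console to interpret this process's 8-bit output as UTF-8.
// Returns false if Windows rejects the request (for example, when no
// console is attached to the process).
bool requestUtf8ConsoleOutput()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true; // assumption for this sketch: nothing to do outside Windows
#endif
}

// Write UTF-8 bytes to stdout unchanged; returns true on success.
bool writeUtf8Line(const char *utf8)
{
    return std::fputs(utf8, stdout) >= 0 && std::fputc('\n', stdout) != EOF;
}
```

A caller would invoke `requestUtf8ConsoleOutput()` once at startup and then
use `writeUtf8Line()` (or any other byte-oriented output) as usual. This is
independent of the ACP chosen in the manifest.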

Cheers,
Lars

> 
> Cheers,
> Alvin
> 
> 
> On 20/3/2023 21:44, Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>> [0] 
>> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>> 
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>> 
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>> 
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>> 
>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>> 
>> [2] 
>> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>> 
>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>> 
>>  Eddy.
>> 

-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-21 Thread Alvin Wong via Development


Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only 
thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.


Qt itself should work fine, since the bug in QStringConverter was fixed
[1] a while back. (You can also refer to the linked mail thread [2].)
However, as this bug has shown, any code that uses `MultiByteToWideChar`
incorrectly, or wrongly assumes that `CP_ACP` always refers to a charset
in which each character is formed by no more than two bytes, will break.
Therefore, before switching to UTF-8 as the ACP, application developers
have to check their code and other libraries to make sure everything
will still work properly after the switch.
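
To illustrate why the "no more than two bytes" assumption stops holding, here
is a small portable sketch that measures how many bytes a single character
can occupy in UTF-8. It does not reproduce the QStringConverter bug itself;
the function names are invented for this example, and the input is assumed to
be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid lead byte).
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       // ASCII
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;
    return 0;
}

// Largest number of bytes any single character occupies in `text`.
// Assumes well-formed UTF-8; stray bytes are skipped one at a time.
std::size_t longestUtf8Sequence(const std::string &text)
{
    std::size_t longest = 0;
    for (std::size_t i = 0; i < text.size();) {
        const std::size_t n = utf8SequenceLength(static_cast<unsigned char>(text[i]));
        if (n == 0) {
            ++i;                        // not a lead byte, skip it
            continue;
        }
        if (n > longest)
            longest = n;
        i += n;
    }
    return longest;
}
```

A buffer sized for "two bytes per character", or a decoder that keeps at most
one pending byte between calls, is too small for the 3- and 4-byte cases this
reports (for example U+20AC and U+1F600).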


[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html

About the CRT, it is true that only UCRT fully supports UTF-8 locale. 
When compiling with MSVC, you are almost always using UCRT so it should 
be fine.


MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 
toolchain, the whole toolchain is already configured for a specific CRT. 
Usually it will be the system MSVCRT. (If it's configured for UCRT, the 
toolchain author will usually make it clear, because compiled programs 
will not run out-of-the-box on Windows 8.1 or earlier.) I did not run 
tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully. 
mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64 
toolchains that ship UCRT versions.


[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest 
that I have encountered so far. When a process is running with UTF-8 
ACP, there seems to be no API available to get the native system ACP. 
This can be an issue if, for example, some external tools write files
using the system ACP and your program wants to read those files. The
other problem (a mild annoyance) is that some debuggers that do not
use updated APIs (gdb, for example) do not capture
`OutputDebugString` messages in the correct encoding, which affects
QDebug output.


(Console output encoding is separate from the ACP, so one might also 
need to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit 
fuzzy to me.)


Cheers,
Alvin


On 20/3/2023 21:44, Edward Welbourne wrote:

Thiago Macieira (31 October 2019 22:11) wrote [0]:

This RFC (...) is meant to discuss how we'll deal with locales on Unix
systems on Qt 6. This does not apply to Windows because on Windows we
cannot reasonably be expected to use UTF-8 for the 8-bit encoding.

[0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html

The GNU make mailing list currently has a thread (starts at [1]) about
handling of encodings on Windows.

[1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html

The discussion there seems to indicate that setting the system code-page
to UTF-8 can be done in a way that interoperates gracefully with other
processes and the file system, presumably thanks to the system being
substantially UTF-16-based, so all 8-bit encodings go via that anyway.

The means to achieve this appear [2] to hinge on setting the active
codepage for the application in a manifest file, that it gets combined
with after it is linked.

[2] 
https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

There do appear to be some vagaries still, it may depend on UCRT and I'm
not sure I've really understood it all, but it looks like we may, in
time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

Eddy.


--
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-21 Thread Ilya Fedin

On Mon, 20 Mar 2023 13:44:30 +
Edward Welbourne via Development  wrote:

> The means to achieve this appear [2] to hinge on setting the active
> codepage for the application in a manifest file, that it gets combined
> with after it is linked.
> 
> [2]
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

setlocale has support to set UTF-8 locale as well:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?source=recommendations&view=msvc-170#utf-8-support
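
A hedged sketch of how a program might try this at startup. The helper name
`selectFirstLocale` is hypothetical. The candidate strings are assumptions:
".UTF8" is, as far as I can tell, the form the linked Microsoft page
describes for the UCRT, while "C.UTF-8" and "en_US.UTF-8" are common on Unix
systems but not guaranteed to be installed.

```cpp
#include <clocale>
#include <initializer_list>
#include <string>

// Try each candidate locale name in order and return the first one that
// setlocale() accepts, or an empty string if none of them is available.
// On success the process-wide C locale is left set to that candidate.
std::string selectFirstLocale(std::initializer_list<const char *> candidates)
{
    for (const char *name : candidates) {
        if (std::setlocale(LC_ALL, name) != nullptr)
            return name;
    }
    return std::string();
}
```

For example, `selectFirstLocale({".UTF8", "C.UTF-8", "en_US.UTF-8"})` returns
whichever name the C runtime in use accepts first. Note that this only
changes the CRT locale; it is separate from the ACP set through the manifest.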
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-20 Thread Thiago Macieira

On Monday, 20 March 2023 03:44:30 HST Edward Welbourne via Development wrote:
> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

That is indeed the long-term objective, both ours and Microsoft's.  The 
question is only when we will be ready.

Do we need to do something to our DLLs? Can we start suggesting the manifest 
flag for user applications with our CMake support (like windeployqt)? And can 
we do it now for our own applications?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering

-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-20 Thread Edward Welbourne via Development

Thiago Macieira (31 October 2019 22:11) wrote [0]:
> This RFC (...) is meant to discuss how we'll deal with locales on Unix
> systems on Qt 6. This does not apply to Windows because on Windows we
> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.

[0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html

The GNU make mailing list currently has a thread (starts at [1]) about
handling of encodings on Windows.

[1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html

The discussion there seems to indicate that setting the system code-page
to UTF-8 can be done in a way that interoperates gracefully with other
processes and the file system, presumably thanks to the system being
substantially UTF-16-based, so all 8-bit encodings go via that anyway.

The means to achieve this appear [2] to hinge on setting the active
codepage for the application in a manifest file, that it gets combined
with after it is linked.

[2] 
https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

There do appear to be some vagaries still, it may depend on UCRT and I'm
not sure I've really understood it all, but it looks like we may, in
time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

Eddy.
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-28 Thread Simon Hausmann

Hi,

AFAICS (from the public coin logs that dump the entire set of environment 
variables), nothing specifically sets any of the locale environment variables. 
In fact, no variables are set. You can check for yourself for example from the 
macOS log from one of the recent qtbase dev integrations:

https://testresults.qt.io/coin/integration/qt/qtbase/tasks/1588073951

The regular Apple Terminal appears to be the entity that sets LC_CTYPE before 
launching the shell - however the CI system is not using the Apple Terminal.

I see two options:

(1) Either assume that macOS is UTF-8.

(2) We add a script to the provisioning of macOS to always set the LC_CTYPE 
environment variable to have the value UTF-8 (or any other environment variable 
that you'd like).

Can you think of any other ways to resolve this?

Simon

From: Development  on behalf of Thiago 
Macieira 
Sent: Tuesday, April 28, 2020 17:42
To: development@qt-project.org 
Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on 
Unix systems

On Tuesday, 28 April 2020 07:20:33 PDT Thiago Macieira wrote:
> On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> > I looked at the patch again and searched a bit around. I think nl_langinfo
> > is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> > all system APIs expect it. I think the CI is well configured and the patch
> > should treat Darwin like Android
>
> nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it
> works just fine. More importantly, setlocale() obeys the LC_ALL behaviour
> to change the locale of the POSIX functions just fine.
>
> What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI
> either unset that or was run from an environment that didn't have it set in
> the first place.

Another possibility is that some script overrode LC_ALL to "C" so as to get
non-localised output. Please fix it to override to "C.UTF-8" or something that
works on a Mac.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-28 Thread Thiago Macieira

On Tuesday, 28 April 2020 08:42:17 PDT Thiago Macieira wrote:
> Another possibility is that some script overrode LC_ALL to "C" so as to get
> non-localised output. Please fix it to override to "C.UTF-8" or something
> that works on a Mac.

Found it: it's the test itself.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-28 Thread Thiago Macieira

On Tuesday, 28 April 2020 07:20:33 PDT Thiago Macieira wrote:
> On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> > I looked at the patch again and searched a bit around. I think nl_langinfo
> > is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> > all system APIs expect it. I think the CI is well configured and the patch
> > should treat Darwin like Android
> 
> nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it
> works just fine. More importantly, setlocale() obeys the LC_ALL behaviour
> to change the locale of the POSIX functions just fine.
> 
> What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI
> either unset that or was run from an environment that didn't have it set in
> the first place.

Another possibility is that some script overrode LC_ALL to "C" so as to get 
non-localised output. Please fix it to override to "C.UTF-8" or something that 
works on a Mac.
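
Why an `LC_ALL=C` override hides an otherwise UTF-8 environment can be
sketched with the POSIX precedence rule: `LC_ALL` wins over `LC_CTYPE`,
which wins over `LANG`, and empty values count as unset. The type alias and
function names below are invented for this illustration.

```cpp
#include <cstdlib>
#include <initializer_list>
#include <map>
#include <string>

using Environment = std::map<std::string, std::string>;

// Locale name that governs character encoding, following POSIX precedence:
// LC_ALL overrides LC_CTYPE, which overrides LANG; empty values are ignored.
std::string effectiveCtypeLocale(const Environment &env)
{
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const auto it = env.find(name);
        if (it != env.end() && !it->second.empty())
            return it->second;
    }
    return "C"; // nothing set: the default "C"/POSIX locale applies
}

// Snapshot of the three variables from the real process environment.
Environment localeEnvironment()
{
    Environment env;
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        if (const char *value = std::getenv(name))
            env[name] = value;
    }
    return env;
}
```

With `LANG=en_US.UTF-8` and `LC_ALL=C`, `effectiveCtypeLocale()` yields "C",
which matches the US-ASCII detection seen in the test failure.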

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-28 Thread Thiago Macieira

On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> I looked at the patch again and searched a bit around. I think nl_langinfo
> is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> all system APIs expect it. I think the CI is well configured and the patch
> should treat Darwin like Android 

nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it works 
just fine. More importantly, setlocale() obeys the LC_ALL behaviour to change 
the locale of the POSIX functions just fine.

What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI 
either unset that or was run from an environment that didn't have it set in the
first place.
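
For reference, a small check along these lines could look like the sketch
below. It assumes a POSIX system (`<langinfo.h>` is not available on
Windows); the function names are made up here and this is not the code from
the patch under review.

```cpp
#include <cctype>
#include <clocale>
#include <string>
#include <langinfo.h> // POSIX only

// Codeset of the locale selected by the environment (LC_ALL, LC_CTYPE, LANG).
// If the environment names an unknown locale, setlocale() fails and the
// codeset of the locale already in effect (normally "C") is reported.
std::string environmentCodeset()
{
    std::setlocale(LC_ALL, ""); // adopt the environment's locale
    const char *codeset = nl_langinfo(CODESET);
    return codeset ? codeset : "";
}

// True for "UTF-8" and common spellings such as "utf8" or "UTF_8".
bool isUtf8Codeset(const std::string &codeset)
{
    std::string key;
    for (unsigned char c : codeset) {
        if (c != '-' && c != '_')
            key += static_cast<char>(std::toupper(c));
    }
    return key == "UTF8";
}
```

Running `isUtf8Codeset(environmentCodeset())` inside the CI environment would
show whether the process really starts with a UTF-8 locale.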

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-27 Thread Simon Hausmann

Hi,

I looked at the patch again and searched a bit around. I think nl_langinfo is 
“broken” on macOS but it doesn’t matter: everything seems to be utf-8, all 
system APIs expect it. I think the CI is well configured and the patch should 
treat Darwin like Android :-)

Simon

On 27.04.2020 at 19:09, Simon Hausmann wrote:

Hi,

I can't really think of anything that's changed in the default macOS setup that 
would affect the locale encoding.

The scripts that are run are here:

https://code.qt.io/cgit/qt/qt5.git/tree/coin/provisioning/qtci-macos-10.14-x86_64

but I'm not even sure that it's possible to "misconfigure" a macOS installation 
to not use utf-8.

Can you think of any setting to check? Or do you have a little test program to 
run to verify?

Simon

From: Development  on behalf of Thiago 
Macieira 
Sent: Monday, April 27, 2020 18:13
To: development@qt-project.org 
Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on 
Unix systems

On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote:
> On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote:
> > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> > QTextCodec support out of QtCore)
> > See also: https://www.python.org/dev/peps/pep-0538/
> >
> > https://www.python.org/dev/peps/pep-0540/
>
> Just sending to the mailing list to get more attention:
>
> The change above cannot integrate because the new warning breaks the QtTest
> self-tests because the environment where the tests are run is not UTF-8. Can
> the CI be fixed, please?

Apologies, I replied thinking the link above was to my change, but that was
Rainer's that has since been superseded by Lars's. The change I want to
integrate is:

https://codereview.qt-project.org/c/qt/qtbase/+/282359

The error from the CI is:

 FAIL!  : tst_Selftests::runSubTest(assert lightxml + stdout junitxml)
'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII,
locale "C") is not UTF-8.
Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems,
reconfigure your locale. See the locale(1) manual for more information.
)

Note this warning is on a Mac, which is a UTF-8 system. Can the CI please set
up the environment properly?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-27 Thread Simon Hausmann

Hi,

I can't really think of anything that's changed in the default macOS setup that 
would affect the locale encoding.

The scripts that are run are here:

https://code.qt.io/cgit/qt/qt5.git/tree/coin/provisioning/qtci-macos-10.14-x86_64

but I'm not even sure that it's possible to "misconfigure" a macOS installation 
to not use utf-8.

Can you think of any setting to check? Or do you have a little test program to 
run to verify?

Simon

From: Development  on behalf of Thiago 
Macieira 
Sent: Monday, April 27, 2020 18:13
To: development@qt-project.org 
Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on 
Unix systems

On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote:
> On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote:
> > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> > QTextCodec support out of QtCore)
> > See also: https://www.python.org/dev/peps/pep-0538/
> >
> > https://www.python.org/dev/peps/pep-0540/
>
> Just sending to the mailing list to get more attention:
>
> The change above cannot integrate because the new warning breaks the QtTest
> self-tests because the environment where the tests are run is not UTF-8. Can
> the CI be fixed, please?

Apologies, I replied thinking the link above was to my change, but that was
Rainer's that has since been superseded by Lars's. The change I want to
integrate is:

https://codereview.qt-project.org/c/qt/qtbase/+/282359

The error from the CI is:

 FAIL!  : tst_Selftests::runSubTest(assert lightxml + stdout junitxml)
'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII,
locale "C") is not UTF-8.
Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems,
reconfigure your locale. See the locale(1) manual for more information.
)

Note this warning is on a Mac, which is a UTF-8 system. Can the CI please set
up the environment properly?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-27 Thread Thiago Macieira

On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote:
> On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote:
> > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> > QTextCodec support out of QtCore)
> > See also: https://www.python.org/dev/peps/pep-0538/
> > 
> > https://www.python.org/dev/peps/pep-0540/
> 
> Just sending to the mailing list to get more attention:
> 
> The change above cannot integrate because the new warning breaks the QtTest
> self-tests because the environment where the tests are run is not UTF-8. Can
> the CI be fixed, please?

Apologies, I replied thinking the link above was to my change, but that was 
Rainer's that has since been superseded by Lars's. The change I want to 
integrate is:

https://codereview.qt-project.org/c/qt/qtbase/+/282359

The error from the CI is:

 FAIL!  : tst_Selftests::runSubTest(assert lightxml + stdout junitxml) 
'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII, 
locale "C") is not UTF-8.
Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems,
reconfigure your locale. See the locale(1) manual for more information.
)

Note this warning is on a Mac, which is a UTF-8 system. Can the CI please set
up the environment properly?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2020-04-26 Thread Thiago Macieira

On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote:
> Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> QTextCodec support out of QtCore)
> See also: https://www.python.org/dev/peps/pep-0538/
> https://www.python.org/dev/peps/pep-0540/

Just sending to the mailing list to get more attention:

The change above cannot integrate because the new warning breaks the QtTest 
self-tests because the environment where the tests are run is not UTF-8. Can 
the CI be fixed, please?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-20 Thread Thiago Macieira

On Sunday, 17 November 2019 01:55:32 CET Thiago Macieira wrote:
> I don't know why QTextCodec is being removed. I don't remember any decisions
> in prior QtCS or this mailing list about removing it. We definitely
> discussed removing the CJK codecs and their big tables and that can still
> be done, with no effect in the API, since QTextCodec is backed by ICU's
> ucnv. We may have discussed removing it, but I don't remember a firm
> decision. And even if it is firm, after looking at the consequences of
> doing so, we may want to reverse our decision.

Update: after talking to Lars during QtCS, he said that he thinks the 
QTextCodec API is poorly designed and should be replaced. I agree. But that 
doesn't mean we need to remove the *functionality*, just deprecate the API.

I'll bring this up during the QtCore session tomorrow to see if we want to 
invest time creating a new API, hopefully for 5.15, so code can begin porting 
before the 6.0 time. That way, we could move QTextCodec out of QtCore.

> 1) QTextCodec in the API
> I think we cannot do without it, it'll have to stay in one way or another.
> So the question reduces to whether it should stay in QtCore or be moved to
> another library. Given the QXmlStreamReader problem above, it's probably
> best to keep it in QtCore, actually.
> 
> QTextCodec has some API limitations but they can be fixed. It's not
> necessary for us to remove it: it's not *that* broken.

This is now TBD, depending on finding a good design and whether it can be done 
incrementally in QTextCodec.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Rainer Keller

> I see no reason why we can't keep the QTextCodec _interface_ in Qt Core,
> together with some interface to register new codecs, provide UTF-* directly,
> and let the "fancy" ones live on in a separate module, plugging them in
> at runtime.

My opinion is the same. Keep QTextCodec in QtCore with only UTF 
encodings. All others, like ICU and the conversion tables, move to a 
module and are only enabled when the users choose to do so.

Rainer
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Thiago Macieira

On Tuesday, 19 November 2019 00:23:52 CET Thiago Macieira wrote:
> I wasn't referring to QTextCodec.
> 
> I was referring to these files:

Sorry, race condition. The Ctrl for Ctrl+F1 was pressed too early and matched 
the Enter for the next line causing Ctrl+Enter (Send).

$ ls -1 src/corelib/codecs/*~*qtextcodec*~*icu*~*utf*~*windows*
src/corelib/codecs/codecs.pri
src/corelib/codecs/codecs.qdoc
src/corelib/codecs/cp949codetbl_p.h
src/corelib/codecs/qbig5codec.cpp
src/corelib/codecs/QBIG5CODEC_LICENSE.txt
src/corelib/codecs/qbig5codec_p.h
src/corelib/codecs/QBKCODEC_LICENSE.txt
src/corelib/codecs/qeucjpcodec.cpp
src/corelib/codecs/QEUCJPCODEC_LICENSE.txt
src/corelib/codecs/qeucjpcodec_p.h
src/corelib/codecs/qeuckrcodec.cpp
src/corelib/codecs/QEUCKRCODEC_LICENSE.txt
src/corelib/codecs/qeuckrcodec_p.h
src/corelib/codecs/qgb18030codec.cpp
src/corelib/codecs/qgb18030codec_p.h
src/corelib/codecs/qiconvcodec.cpp
src/corelib/codecs/qiconvcodec_p.h
src/corelib/codecs/qisciicodec.cpp
src/corelib/codecs/qisciicodec_p.h
src/corelib/codecs/qjiscodec.cpp
src/corelib/codecs/QJISCODEC_LICENSE.txt
src/corelib/codecs/qjiscodec_p.h
src/corelib/codecs/qjpunicode.cpp
src/corelib/codecs/qjpunicode_p.h
src/corelib/codecs/qlatincodec.cpp
src/corelib/codecs/qlatincodec_p.h
src/corelib/codecs/qsimplecodec.cpp
src/corelib/codecs/qsimplecodec_p.h
src/corelib/codecs/qsjiscodec.cpp
src/corelib/codecs/QSJISCODEC_LICENSE.txt
src/corelib/codecs/qsjiscodec_p.h
src/corelib/codecs/qt_attribution.json
src/corelib/codecs/qtsciicodec.cpp
src/corelib/codecs/QTSCIICODEC_LICENSE.txt
src/corelib/codecs/qtsciicodec_p.h


-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Thiago Macieira

On Monday, 18 November 2019 19:48:24 CET André Pönitz wrote:
> > But we should not keep the our codecs (aside from the UTF ones) because of
> > that.
> 
> Why not?
> 
> I see no reason why we can't keep the QTextCodec _interface_ in Qt Core,
> together with some interface to register new codecs, provide UTF-* directly,
> and let the "fancy" ones live on in a separate module, plugging them in at
> runtime.

I wasn't referring to QTextCodec.

I was referring to these files:

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread André Pönitz

On Mon, Nov 18, 2019 at 07:09:30PM +0100, Thiago Macieira wrote:
> On Monday, 18 November 2019 17:05:29 CET Lars Knoll wrote:
> > > On 18 Nov 2019, at 17:00, Kevin Kofler  wrote:
> > > 
> > > Thiago Macieira wrote:
> > > 
> > >> The codecs we want to remove are just big tables of mapping old, legacy
> > >> codecs to UTF-16. We can easily remove those.
> > >> 
> > >> After that, removal of QTextCodec itself is not a big gain.
> > > 
> > > 
> > > So let me ask once again: Is ICU not already a hard requirement for Qt on
> > > 
> > > *nix systems? So why can we not just rely on ICU's tables?
> > 
> > 
> > No, it’s not a hard requirement. And especially for low end embedded
> > systems, we also want to keep it that way.
> 
> But we should not keep our codecs (aside from the UTF ones) because of
> that.

Why not?

I see no reason why we can't keep the QTextCodec _interface_ in Qt Core,
together with some interface to register new codecs, provide UTF-* directly,
and let the "fancy" ones live on in a separate module, plugging them in
at runtime.
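
Roughly what I have in mind, as a sketch in plain C++ rather than actual Qt
code. `Decoder`, `Latin1Decoder` and `CodecRegistry` are hypothetical names,
not QTextCodec's API, and the Latin-1 class merely stands in for a codec that
a separate module would supply.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical codec interface that would stay in the core library.
class Decoder
{
public:
    virtual ~Decoder() = default;
    virtual std::u16string toUtf16(const std::string &bytes) const = 0;
};

// Stand-in for a "fancy" codec provided by a separate module.
class Latin1Decoder : public Decoder
{
public:
    std::u16string toUtf16(const std::string &bytes) const override
    {
        std::u16string out;
        out.reserve(bytes.size());
        for (unsigned char byte : bytes)
            out.push_back(static_cast<char16_t>(byte)); // Latin-1 maps 1:1 to U+0000..U+00FF
        return out;
    }
};

// Registry that modules use to plug their codecs in at runtime.
class CodecRegistry
{
public:
    using Factory = std::function<std::unique_ptr<Decoder>()>;

    void registerCodec(const std::string &name, Factory factory)
    {
        factories_[name] = std::move(factory);
    }

    // Returns nullptr when no module has registered the requested codec.
    std::unique_ptr<Decoder> create(const std::string &name) const
    {
        const auto it = factories_.find(name);
        if (it == factories_.end())
            return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};
```

The core library would ship only the interface, the registry and the UTF
codecs; a lookup for anything else simply fails unless the extra module has
been loaded and has registered its factories.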

Andre'
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Thiago Macieira

On Monday, 18 November 2019 17:05:29 CET Lars Knoll wrote:
> > On 18 Nov 2019, at 17:00, Kevin Kofler  wrote:
> > 
> > Thiago Macieira wrote:
> > 
> >> The codecs we want to remove are just big tables of mapping old, legacy
> >> codecs to UTF-16. We can easily remove those.
> >> 
> >> After that, removal of QTextCodec itself is not a big gain.
> > 
> > 
> > So let me ask once again: Is ICU not already a hard requirement for Qt on
> > 
> > *nix systems? So why can we not just rely on ICU's tables?
> 
> 
> No, it’s not a hard requirement. And especially for low end embedded
> systems, we also want to keep it that way.

But we should not keep our codecs (aside from the UTF ones) because of
that.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Lars Knoll

> On 18 Nov 2019, at 17:00, Kevin Kofler  wrote:
> 
> Thiago Macieira wrote:
>> The codecs we want to remove are just big tables of mapping old, legacy
>> codecs to UTF-16. We can easily remove those.
>> 
>> After that, removal of QTextCodec itself is not a big gain.
> 
> So let me ask once again: Is ICU not already a hard requirement for Qt on 
> *nix systems? So why can we not just rely on ICU's tables?

No, it’s not a hard requirement. And especially for low end embedded systems, 
we also want to keep it that way.

Cheers,
Lars


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Kevin Kofler

Thiago Macieira wrote:
> The codecs we want to remove are just big tables of mapping old, legacy
> codecs to UTF-16. We can easily remove those.
> 
> After that, removal of QTextCodec itself is not a big gain.

So let me ask once again: Is ICU not already a hard requirement for Qt on 
*nix systems? So why can we not just rely on ICU's tables?

Kevin Kofler


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Thiago Macieira

On Monday, 18 November 2019 00:12:19 CET Giuseppe D'Angelo via Development 
wrote:
> I don't know either. Is it to make QtCore smaller? Wasn't the feature
> system ("Qt Lite") supposed to address that? Or is it to make it less of
> a "kitchen sink", and split it in smaller libraries? Could that mean
> having QTextCodec in its own library, and QXmlStreamReader in another
> (that depends on the former)?

The codecs we want to remove are just big tables of mapping old, legacy codecs 
to UTF-16. We can easily remove those.

After that, removal of QTextCodec itself is not a big gain.

> > Related to that is the discussion of whether UTF-8 is the only acceptable
> > locale on Unix systems. If we don't have QTextCodec, then we have to have
> > something fixed for QString::fromLocal8Bit and it would necessarily be
> > UTF-8. But even if we do have QTextCodec, that's still a reasonable
> question: should we assume it is UTF-8? And should we enforce it? Those were
> > the questions in my OP.
> 
> Should fromLocal8Bit be following the locale environment instead
> (LC_CTYPE, LC_MESSAGES or similar)?

That's what it does today. The question is whether we can assume those imply 
UTF-8, like we do when QT_LOCALE_IS_UTF8 is defined.
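
To make the question concrete: whether a locale name "implies UTF-8" can be approximated by looking at its codeset suffix. The following is a minimal sketch, assuming the usual language_TERRITORY.codeset@modifier naming; codesetOf and impliesUtf8 are hypothetical helper names, and a name without a codeset suffix is treated as "not known to be UTF-8" here, which is an assumption rather than anything Qt does.

```cpp
#include <cctype>
#include <string>

// Extracts the codeset part of "language_TERRITORY.codeset@modifier".
// Returns an empty string when the name carries no codeset.
std::string codesetOf(const std::string &locale) {
    const std::string::size_type dot = locale.find('.');
    if (dot == std::string::npos)
        return "";
    const std::string::size_type at = locale.find('@', dot);
    if (at == std::string::npos)
        return locale.substr(dot + 1);
    return locale.substr(dot + 1, at - dot - 1);
}

// True if the codeset is spelled UTF-8, utf8, UTF8 and so on.
bool impliesUtf8(const std::string &locale) {
    std::string normalized;
    for (unsigned char c : codesetOf(locale)) {
        if (c != '-')
            normalized.push_back(static_cast<char>(std::tolower(c)));
    }
    return normalized == "utf8";
}
```

A real check would more likely ask the C library, for example via nl_langinfo(CODESET) after setlocale(LC_ALL, ""), since the name alone does not always reveal the encoding actually in use.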

> > If QTextCodec is not in QtCore, then most likely you can't affect how
> > QtCore and almost all other Qt classes decode 8-bit data into QString,
> > including QTextStream.
> 
> See above -- it also means QTextStream goes in some I/O lib that
> contains or depends on the codecs lib.

Or we remove the ability in QTextStream to specify the codec, which is what 
the proposed change would do. I don't think we can move QTextStream out of 
QtCore.

> Why do we bother about "saving the world"? A misconfigured system is the
> user's mistake. They should be in charge of fixing it in order to
> address the problem.

That is an option and this is what the qFatal I mentioned would do.

> > For #2, the sub-questions of the OP apply:
> >   a) What should Qt 6 assume the locale to be, if no locale is set?
> >   b) In case a non-UTF-8 locale is set, what should we do?
> >   c) Should we propagate our decision to child processes?
> > 
> > My preferences were:
> >   a) C.UTF-8
> >   b) override it to force UTF-8 on the same locale
> >   c) yes
> 
> How about
> 
> a) either C / C.UTF-8, but warning the user; but I'd up the ante, and
> say: just assert/crash.
> 
> b) keep the choice. Silently changing it sounds like a bad idea; we
> should never override the user choices silently.

That means keeping QTextCodec and the ability to work with an arbitrary codec.

> c) no. We shouldn't "fix" subprocesses. They have the right to make
> their own independent decisions.

This is not about fixing the subprocess, but about ensuring that it can talk 
to the current process. And it's only necessary if in (b) we override, 
selecting UTF-8. If we don't override or if we forbid running with a non-UTF-8 
locale, then we don't need to set the environment.
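
For illustration, preferences (a) and (b) boil down to a mapping from the locale name found in the environment to the name Qt would use and export to child processes. A rough sketch, assuming the usual language_TERRITORY.codeset@modifier naming; forceUtf8 is a hypothetical helper, and treating "POSIX" like "C" is my assumption:

```cpp
#include <string>

// Maps a locale name to its UTF-8 counterpart:
//   unset, "C" or "POSIX"      -> "C.UTF-8"        (preference a)
//   "de_DE.ISO-8859-1"         -> "de_DE.UTF-8"    (preference b)
// A modifier such as "@euro" is kept after the codeset.
std::string forceUtf8(const std::string &locale) {
    if (locale.empty() || locale == "C" || locale == "POSIX")
        return "C.UTF-8";

    std::string base = locale;
    std::string modifier;
    const std::string::size_type at = base.find('@');
    if (at != std::string::npos) {
        modifier = base.substr(at);
        base.erase(at);
    }
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);  // drop the old codeset

    return base + ".UTF-8" + modifier;
}
```

Whether the resulting locale is actually installed on the system is a separate question that this sketch does not answer.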

> Or, on the other hand: what is the chance that a system comes without a
> locale set? What is more likely to conclude, that it's an accident or a
> deliberate setting? If it's an accident, why not being *very* verbose
> about it?

It's extremely unlikely that a Qt application, especially a Qt 6 one, will be 
run with no locale set. So if the locale isn't set to UTF-8, then it's 
explicit. The question is whether it was *intentional* to change the codec.

As I've argued time and again, changing the locale to English is standard 
practice in any tool parsing another tool's output. But did they mean to 
change the codec too?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-18 Thread Eike Ziller



> On 18. Nov 2019, at 00:12, Giuseppe D'Angelo via Development 
>  wrote:
> 
> On 17/11/19 01:55, Thiago Macieira wrote:
>> Hi
>> Sorry, it looks like this thread is not progressing in a calm and reasoned
>> manner, the way it was meant to be. And I'm very much to blame. So I apologise
>> for the strong language and passionate opinions. I'm deleting most of what I
>> had written as a reply so we can start over.
>> Let's start with your questions:
>> On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
>>> You have not yet answered
>>> 
>>>   - why this decision was made
>> You know, I don't know. To be frank, I don't know that a decision *was* made.
>> It all started with a change (see OP) about removing QTextCodec from the API
>> and from QtCore. It seemed reasonable enough but it turned up quite a few
>> kinks that hadn't been predicted. One of them, which may still be a
>> showstopper, is QXmlStreamReader's inability to handle XML data encoded in
>> anything except UTF-8, though a thorough search of all XML files in my system
>> turned up exactly zero such files.
>> I don't know why QTextCodec is being removed. I don't remember any decisions
>> in prior QtCS or this mailing list about removing it. We definitely discussed
>> removing the CJK codecs and their big tables and that can still be done, with
>> no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have
>> discussed removing it, but I don't remember a firm decision. And even if it is
>> firm, after looking at the consequences of doing so, we may want to reverse
>> our decision.
> 
> I don't know either. Is it to make QtCore smaller? Wasn't the feature system 
> ("Qt Lite") supposed to address that? Or is it to make it less of a "kitchen 
> sink", and split it in smaller libraries? Could that mean having QTextCodec 
> in its own library, and QXmlStreamReader in another (that depends on the 
> former)?

In QtCore it seems to be used by the MIME database support, and in a 
serialization backend.
So, one would need to think about what to do with these at least.
Then, looking at qtbase, it’s also used for DBUS and in androiddeployqt / 
-testrunner (for e.g. the manifest file), and RCC of course.

>> Why does Qt Creator need other codecs?

Qt Creator is a generic text editor. A generic text editor is expected to be 
able to read and write files in different encodings.

-- 
Eike Ziller
Principal Software Engineer

The Qt Company GmbH
Erich-Thilo-Straße 10
D-12489 Berlin
eike.zil...@qt.io
http://qt.io
Geschäftsführer: Mika Pälsi,
Juha Varelius, Mika Harjuaho
Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 
144331 B


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-17 Thread Giuseppe D'Angelo via Development


On 17/11/19 01:55, Thiago Macieira wrote:

> Hi
>
> Sorry, it looks like this thread is not progressing in a calm and reasoned
> manner, the way it was meant to be. And I'm very much to blame. So I apologise
> for the strong language and passionate opinions. I'm deleting most of what I
> had written as a reply so we can start over.
>
> Let's start with your questions:
>
> On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
>> You have not yet answered
>>
>>   - why this decision was made
>
> You know, I don't know. To be frank, I don't know that a decision *was* made.
> It all started with a change (see OP) about removing QTextCodec from the API
> and from QtCore. It seemed reasonable enough but it turned up quite a few
> kinks that hadn't been predicted. One of them, which may still be a
> showstopper, is QXmlStreamReader's inability to handle XML data encoded in
> anything except UTF-8, though a thorough search of all XML files in my system
> turned up exactly zero such files.
>
> I don't know why QTextCodec is being removed. I don't remember any decisions
> in prior QtCS or this mailing list about removing it. We definitely discussed
> removing the CJK codecs and their big tables and that can still be done, with
> no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have
> discussed removing it, but I don't remember a firm decision. And even if it is
> firm, after looking at the consequences of doing so, we may want to reverse
> our decision.


I don't know either. Is it to make QtCore smaller? Wasn't the feature 
system ("Qt Lite") supposed to address that? Or is it to make it less of 
a "kitchen sink", and split it in smaller libraries? Could that mean 
having QTextCodec in its own library, and QXmlStreamReader in another 
(that depends on the former)?




> Related to that is the discussion of whether UTF-8 is the only acceptable
> locale on Unix systems. If we don't have QTextCodec, then we have to have
> something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8.
> But even if we do have QTextCodec, that's still a reasonable question: should
> we assume it is UTF-8? And should we enforce it? Those were the questions in my
> OP.


Should fromLocal8Bit be following the locale environment instead 
(LC_CTYPE, LC_MESSAGES or similar)?




> 2) QtCore size
> As I said above, removing the legacy codecs we have code for is not a problem.
> They are already disabled in Qt builds where ICU is present, so we'd
> additionally remove them from all other builds. Where ICU is present, there's
> no loss of functionality for user applications, since ICU provides far more
> codecs than we do. For those without ICU, it stands to reason that the user
> chose size so they are aware of the limitations. Plus, one can always
> instantiate their own QTextCodec and add to the list (at least, with today's
> implementation).
>
> If QTextCodec is not in QtCore, then most likely you can't affect how QtCore
> and almost all other Qt classes decode 8-bit data into QString, including
> QTextStream.


See above -- it also means QTextStream goes in some I/O lib that 
contains or depends on the codecs lib.




> and 3) misconfigured locale systems and filename handling
> This is probably the biggest problem. As it is right now, when the locale
> isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode
> any file names with the 8th bit set. Those file names are considered
> filesystem corruption. And yet they are quite commonly created by the user
> outside of English-speaking jurisdictions.


Why do we bother about "saving the world"? A misconfigured system is the 
user's mistake. They should be in charge of fixing it in order to 
address the problem.






>> I get the impression that this thread was not started as an RFC for an
>> open-ended discussion, but as a staged attempt to provide a figleaf for
>> a pre-determined decision.
>
> That was not the intention. That's why I am re-starting it so we can come back
> to a reasoned approach.
>
> Anyway, the two independent (but related) decisions we need to make are:
> 1) do we keep QTextCodec in QtCore?
> 2) do we want to change how we handle legacy (non-UTF8) locales?
>
> For #2, the sub-questions of the OP apply:
>   a) What should Qt 6 assume the locale to be, if no locale is set?
>   b) In case a non-UTF-8 locale is set, what should we do?
>   c) Should we propagate our decision to child processes?
>
> My preferences were:
>   a) C.UTF-8
>   b) override it to force UTF-8 on the same locale
>   c) yes


How about

a) either C / C.UTF-8, but warning the user; but I'd up the ante, and 
say: just assert/crash.


b) keep the choice. Silently changing it sounds like a bad idea; we 
should never override the user choices silently.


c) no. We shouldn't "fix" subprocesses. They have the right to make 
their own independent decisions.



> But I think we should. My arguments are that UTF-8 locales are the default in
> all desktop Linux distributions, all BSDs and on macOS and have been for 15
> years. Most embedded systems from the last 5 years at least also have i

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-17 Thread Thiago Macieira

On Sunday, 17 November 2019 13:19:27 CET Kevin Kofler wrote:
> Please be warned that C.UTF-8 is a recent introduction. (Has upstream glibc
> even accepted it yet?) So setting the locale to C.UTF-8 will produce warning
> spam or even fatal errors (depending on the application) on many older
> distributions and possibly even on some current ones. (E.g., Fedora has
> introduced this in Fedora 24 and in updates to Fedora 22 and 23. I don't
> know whether this was backported to RHEL releases up to RHEL 7. RHEL 8 has
> probably inherited it from recent Fedora, at least.)

Given that Fedora 31 is current, Fedora 24 is 3 years old. It's probably old 
enough.

And Python sets LANG to it if the environment is unset.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-17 Thread Kevin Kofler

Thiago Macieira wrote:
> 2) QtCore size
> As I said above, removing the legacy codecs we have code for is not a
> problem. They are already disabled in Qt builds where ICU is present, so
> we'd additionally remove them from all other builds. Where ICU is present,
> there's no loss of functionality for user applications, since ICU provides
> far more codecs than we do. For those without ICU, it stands to reason
> that the user chose size so they are aware of the limitations. Plus, one
> can always instantiate their own QTextCodec and add to the list (at least,
> with today's implementation).

Isn't ICU already a hard requirement on *nix? Since we are talking about 
locales on *nix systems only, we should be able to assume a Qt build with 
ICU, shouldn't we?

> Turns out, there's one locale that we can be sure that its non-UTF-8
> default is decodable under UTF-8 and that's the "C" locale. So we don't
> *have* to qputenv "C.UTF-8" if the locale is explicitly "C" (as opposed to
> being unset).
> 
> But I think we should.

Please be warned that C.UTF-8 is a recent introduction. (Has upstream glibc 
even accepted it yet?) So setting the locale to C.UTF-8 will produce warning 
spam or even fatal errors (depending on the application) on many older 
distributions and possibly even on some current ones. (E.g., Fedora has 
introduced this in Fedora 24 and in updates to Fedora 22 and 23. I don't 
know whether this was backported to RHEL releases up to RHEL 7. RHEL 8 has 
probably inherited it from recent Fedora, at least.)

Kevin Kofler


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira

Hi

Sorry, it looks like this thread is not progressing in a calm and reasoned 
manner, the way it was meant to be. And I'm very much to blame. So I apologise 
for the strong language and passionate opinions. I'm deleting most of what I 
had written as a reply so we can start over.

Let's start with your questions:

On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
> You have not yet answered
> 
>   - why this decision was made

You know, I don't know. To be frank, I don't know that a decision *was* made. 
It all started with a change (see OP) about removing QTextCodec from the API 
and from QtCore. It seemed reasonable enough but it turned up quite a few 
kinks that hadn't been predicted. One of them, which may still be a 
showstopper, is QXmlStreamReader's inability to handle XML data encoded in 
anything except UTF-8, though a thorough search of all XML files in my system 
turned up exactly zero such files.

I don't know why QTextCodec is being removed. I don't remember any decisions 
in prior QtCS or this mailing list about removing it. We definitely discussed 
removing the CJK codecs and their big tables and that can still be done, with 
no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have 
discussed removing it, but I don't remember a firm decision. And even if it is 
firm, after looking at the consequences of doing so, we may want to reverse 
our decision.

Related to that is the discussion of whether UTF-8 is the only acceptable 
locale on Unix systems. If we don't have QTextCodec, then we have to have 
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8. 
But even if we do have QTextCodec, that's still a reasonable question: should
we assume it is UTF-8? And should we enforce it? Those were the questions in my
OP.

>   - who did it

Considering I don't know a decision *was* made, I don't think we can say who 
made it.

>   - what the actual problem to solve was

Three things being tackled, all related:

1) QTextCodec in the API
I think we cannot do without it, it'll have to stay in one way or another. So 
the question reduces to whether it should stay in QtCore or be moved to 
another library. Given the QXmlStreamReader problem above, it's probably best 
to keep it in QtCore, actually.

QTextCodec has some API limitations but they can be fixed. It's not necessary 
for us to remove it: it's not *that* broken.

2) QtCore size
As I said above, removing the legacy codecs we have code for is not a problem. 
They are already disabled in Qt builds where ICU is present, so we'd 
additionally remove them from all other builds. Where ICU is present, there's 
no loss of functionality for user applications, since ICU provides far more 
codecs than we do. For those without ICU, it stands to reason that the user 
chose size so they are aware of the limitations. Plus, one can always 
instantiate their own QTextCodec and add to the list (at least, with today's 
implementation).

If QTextCodec is not in QtCore, then most likely you can't affect how QtCore 
and almost all other Qt classes decode 8-bit data into QString, including 
QTextStream.

and 3) misconfigured locale systems and filename handling
This is probably the biggest problem. As it is right now, when the locale 
isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode 
any file names with the 8th bit set. Those file names are considered 
filesystem corruption. And yet they are quite commonly created by the user 
outside of English-speaking jurisdictions.
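
To illustrate the distinction: a file name is safe under the C locale only if it is pure 7-bit ASCII, whereas under a UTF-8 locale any well-formed UTF-8 byte sequence can be decoded. The sketch below shows both checks; isAscii and isValidUtf8 are hypothetical helper names, not Qt functions, and the validator is a plain strict one (it rejects overlong forms, surrogates and values above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// True if no byte has the 8th bit set, i.e. decodable even in the C locale.
bool isAscii(const std::string &name) {
    for (unsigned char c : name) {
        if (c & 0x80)
            return false;
    }
    return true;
}

// True if the bytes form well-formed UTF-8.
bool isValidUtf8(const std::string &name) {
    const std::size_t n = name.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(name[i]);
        if (lead < 0x80) {  // single-byte (ASCII) character
            ++i;
            continue;
        }
        std::size_t len;
        unsigned long cp;   // code point being assembled
        unsigned long min;  // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > n)
            return false;   // sequence is cut short
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(name[i + k]);
            if ((cont & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;   // overlong, out of range, or surrogate
        i += len;
    }
    return true;
}
```

For example, the name "á" stored as the two bytes 0xC3 0xA1 fails isAscii but passes isValidUtf8, while the same letter stored as the single Latin-1 byte 0xE1 fails both.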

Your example of setting LC_ALL (or another environment variable) to force the 
locale to print something that either can be parsed or shared is one such 
problematic scenario. On one hand, you may need it to get some older tools to 
parse output; on the other, it makes Qt applications unable to even see some 
files exist.

>   - why LC_*ALL* comes into play

Because it's the override. If we decide to override and LC_ALL is set, then we 
have no choice but to override it. If it is unset, then we can leave it unset 
too, but may need to override LC_CTYPE.
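
The rule in the paragraph above can be written down as a small decision function. This is a sketch only: Env is a plain map standing in for the process environment, and isSet, ctypeSource and variableToOverride are hypothetical names. It follows the POSIX convention that an empty value counts as unset and that LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG.

```cpp
#include <initializer_list>
#include <map>
#include <string>

// Stand-in for the process environment, so the logic can be tested in isolation.
using Env = std::map<std::string, std::string>;

// POSIX treats an empty value the same as an unset variable.
bool isSet(const Env &env, const std::string &name) {
    const auto it = env.find(name);
    return it != env.end() && !it->second.empty();
}

// Name of the variable that currently decides the character type locale,
// or an empty string if none is set.
std::string ctypeSource(const Env &env) {
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        if (isSet(env, name))
            return name;
    }
    return "";
}

// Which variable an override would have to touch: LC_ALL if it is set
// (nothing else can win against it), otherwise LC_CTYPE is enough.
std::string variableToOverride(const Env &env) {
    return isSet(env, "LC_ALL") ? "LC_ALL" : "LC_CTYPE";
}
```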

> I get the impression that this thread was not started as an RFC for an
> open-ended discussion, but as a staged attempt to provide a figleaf for
> a pre-determined decision.

That was not the intention. That's why I am re-starting it so we can come back 
to a reasoned approach.

Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change how we handle legacy (non-UTF8) locales?

For #2, the sub-questions of the OP apply:
 a) What should Qt 6 assume the locale to be, if no locale is set?
 b) In case a non-UTF-8 locale is set, what should we do?
 c) Should we propagate our decision to child processes?

My preferences were:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes

The reason for my preference in propagating to child processes is so that we 
have a consistent protocol between

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread André Pönitz

On Fri, Nov 15, 2019 at 05:47:04PM -0800, Thiago Macieira wrote:
> On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > > The questions are:
> > > 1) do we want to prevent another library from accidentally unsetting it?
> > > 2) do we want child processes to use the same?
> > > 
> > > Note the answers for both questions must be the same, for the solution is
> > > the same. So either both yeses or both nos.
> > 
> > This "answers for both questions must be the same" requirement is arbitrary.
> > 
> > The fact that one known solution results in same answers to both is in
> > no way proof that no other solutions exist.
> 
> I don't see how to prevent another library doing setlocale(LC_ALL, "") from 
> not overriding Qt's default other than to make setlocale(LC_ALL, "") do what 
> we want. Since what it does is read the environment, the only solution is to 
> change the environment.

You haven't even explained why this prevention would be needed, what exact
bad would happen if you don't do that, and you cannot prevent the other library
from setting an explicit locale anyway.

With modifying the environment, you just catch the "" case, one out of many,
and I'll continue to argue that it's not Qt's business to try even that.

> > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > > can either deal with binary data or with UTF-8 text, there's no middle
> > > ground.
> > Now that's an interesting twist.
> > 
> > The latest memo I did (not...) get was that codecs are to be moved into a
> > separate module. Which is actually ok, as it allows user code using codecs
> > to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> > + win".
> 
> Sure. But that's no different than using ICU or writing your own code to 
> convert from binary to text. QString will not support it on its own.

> 
> > "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> > definitely news to me. I've not seen this being discussed, neither here nor
> > within the part of the company that I usually talk to.
> 
> You just said yourself, above.

I did not say that.

> If QTextCodec moves to another library, we have  no codecs in QtCore.

Not having codecs in QtCore does not mean QtCore cannot use codecs.

One could have a setup where Qt Core just has the bare minimum, with stubs
for other codecs that are used when that QtCodecs lib is linked.

Actually, I had expected something like that to be the targeted
solution once I heard that text codecs were moving out of QtCore.

> > So when and where was this decision made, by whom, and why?
> > 
> > Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> > codecs in some cases and did that person come to the conclusion that any
> > such use is bad and deserves to die?
> 
> Probably not. Why does Qt Creator need other codecs?

My guess would be to handle code bases that are not (a subset of) UTF-8.
 
> > > you're arguing that there are broken applications that won't handle
> > > C.UTF-8 correctly, without giving a single example.
> > 
> > ... is of course not true:
> > 
> > 1. I did not claim there were "broken" applications that won't handle
> >    C.UTF-8 "correctly", I claimed that there are applications that react
> >    differently to C.UTF-8.
> 
> Different behaviour is *exactly* what we want. We want this:

Who is 'we'?

> $ LC_ALL=C.UTF-8 ls á
> ls: cannot access 'á': No such file or directory
> 
> not this:
> 
> $ LC_ALL=C ls á
> ls: cannot access ''$'\303\241': No such file or directory

If you do not touch the environment, the user gets what he asked for.

He will most likely want not to see ''$'\303\241, but if he explicitly asks
for it in the environment he sets up, it's not Qt's job to override this.

> I thought the argument would be that despite being what we wanted,

Who is 'we'?

> it would break certain scenarios. But I haven't seen any examples of breakage.
> 
> >  gcc produces different output under C and C.UTF-8:
> > 
> >  echo x | LC_CTYPE=C gcc -xc -
> >   <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> >   at end of input
> > 
> >  echo x | LC_CTYPE=C.UTF-8 gcc -xc -
> >   <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> >   at end of input
> > 
> >  As an additional twist, this different behaviour does not require fancy
> >  input, input is plain ASCII in both cases.
> > 
> >  Output parsers expecting "'" e.g. to produce a set of recommendations how
> >  to quick-fix such problems in an IDE will break.
> 
> Any application that is parsing GCC output is already setting LC_ALL in the 
> child process's environment.

Not necessarily, and if so, it's rather 'C', not 'C.UTF-8'.

> Otherwise, they'd be getting possibly translated 
> messages and we all know that the order of the messages could be different. 
> Not to mention that instead of "" or even “” we could see «» or „“.
 
Also, the point here is not that particular case. Each

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira

On Friday, 15 November 2019 00:52:55 PST Eike Ziller wrote:
> - You state that as if that were a fact imposed on us from some external
> entity, and as if that patch were already in.

No, but that's the direction that started this conversation. If we're not 
going to do that, then the entire discussion is moot.

> - I thought QTextCodec will
> still be available, even if from a separate module. If that plan has
> changed, provide a patch for Qt Creator as well.

It will, but we'll probably need a session next week to discuss in what form.
If we remove the codecs we kept and only use ICU, then QTextCodec will have
negligible cost and could stay in QtCore.

If it stays in QtCore, we still have a question whether QString::fromLocal8Bit 
shall assume it's UTF-8 on Unix systems.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira

On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > The questions are:
> > 1) do we want to prevent another library from accidentally unsetting it?
> > 2) do we want child processes to use the same?
> > 
> > Note the answers for both questions must be the same, for the solution is
> > the same. So either both yeses or both nos.
> 
> This "answers for both questions must be the same" requirement is arbitrary.
> 
> The fact that one known solution results in same answers to both is in
> no way proof that no other solutions exist.

I don't see how to prevent another library doing setlocale(LC_ALL, "") from 
not overriding Qt's default other than to make setlocale(LC_ALL, "") do what 
we want. Since what it does is read the environment, the only solution is to 
change the environment.
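
As a sketch of why the environment is the lever here: setlocale(LC_ALL, "") re-reads the LC_* and LANG variables every time it is called, so a value exported with setenv() is what any later caller in the same process picks up. The helper name adoptLocaleViaEnvironment below is hypothetical, and setenv() is POSIX rather than ISO C++.

```cpp
#include <clocale>
#include <stdlib.h>  // setenv() is POSIX, declared in <stdlib.h>
#include <string>

// Exports LC_ALL and then lets the C library pick the locale up from the
// environment, the same way another library calling setlocale(LC_ALL, "")
// would. Returns the locale name reported by setlocale(), or an empty
// string if the requested locale is not available on this system.
std::string adoptLocaleViaEnvironment(const char *name) {
    if (setenv("LC_ALL", name, 1) != 0)  // 1 = overwrite an existing value
        return "";
    const char *result = std::setlocale(LC_ALL, "");
    return result ? result : "";
}
```

A later setlocale(LC_ALL, "") issued by any other library in the process would see the same LC_ALL value, and so would child processes, since they inherit the modified environment. A library that passes an explicit locale name instead of "" is not affected by this.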

> > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > can either deal with binary data or with UTF-8 text, there's no middle
> > ground.
> Now that's an interesting twist.
> 
> The latest memo I did (not...) get was that codecs are to be moved into a
> separate module. Which is actually ok, as it allows user code using codecs
> to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> + win".

Sure. But that's no different than using ICU or writing your own code to 
convert from binary to text. QString will not support it on its own.

> "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> definitely news to me. I've not seen this being discussed, neither here nor
> within the part of the company that I usually talk to.

You just said so yourself, above. If QTextCodec moves to another library, we have
no codecs in QtCore. That means the rest of Qt will not support other codecs.

> So when and where was this decision made, by whom, and why?
> 
> Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> codecs in some cases and did that person come to the conclusion that any
> such use is bad and deserves to die?

Probably not. Why does Qt Creator need other codecs?

> > you're arguing that there are broken applications that won't handle
> > C.UTF-8 correctly, without giving a single example.
> 
> ... is of course not true:
> 
> 1. I did not claim there were "broken" applications that won't handle
>    C.UTF-8 "correctly", I claimed that there are applications that react
>    differently to C.UTF-8.

Different behaviour is *exactly* what we want. We want this:

$ LC_ALL=C.UTF-8 ls á
ls: cannot access 'á': No such file or directory

not this:

$ LC_ALL=C ls á
ls: cannot access ''$'\303\241': No such file or directory

I thought the argument would be that despite being what we wanted, it would 
break certain scenarios. But I haven't seen any examples of breakage.

>  gcc produces different output under C and C.UTF-8:
> 
>  echo x | LC_CTYPE=C gcc -xc -
>   <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
>   at end of input
> 
>  echo x | LC_CTYPE=C.UTF-8 gcc -xc -
>   <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
>   at end of input
> 
>  As an additional twist, this different behaviour does not require fancy
>  input, input is plain ASCII in both cases.
> 
>  Output parsers expecting "'" e.g. to produce a set of recommendations how
>  to quick-fix such problems in an IDE will break.

Any application that is parsing GCC output is already setting LC_ALL in the 
child process's environment. Otherwise, they'd be getting possibly translated 
messages and we all know that the order of the messages could be different. 
Not to mention that instead of "" or even “” we could see «» or „“.

Changing the environment of a child process is not going to go away.
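
For parsers that cannot control the child's environment, one defensive option is to normalise the typographic quotes before matching. This is a sketch of that idea, not something proposed in this thread; replaceAll and normalizeQuotes are hypothetical helpers, and the input is assumed to be UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Replaces every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string &from, const std::string &to) {
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the replacement
    }
    return text;
}

// Maps the UTF-8 encodings of U+2018 and U+2019 (the curly single quotes
// GCC prints in UTF-8 locales) to the ASCII apostrophe.
std::string normalizeQuotes(std::string line) {
    line = replaceAll(std::move(line), "\xE2\x80\x98", "'");  // U+2018
    line = replaceAll(std::move(line), "\xE2\x80\x99", "'");  // U+2019
    return line;
}
```

This only covers the quote characters; translated message text would still need the locale to be pinned in the child's environment, as discussed above.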

If you're telling me that you're setting the environment before the Qt 
application to cope with its brokenness, I will ask why that application 
hasn't been fixed in the 16 years since UTF-8 environments became a thing. And 
we can provide a way to force Qt not to set the environment, for those weird 
cases where you must deal with broken, proprietary cr#p that won't be fixed 
until the heat death of the Universe. And I will ask why everyone else must 
pay a performance price for the sake of those old, broken applications that 
even the maintainer isn't fixing anymore?

>  #include <locale.h>
>  #include <stdlib.h>
>  #include <string.h>
> 
>  int main()
>  {
>  if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
>  abort();
>  }
> 
>  runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

Strawman example, this doesn't happen in reality. See my exhaustive search for 
all such checks in an entire Linux distribution. I'm asking for *real* 
situations.

>  While contrived in this form, there _is_ code even in Creator checking
> for "C" literally, raising the suspicion that this might happen in other
> applications, too.

Oh, checking for "C" literally does exist, there were several in my search. 
About half of

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-15 Thread André Pönitz

On Thu, Nov 14, 2019 at 11:20:08PM -0800, Thiago Macieira wrote:
> On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote:
> > *Within* a Qt application consisting of Qt library, other libraries,
> > and actual user code it's mildly presumptuous for one library to impose
> > random unnecessary restrictions on user code and other libraries.
>
> That boat sailed 20 years ago when we started calling setlocale() from
> QCoreApplication. We set the locale, period.

1. I was referring to putenv, not setlocale.

2. Even for setlocale, the point is not _whether_ it is called, but _how_.
   setlocale(..., 0) e.g. only queries, does not change anything.

   QCoreApplication currently calls setlocale(LC_ALL, "").

   This is fine. This accepts the user's choice of environment as authoritative.

   It also works well in practice. I can run something like

   LC_PAPER=de_LU LC_TIME=en_US.UTF-8 LC_COLLATE=C qtcreator

   and it will not only "just work" for the application itself, but
   also be properly passed on to e.g. a terminal started from within.

   So no boat has sailed, let alone 20 years ago.

   The boat _will_ sail once you put a non-empty string there,
   overriding the user's choice.
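
The distinction between querying and setting can be sketched in a few lines of C. The function names below are invented for the illustration; only setlocale() itself is real. A null locale argument is a pure query, while the empty string adopts whatever LC_ALL, the LC_* variables and LANG request.

```c
#include <locale.h>
#include <string.h>

/* Copies the name of the current LC_CTYPE locale into `buf` without
 * changing anything. Returns 0 on success, -1 if it does not fit. */
int current_ctype(char *buf, size_t size)
{
    const char *name = setlocale(LC_CTYPE, NULL);   /* query only */
    if (name == NULL || strlen(name) >= size)
        return -1;
    strcpy(buf, name);
    return 0;
}

/* Adopts the user's environment for all categories, which is what the
 * call with an empty string does. Returns 0 on success, -1 on failure. */
int adopt_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL ? 0 : -1;
}
```

Before any call to adopt_user_locale() a C program runs in the "C" locale, so current_ctype() reports "C" regardless of the environment.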

> The questions are:
> 1) do we want to prevent another library from accidentally unsetting it?
> 2) do we want child processes to use the same?
>
> Note the answers for both questions must be the same, for the solution is the
> same. So either both yeses or both nos.

This "answers for both questions must be the same" requirement is arbitrary.

The fact that one known solution results in same answers to both is in 
no way proof that no other solutions exist.

But it looks like there's no need to discuss _that_, as my answers are
"no" and "no". 

> > Making assumptions on the controllability of content of an input stream is
> > questionable. The proposed method of changing the environment for child
> > processes is no guarantee on what the child actually produces, and the
> > Qt application still has to be prepared to handle non-Utf-8 or otherwise
> > "broken" input.
>
> Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can
> either deal with binary data or with UTF-8 text, there's no middle ground.

Now that's an interesting twist.

The latest memo I did (not...) get was that codecs are to be moved into a
separate module. Which is actually ok, as it allows user code using codecs to
live on with minimal changes, and makes QtCore slimmer, kind of "no-loss + win".

"Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
definitely news to me. I've not seen this being discussed, neither here nor
within the part of the company that I usually talk to.

So when and where was this decision made, by whom, and why?

Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
codecs in some cases and did that person come to the conclusion that any
such use is bad and deserves to die?

> > This discussion so far claimed the existence of a range of problems
> > without giving an actual example. Then it goes on to propose a shotgun
> > approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings"
> > like categories that are a bit more fine grained than LC_ALL?  Bear with
> > me when I do not have the impression that Qt will be the right context
> > to accept such "obligations".
>
> The same argument can be made for your statements:

Sure, one could do that.

But that would not make _my_ argument go away, nor compensate for the current
lack of answers to the questions I asked.

And ...

> you're arguing that there are broken applications that won't handle
> C.UTF-8 correctly, without giving a single example.

... is of course not true:

1. I did not claim there were "broken" applications that won't handle
   C.UTF-8 "correctly", I claimed that there are applications that react
   differently to C.UTF-8. 

2. I _did_ give two examples. I can repeat here:

   2.1) https://lists.qt-project.org/pipermail/development/2019-November/037815.html

 gcc produces different output under C and C.UTF-8:

 echo x | LC_CTYPE=C gcc -xc -
 <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__' at end of input

 echo x | LC_CTYPE=C.UTF-8 gcc -xc -
 <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ at end of input

 As an additional twist, this different behaviour does not require fancy
 input, input is plain ASCII in both cases.

 Output parsers expecting "'" e.g. to produce a set of recommendations on how
 to quick-fix such problems in an IDE will break.

   2.2) https://lists.qt-project.org/pipermail/development/2019-November/037810.html

 #include <locale.h>
 #include <stdlib.h>
 #include <string.h>

 int main()
 {   
 if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
 abort();
 }

 runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

 While contrived in this form, there _is_

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-15 Thread Eike Ziller



> On 15. Nov 2019, at 08:20, Thiago Macieira  wrote:
> 
> On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote:
>> *Within* a Qt application consisting of Qt library, other libraries,
>> and actual user code it's mildly presumptuous for one library to impose
>> random unnecessary restrictions on user code and other libraries.
> 
> That boat sailed 20 years ago when we started calling setlocale() from 
> QCoreApplication. We set the locale, period.
> 
> The questions are:
> 1) do we want to prevent another library from accidentally unsetting it?
> 2) do we want child processes to use the same?
> 
> Note the answers for both questions must be the same, for the solution is the 
> same. So either both yeses or both nos.
> 
>> Making assumptions on the controllability of content of an input stream is
>> questionable. The proposed method of changing the environment for child
>> processes is no guarantee on what the child actually produces, and the
>> Qt application still has to be prepared to handle non-Utf-8 or otherwise
>> "broken" input.
> 
> Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can 
> either deal with binary data or with UTF-8 text, there's no middle ground.

- You state that as if that were a fact imposed on us from some external
  entity, and as if that patch were already in.
- I thought QTextCodec will still be available, even if from a separate module.
  If that plan has changed, provide a patch for Qt Creator as well.

> 
>> This discussion so far claimed the existence of a range of problems
>> without giving an actual example. Then it goes on to propose a shotgun
>> approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings"
>> like categories that are a bit more fine grained than LC_ALL?  Bear with
>> me when I do not have the impression that Qt will be the right context
>> to accept such "obligations".
> 
> The same argument can be made for your statements: you're arguing that there
> are broken applications that won't handle C.UTF-8 correctly, without giving
> a single example.
> 
> I think the whole problem is that we're trying to talk about broken
> applications and the way their brokenness manifests itself. I don't think
> such applications exist anymore in occurrence sufficient for us to deal with.
> 
> Anyway, since you oppose setting the environment, let's just make a check for 
> assumption:
> 
> if (locale is not UTF-8)
>     qFatal("Qt only supports UTF-8 locales. "
>            "Please configure your system properly");
> 

-- 
Eike Ziller
Principal Software Engineer

The Qt Company GmbH
Erich-Thilo-Straße 10
D-12489 Berlin
eike.zil...@qt.io
http://qt.io
Geschäftsführer: Mika Pälsi,
Juha Varelius, Mika Harjuaho
Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 144331 B


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-14 Thread Thiago Macieira

On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote:
> *Within* a Qt application consisting of Qt library, other libraries,
> and actual user code it's mildly presumptuous for one library to impose
> random unnecessary restrictions on user code and other libraries.

That boat sailed 20 years ago when we started calling setlocale() from 
QCoreApplication. We set the locale, period.

The questions are:
1) do we want to prevent another library from accidentally unsetting it?
2) do we want child processes to use the same?

Note the answers for both questions must be the same, for the solution is the 
same. So either both yeses or both nos.

> Making assumptions on the controllability of content of an input stream is
> questionable. The proposed method of changing the environment for child
> processes is no guarantee on what the child actually produces, and the
> Qt application still has to be prepared to handle non-Utf-8 or otherwise
> "broken" input.

Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can 
either deal with binary data or with UTF-8 text, there's no middle ground.

> This discussion so far claimed the existence of a range of problems
> without giving an actual example. Then it goes on to propose a shotgun
> approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings"
> like categories that are a bit more fine grained than LC_ALL?  Bear with
> me when I do not have the impression that Qt will be the right context
> to accept such "obligations".

The same argument can be made for your statements: you're arguing that there 
are broken applications that won't handle C.UTF-8 correctly, without giving a 
single example.

I think the whole problem is that we're trying to talk about broken 
applications and the way their brokenness manifests itself. I don't think such 
applications exist anymore in occurrence sufficient for us to deal with.

Anyway, since you oppose setting the environment, let's just make a check for 
assumption:

if (locale is not UTF-8)
qFatal("Qt only supports UTF-8 locales. "
   "Please configure your system properly");

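For illustration only, the "locale is not UTF-8" test in that pseudocode could be approximated in C along these lines. This is a sketch under assumptions, not what Qt does: both function names are invented here, and the accepted spellings ("UTF-8", "utf8" and case variants) are a guess at what different C libraries report from nl_langinfo(CODESET).

```c
#include <ctype.h>
#include <langinfo.h>
#include <stddef.h>

/* Returns 1 if `codeset` spells UTF-8 in any common variant ("UTF-8",
 * "utf8", "Utf-8", ...), ignoring case and hyphens, and 0 otherwise. */
int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "utf8";
    size_t k = 0;
    for (const char *p = codeset; *p != '\0'; p++) {
        if (*p == '-')
            continue;
        if (k >= 4 || tolower((unsigned char)*p) != want[k])
            return 0;
        k++;
    }
    return k == 4;
}

/* Reports whether the locale currently in effect uses UTF-8. The caller is
 * assumed to have run setlocale(LC_ALL, "") or similar beforehand. */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

An application following the proposal would call current_locale_is_utf8() after its setlocale() call and bail out with a fatal message when it returns 0.
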
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-14 Thread André Pönitz

On Thu, Nov 14, 2019 at 12:10:24PM +0100, Mathias Hasselmann wrote:
> 
> On 03.11.2019 at 06:35, André Pönitz wrote:
> > I am all for not propagating Qt's UTF-8 choice to child processes at all.
> 
> "Write once, compile/run everywhere" mandates Qt enforcing a maximum level
> of homogeneity within our Qt applications.

*Within* a Qt application consisting of Qt library, other libraries,
and actual user code it's mildly presumptuous for one library to impose
random unnecessary restrictions on user code and other libraries.

I am running firefox in parallel, currently with 174 shared objects
loaded. I don't think it will improve overall firefox user experience
if the authors of said 174 libraries decide to impose their views on what
is good code and what is bad code on the other 173 participants in
the game.

And even if people agreed on using UTF-8 inside an application - and I
wouldn't disagree - this does not warrant changing the environment.

> That extends to the input and output streams of the child processes
> our applications deal with.

Making assumptions on the controllability of content of an input stream is
questionable. The proposed method of changing the environment for child
processes is no guarantee on what the child actually produces, and the
Qt application still has to be prepared to handle non-Utf-8 or otherwise
"broken" input.

So this is effectively snake oil.

> Not propagating Qt's UTF-8 choices seems like a violation of that
> principle of maximum homogeneity.

Which you just invented.

Apart from that we just broke "homogeneity", as now child processes
started from a Qt application behave differently than when started
otherwise (see the gcc quotes example with different results on pure
7-bit input).

> Hiding the complexity of obscure locale settings truly
> belongs to the heart of Qt's obligations in my opinion.

This discussion so far claimed the existence of a range of problems
without giving an actual example. Then it goes on to propose a shotgun
approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings"
like categories that are a bit more fine grained than LC_ALL?  Bear with
me when I do not have the impression that Qt will be the right context
to accept such "obligations".

Andre'

PS: Just seen: https://wiki.debian.org/Locale:

Warning!

Using LC_ALL is strongly discouraged as it overrides everything.
Please use it only when testing and never set it in a startup file. 

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-14 Thread Mathias Hasselmann



On 03.11.2019 at 06:35, André Pönitz wrote:
> On Sat, Nov 02, 2019 at 06:16:36PM +0100, Kevin Kofler wrote:
> > A true runtime option actually belongs in an environment variable, not in a
> > method that has to be called by the compiled code. (In fact, that's what I
> > would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but
> > apparently you were thinking of a preprocessor define.)
> >
> > Whether to propagate the locale to child processes is really a decision that
> > can and should be left to the user at runtime rather than compiling it
> > either into the application (as in André's proposal) or even into Qt itself
> > (as in your proposal).
>
> I am all for not propagating Qt's UTF-8 choice to child processes at all.


"Write once, compile/run everywhere" mandates Qt enforcing a maximum 
level of homogeneity within our Qt applications. That extends to the 
input and output streams of the child processes our applications deal 
with. Not propagating Qt's UTF-8 choices seems like a violation of that 
principle of maximum homogeneity. Hiding the complexity of obscure locale 
settings truly belongs to the heart of Qt's obligations in my opinion.


Ciao
Mathias


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 10:55:03 PST Thiago Macieira wrote:
> I'll do a full search on Clear Linux to see if there's any software that
> checks the return value of setlocale().

All "setlocale" calls.

First, the calls that go to strcmp: I found comparisons in gnulib and 
replacements for setlocale, which don't count (they're replacements for old 
systems Qt no longer [has never?] runs on). That left a couple of examples of 
exactly what you predicted:

glfw-3.3/src/x11_init.c:if (strcmp(setlocale(LC_CTYPE, NULL), "C") == 0)
https://github.com/glfw/glfw/blob/master/src/x11_init.c#L934-L942
hack around C not supporting wide-char, which wouldn't be needed if we set the 
environment

firefox-60.1.0/xpcom/build/XPCOMInit.cpp:  if (strcmp(setlocale(LC_ALL, nullptr), "C") == 0) {
https://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#337
the next line does setlocale(LC_ALL, "")

wxWidgets-3.1.2/src/common/intl.cpp:wxASSERT_MSG( strcmp(setlocale(LC_ALL, NULL), "C") == 0,
https://github.com/wxWidgets/wxWidgets/blob/master/src/common/intl.cpp#L1694
Appears to be Windows-specific.

The assignments are much more numerous (1700 of them in my listing). A lot of 
them are of the form:
  old_locale = setlocale(LC_xxx, NULL);
which I assume is later followed up by a setlocale(LC_xxx, old_locale). These 
cases are not relevant to us.

https://github.com/GNUAspell/aspell/blob/master/common/config.cpp#L549-L561
Needs to find the locale to know what language to apply spelling for and also 
how to decode the text. UTF-8 is supported.

http://git.savannah.gnu.org/cgit/bash.git/tree/locale.c
Aside from the check *for* UTF-8 in LC_CTYPE, the assignments are only 
checking for null pointers.

http://git.savannah.gnu.org/cgit/bison.git/tree/src/getargs.c#n446
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/system.h
Not relevant for us.

https://github.com/BOINC/boinc/blob/master/zip/zip/zip.c#L2214
Null check only, and checks for UTF-8

https://github.com/BOINC/boinc/blob/master/zip/unzip/unzip.c#L773
Not relevant, in #else for nl_langinfo

https://github.com/microsoft/cpprestsdk/blob/master/Release/src/utilities/asyncrt_utils.cpp
Win32 only

https://github.com/apple/cups/blob/master/cups/language.c
Handles UTF-8 just fine.

https://github.com/apple/cups/blob/master/cups/langprintf.c
Forces .UTF-8.

https://github.com/doxygen/doxygen/blob/master/qtools/qtextcodec.cpp#L508-L529
Trying to guess what QTextCodec to use for ru_RU.

https://git.enlightenment.org/core/efl.git/tree/src/modules/ecore_imf/xim/ecore_imf_xim.c#n832
Null check only. The rest of EFL is save/restore.

http://git.savannah.gnu.org/cgit/emacs.git/tree/src/sysdep.c#n4049
Null check only.

http://git.savannah.gnu.org/cgit/emacs.git/tree/src/sysdep.c#n4049
COULD mistake, as it does strcmp(locale, "C") then locale = "en"

https://github.com/GNOME/evince/blob/mainline/cut-n-paste/synctex/synctex_parser.c#L4384-L4399
Save/restore.

https://github.com/GNOME/evolution-data-server/blob/mainline/src/camel/camel-iconv.c#L218
Does compare to "C", but not a problem since the failing case uses nl_langinfo

https://github.com/GNOME/evolution-data-server/blob/mainline/src/addressbook/libedata-book/e-book-sqlite.c#L2891
Doesn't seem to be a problem.

https://github.com/GNOME/evolution/blob/mainline/src/e-util/e-xml-utils.c#L66
Just getting defaults.

https://github.com/fish-shell/fish-shell/blob/3.0.2/src/env.cpp#L373-L396
Comparing old to new. And no longer present in master.

https://github.com/fltk/fltk/blob/master/src/Fl_Native_File_Chooser_GTK.cxx#L445-L458
Save/restore, not thread-safe.

https://github.com/zenotech/fox-toolkit/blob/master/src/FXTranslator.cpp#L84
Commented out.

http://git.savannah.gnu.org/cgit/gawk.git/tree/support/dfa.c#n988
Not a problem, just checking if the locale is ASCII-compatible.

binutils-gdb/blob/master/readline/readline/nls.c
Seems fine too.

https://github.com/geany/geany/blob/master/src/libmain.c#L980-L987
Only used in debug output

https://github.com/fangq/gftp/blob/master/lib/protocols.c#L382-L395
Null-pointer check & logging

https://github.com/GNOME/glib/blob/mainline/glib/guniprop.c#L724
Safe

https://github.com/GNOME/glib/blob/mainline/glib/gtranslit.c#L293
Seems to be fine

https://github.com/GNOME/glib/blob/mainline/glib/gdate.c#L1057-L1065
Checking cached results

I'm stopping here.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

setlocale-grep.zst
Description: application/zstd

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 11:50:01 PST André Pönitz wrote:
> On Mon, Nov 04, 2019 at 11:38:07AM -0800, Thiago Macieira wrote:
> > On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> > > A parser accepting the output of one might or might not be able to
> > > handle the second.
> > 
> > A driver would set LC_ALL in the environment when it calls gcc.
> 
> Can we please take a step back and repeat for the slow thinker^H^H me
> what the benefit of forcing a UTF-8 locale on unknown child processes
> would be?

Two-fold:

1) it forces the UTF-8 locale on the *current* process, in case some other 
part of the same process does setlocale(LC_ALL, "") after QCoreApplication

2) it forces the child process to use the same locale as the parent Qt 
application

Since Qt will force itself to UTF-8, we want the Qt application to interpret
"Arquivo ou diretório inexistente"
instead of
"Arquivo ou diret�rio inexistente"

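To make the second rendering concrete: a child that emits Latin-1 sends the single byte 0xF3 for "ó", which is not a valid UTF-8 sequence, so a UTF-8 reader has to substitute a replacement character. The validator below is a generic sketch written for this illustration (is_valid_utf8 is an invented name, not a Qt function); it rejects truncated sequences, overlong forms, surrogates and values above U+10FFFF.

```c
#include <stddef.h>

/* Returns 1 if the `len` bytes at `s` form well-formed UTF-8, else 0. */
int is_valid_utf8(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];
        size_t need;            /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded value and smallest legal value */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0;           /* sequence cut off by end of input */
        for (size_t k = 1; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;           /* overlong, out of range or surrogate */
        i += need + 1;
    }
    return 1;
}
```

With this, the bytes of "diretório" encoded as UTF-8 (0xC3 0xB3 for the accented letter) pass, while the Latin-1 encoding of the same word (0xF3) fails.
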
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread André Pönitz

On Mon, Nov 04, 2019 at 11:38:07AM -0800, Thiago Macieira wrote:
> On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> > A parser accepting the output of one might or might not be able to
> > handle the second.
> 
> A driver would set LC_ALL in the environment when it calls gcc.

Can we please take a step back and repeat for the slow thinker^H^H me
what the benefit of forcing a UTF-8 locale on unknown child processes
would be?

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread André Pönitz

On Mon, Nov 04, 2019 at 10:55:03AM -0800, Thiago Macieira wrote:
> On Monday, 4 November 2019 10:29:16 PST André Pönitz wrote:
> > All but one do not let the UI user change the environment, i.e. the
> > environment is passed through the Qt UI process (so far). The one is
> > Qt Creator, but even there it is not possible to configure all child
> > processes, and would not be tolerable to tell users "When you create a
> > new run configuration remember to undo spurious environment changes done
> > by Qt".
> 
> It's highly unlikely you're running Qt Creator in a non-UTF-8 environment in 
> the first place.

*shrug*

   > locale | grep -q '=C$' && echo oops
   oops


> KDE has not supported such locales for 15 years.

I haven't tried to run KDE in earnest for about the same time. 

> If we were in 2004-2006 when this was recent and other Unix environments like 
> Solaris and HP-UX where non-UTF-8 could be still in use I could understand 
> the skepticism.
> 
> > 
> > There _are_ setups that _are_ set in stone, that are not connected
> > to anything and that don't give anything on updates, or do not even
> > have the possibility to be "fixed" or changed in any way.
> 
> Why are you inserting Qt 6 into them, then?

Because data generation and data visualization are different tasks, that
can, and perhaps should, be done in different processes, and while data
visualization occasionally might need to react to user demand, data generation
might not.

> > Looks contrived? [Check your hard disk before you answer.]
> 
> I'll do a full search on Clear Linux to see if there's any software that 
> checks the return value of setlocale().
> 
> > Potentially harmful behaviour should always be opt-in, not opt-out
> > (and never be non-configurable).
> 
> I don't disagree on the statement. I just disagree on whether it's harmful. 
> *Not* calling qputenv could be harmful too.

As mentioned in the second example, even "clean ASCII" 7 bit input produces
different results under "C.UTF-8" and "C":

 echo x | LC_ALL=C.UTF-8 gcc -xc -
 echo x | LC_ALL=C  gcc -xc -

Given that most parsers in the world are ad-hoc, chances are high that some
are based on looking for certain quotes, but not for others.

And even if someone knows that the immediate child processes are ok with
C.UTF-8, their children, grand children, ... might not.
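
To illustrate the parser concern: an ad-hoc parser that only looks for the ASCII apostrophe finds nothing in the C.UTF-8 output. A parser that wants to survive both has to know both quoting styles, roughly as in the sketch below. The function name is hypothetical and the code only handles the two styles shown in the gcc example; translated messages may use yet other quotes.

```c
#include <stddef.h>
#include <string.h>

/* Copies the first quoted token of `line` into `out`. Accepts ASCII '...'
 * as well as UTF-8 encoded U+2018 ... U+2019. Returns 0 on success and -1
 * if there is no complete quoted token or it does not fit into `size`. */
int first_quoted_token(const char *line, char *out, size_t size)
{
    static const char open_utf8[]  = "\xE2\x80\x98";   /* U+2018 */
    static const char close_utf8[] = "\xE2\x80\x99";   /* U+2019 */

    const char *ascii = strchr(line, '\'');
    const char *utf8 = strstr(line, open_utf8);
    const char *start, *end;

    if (utf8 != NULL && (ascii == NULL || utf8 < ascii)) {
        start = utf8 + 3;                   /* skip the 3-byte opening quote */
        end = strstr(start, close_utf8);
    } else if (ascii != NULL) {
        start = ascii + 1;
        end = strchr(start, '\'');
    } else {
        return -1;
    }

    if (end == NULL || (size_t)(end - start) >= size)
        return -1;
    memcpy(out, start, (size_t)(end - start));
    out[end - start] = '\0';
    return 0;
}
```

Fed either of the two gcc error lines from the example, this yields "=" as the first expected token.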

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> A parser accepting the output of one might or might not be able to
> handle the second.

A driver would set LC_ALL in the environment when it calls gcc.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread André Pönitz

On Mon, Nov 04, 2019 at 09:40:00AM +, Edward Welbourne wrote:
> Indeed, what program would have problems in C.UTF-8 yet have a
> non-Unicode locale in which it works nicely ?

Other example:

echo x |  LC_ALL="C.UTF-8" gcc -xc -

and

echo x |  LC_ALL="C" gcc -xc -

produce different output.

A parser accepting the output of one might or might not be able to
handle the second.

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 10:29:16 PST André Pönitz wrote:
> All but one do not let the UI user change the environment, i.e. the
> environment is passed through the Qt UI process (so far). The one is
> Qt Creator, but even there it is not possible to configure all child
> processes, and would not be tolerable to tell users "When you create a
> new run configuration remember to undo spurious environment changes done
> by Qt".

It's highly unlikely you're running Qt Creator in a non-UTF-8 environment in 
the first place. KDE has not supported such locales for 15 years.

If we were in 2004-2006 when this was recent and other Unix environments like 
Solaris and HP-UX where non-UTF-8 could be still in use I could understand 
the skepticism.

> 
> There _are_ setups that _are_ set in stone, that are not connected
> to anything and that don't give anything on updates, or do not even
> have the possibility to be "fixed" or changed in any way.

Why are you inserting Qt 6 into them, then?

> int main()
> {
> if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
> abort();
> }
> 
> Looks contrived? [Check your hard disk before you answer.]

I'll do a full search on Clear Linux to see if there's any software that 
checks the return value of setlocale().

> Potentially harmful behaviour should always be opt-in, not opt-out
> (and never be non-configurable).

I don't disagree on the statement. I just disagree on whether it's harmful. 
*Not* calling qputenv could be harmful too.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 09:29:41 PST Edward Welbourne wrote:
> On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
> > I want to do qputenv in the Qt application *itself*, inside
> > QCoreApplication. Note the most important process that this will apply
> > to: itself. It applies to all other frameworks inside the same
> > application that may inspect the environment, including an extra unknown
> > call to setlocale(LC_ALL, "").

> ... and we can do that just fine if we
> * record the prior value we're over-riding on some master object,
>   that also remembers the list of regexes;
> * call qputenv() exactly as you have in mind;
> * when about to start a sub-process, ask that master object if the
>   command name matches one of its regexes;
> * if it does, restore *for only it* (e.g. after fork()) the prior value.

That only applies to QProcess. It will not apply to third-party components 
that fork helper processes.

It's possible atfork() could do this, but I'm not sure. It won't catch all of 
them, especially those that prepare the environment before forking (like 
execve / execle's caller).
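
The record-and-restore idea quoted above can be sketched with POSIX setenv() and unsetenv(). The struct and function names are invented for this illustration and nothing here is Qt API; in the quoted scheme restore_env() would run in the child between fork() and exec, and only for programs on the exemption list. As noted, this cannot help with children started by code that never calls it.

```c
#define _POSIX_C_SOURCE 200809L   /* for setenv() and unsetenv() */
#include <stdlib.h>
#include <string.h>

/* Remembers what a variable held before it was overridden. */
struct saved_env {
    char *value;    /* heap copy of the old value, NULL if it was unset */
    int was_set;    /* 1 if the variable existed before the override */
};

/* Records the current value of `name` in `saved`, then sets it to `value`.
 * Returns 0 on success, -1 on failure. */
int override_env(const char *name, const char *value, struct saved_env *saved)
{
    const char *old = getenv(name);

    saved->was_set = (old != NULL);
    saved->value = NULL;
    if (old != NULL) {
        size_t len = strlen(old) + 1;
        saved->value = malloc(len);
        if (saved->value == NULL)
            return -1;
        memcpy(saved->value, old, len);
    }
    if (setenv(name, value, 1) != 0) {
        free(saved->value);
        saved->value = NULL;
        return -1;
    }
    return 0;
}

/* Puts `name` back to the state recorded in `saved`. */
int restore_env(const char *name, const struct saved_env *saved)
{
    if (saved->was_set)
        return setenv(name, saved->value, 1);
    return unsetenv(name);
}
```

The caller owns saved->value and frees it once no further restore is needed.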

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread André Pönitz

On Mon, Nov 04, 2019 at 09:40:00AM +, Edward Welbourne wrote:
> On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote:
> >> a) and b) are fine with me, "c) yes" sounds like a potential problem.
> >>
> >> Most of the child process I usually call are not Qt based,
> 
> That shouldn't matter.  Qt<6-based things and non-Qt things are all the
> same from the point of view of the contemplated change.
> 
> To what extent are these child programs started via a UI that lets the
> user set environment variables (as I assume all IDEs do for most of the
> commands they run) ?

All but one do not let the UI user change the environment, i.e. the
environment is passed through the Qt UI process (so far). The one is
Qt Creator, but even there it is not possible to configure all child
processes, and would not be tolerable to tell users "When you create a
new run configuration remember to undo spurious environment changes done
by Qt".

> Obviously, if some antique needs a special locale, that's no problem
> if it's started via a UI that lets one configure its environment,
> overriding what Qt might have set.

Even _if_ that UI would let the user configure the environment,
that's not an excuse.

> >> rather some random unrelated tools, in some cases even quite old
> >> random unrelated tools.
> 
> I read antiquity as tending to assume C locale, so unharmed by C.UTF-8,
> although some may be assuming an ISO Latin or similar legacy codec.
> All the same, so antique as to not grok Unicode at all is pretty old !
> You probably need to update it for security fixes, by now.

"Security reason, because it is old" must be Godwin's Law in 
"Always Online" times.


There _are_ setups that _are_ set in stone, that are not connected
to anything and that don't give anything on updates, or do not even
have the possibility to be "fixed" or changed in any way.

If Qt development does not want to care for these cases _even as
child processes_ that's fine in principle (even with me), but then
it would help to clearly communicate that fact to prevent accidents
in the selection of toolkits.


> Thiago Macieira (1 November 2019 22:49)
> > TBH, all the more reason for propagating the choice. Please remember
> > that on any modern Linux or macOS or FreeBSD, they are already running
> > with a UTF-8 locale. The most common scenario of our setting something
> > is when LC_ALL=C was set in the environment, which will cause us to
> > reset it to C.UTF-8.
> 
> Indeed, what program would have problems in C.UTF-8 yet have a
> non-Unicode locale in which it works nicely ?
> An example would help us to reason about this ...

The following works on all my setups (and, btw, with LC_ALL="C"
which I do _not_ use) and crashes with LC_ALL="C.UTF-8":

#include <locale.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    if (strcmp(setlocale(LC_COLLATE, ""), "C") != 0)
        abort();
}

Looks contrived? [Check your hard disk before you answer.]

Shotgun-changing environment for child processes is _not_ harmless.
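
Where a child's source can still be touched, a check like the one above could
accept the UTF-8 spellings of the C locale as well. The following is only a
minimal sketch: the helper name is_c_like_locale() and the list of accepted
spellings are assumptions made for illustration, and it does nothing for
binaries that cannot be rebuilt.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` denotes the C/POSIX locale,
 * with or without a UTF-8 codeset suffix, and 0 otherwise.
 * The list of spellings is an assumption, not an exhaustive inventory. */
int is_c_like_locale(const char *name)
{
    assert(name != NULL);
    static const char *const known[] = {
        "C", "POSIX", "C.UTF-8", "C.utf8"
    };
    for (size_t i = 0; i < sizeof known / sizeof known[0]; ++i) {
        if (strcmp(name, known[i]) == 0)
            return 1;
    }
    return 0;
}
```

A caller would pass it the string returned by setlocale(LC_COLLATE, "")
instead of comparing that string against "C" alone.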

>  ones, would not make the same choices. If we do not propagate, we
>  could end up with a child process (often helpers) that make
>  different choices as to what command-line arguments or pipes or
>  contents in files mean.
> 
> >> If we propagate we'll expose the child processes to locales they
> >> might not expect, in circumstances where the user of the system
> >> possibly intentionally chose a non-UTF8-locale to make exactly those
> >> child processes happy.
> 
> > True, but that was done at the expense of running Qt in a largely
> > unsupported and untested scenario. Setting the locale to C means we
> > can't access any file with an 8bit file name; setting to Latin1 would
> > allow that, but produce mojibake in GUI.
> 
> >> Effectively, going for "c) yes" deprives the user of a certain level
> >> of freedom that is needed, "c) no" is less intrusive.
> >>
> >> "c) no" as default and a simple one-liner opt-in for applications
> >> that want to engage in "strict parenting" might be an option, too.
> 
> > How about making the resetting opt-out, instead of opt-in?
> > QT_NO_OVERRIDE_LC_CTYPE?
> 
> Possibly its value could be:
> * all, 1, yes, true, .* - it applies to all child processes [*]; or
> * a list of regexes for program names to which it applies, when started
>   as child processes.

The syntax doesn't really matter, but the direction "opt-out" is wrong.

Potentially harmful behaviour should always be opt-in, not opt-out
(and never be non-configurable).

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Edward Welbourne

On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
>> * a list of regexes for program names to which it applies, when started
>>   as child processes.
>>
>> Or is that too hard to implement at all the places where we call exec()
>> and its equivalents ?

Thiago Macieira (4 November 2019 15:48)
> That's not at all what I wanted.
>
> I want to do qputenv in the Qt application *itself*, inside QCoreApplication.
> Note the most important process that this will apply to: itself. It applies to
> all other frameworks inside the same application that may inspect the
> environment, including an extra unknown call to setlocale(LC_ALL, "").

... and we can do that just fine if we
* record the prior value we're over-riding on some master object,
  that also remembers the list of regexes;
* call qputenv() exactly as you have in mind;
* when about to start a sub-process, ask that master object if the
  command name matches one of its regexes;
* if it does, restore *for only it* (e.g. after fork()) the prior value.
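
The steps listed above could be sketched in plain C roughly as follows. This
is not Qt code: struct locale_override and its three functions are
hypothetical names, LC_ALL stands in for whichever variable(s) would really
be overridden, and POSIX regcomp()/regexec() stand in for whatever regex
engine would be used.

```c
#define _POSIX_C_SOURCE 200809L   /* for setenv() and strdup() */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "master object": remembers the value that was overridden
 * and the patterns naming children that should get it back. */
struct locale_override {
    char *prior_lc_all;             /* copy of the old value, NULL if unset */
    const char *const *patterns;    /* POSIX extended regexes */
    size_t pattern_count;
};

/* Record the prior LC_ALL, then override it for this process. */
int locale_override_apply(struct locale_override *ov, const char *forced)
{
    assert(ov != NULL && forced != NULL);
    const char *old = getenv("LC_ALL");
    ov->prior_lc_all = old ? strdup(old) : NULL;
    return setenv("LC_ALL", forced, 1);
}

/* Returns 1 if `program` matches one of the recorded patterns. */
int locale_override_exempts(const struct locale_override *ov,
                            const char *program)
{
    assert(ov != NULL && program != NULL);
    for (size_t i = 0; i < ov->pattern_count; ++i) {
        regex_t re;
        if (regcomp(&re, ov->patterns[i], REG_EXTENDED | REG_NOSUB) != 0)
            continue;               /* skip patterns that do not compile */
        int hit = regexec(&re, program, 0, NULL, 0) == 0;
        regfree(&re);
        if (hit)
            return 1;
    }
    return 0;
}

/* Value LC_ALL should have in the child's environment; NULL means
 * "unset it". Intended to be consulted just before exec(). */
const char *locale_override_child_value(const struct locale_override *ov,
                                        const char *program,
                                        const char *forced)
{
    return locale_override_exempts(ov, program) ? ov->prior_lc_all : forced;
}
```

Keeping the decision in one function means every place that starts a child
only has to ask for the value, rather than re-implementing the matching.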

The default for everything else is then to see an environment with our
"correction" applied to the locale env var(s).

Eddy.

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Thiago Macieira

On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
> * a list of regexes for program names to which it applies, when started
>   as child processes.
> 
> Or is that too hard to implement at all the places where we call exec()
> and its equivalents ?

That's not at all what I wanted.

I want to do qputenv in the Qt application *itself*, inside QCoreApplication. 
Note the most important process that this will apply to: itself. It applies to 
all other frameworks inside the same application that may inspect the 
environment, including an extra unknown call to setlocale(LC_ALL, "").

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Kevin Kofler

Edward Welbourne wrote:
> [*] I'm fairly sure the actual Unix programs yes and true don't care
> about locale, so treating them as meaning .* would be harmless ...

GNU yes takes an optional string that it repeats instead of "y", so it does 
at least some string processing. I am not sure how it reacts if the 
characters are outside of the locale's character set.

In addition, both GNU yes and GNU true have --help and --version options 
that print translated strings.

Kevin Kofler


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Edward Welbourne

>>> "c) no" as default and a simple one-liner opt-in for applications that
>>> want to engage in "strict parenting" might be an option, too.

On Fri, Nov 01, 2019 at 02:49:36PM -0700, Thiago Macieira wrote:
>> How about making the resetting opt-out, instead of opt-in?
>> QT_NO_OVERRIDE_LC_CTYPE?

André Pönitz (2 November 2019 12:53)
> I was more thinking of a runtime option. Like
>
>   QCoreApplication::setPropagateOurChoices(true)
>
> Or am I missing some reason why this has to be a compile-time choice?

I interpreted Thiago as suggesting an environment variable to be
inspected at run-time, not a compile-time option.
Would an environment variable work for you ?
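
Read at run-time, such a check could look roughly like the sketch below. The
variable name QT_NO_OVERRIDE_LC_CTYPE is the one floated in this thread; that
"set to any non-empty value" means "suppress" is purely an assumption here,
as are the two function names.

```c
#include <stdlib.h>

/* Hypothetical rule: a value counts as "set" if it is present and
 * non-empty. */
int flag_value_is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Returns 1 if the override should be skipped, judged at run-time from
 * the environment of the running process. */
int lc_override_suppressed(void)
{
    return flag_value_is_set(getenv("QT_NO_OVERRIDE_LC_CTYPE"));
}
```

Because getenv() is consulted when the function is called, nothing about the
choice is fixed when the application or Qt is built.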

Eddy.

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-04 Thread Edward Welbourne

Thiago:
 My personal preference is:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes

Lars:
>>> I agree with all three choices.

On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote:
>> a) and b) are fine with me, "c) yes" sounds like a potential problem.
>>
>> Most of the child processes I usually call are not Qt based,

That shouldn't matter.  Qt<6-based things and non-Qt things are all the
same from the point of view of the contemplated change.

To what extent are these child programs started via a UI that lets the
user set environment variables (as I assume all IDEs do for most of the
commands they run) ?  Obviously, if some antique needs a special locale,
that's no problem if it's started via a UI that lets one configure its
environment, overriding what Qt might have set.

>> rather some random unrelated tools, in some cases even quite old
>> random unrelated tools.

I read antiquity as tending to assume C locale, so unharmed by C.UTF-8,
although some may be assuming an ISO Latin or similar legacy codec.
All the same, so antique as to not grok Unicode at all is pretty old !
You probably need to update it for security fixes, by now.

Thiago Macieira (1 November 2019 22:49)
> TBH, all the more reason for propagating the choice. Please remember
> that on any modern Linux or macOS or FreeBSD, they are already running
> with a UTF-8 locale. The most common scenario of our setting something
> is when LC_ALL=C was set in the environment, which will cause us to
> reset it to C.UTF-8.

Indeed, what program would have problems in C.UTF-8 yet have a
non-Unicode locale in which it works nicely ?
An example would help us to reason about this ...

 ones, would not make the same choices. If we do not propagate, we
 could end up with a child process (often helpers) that make
 different choices as to what command-line arguments or pipes or
 contents in files mean.

>> If we propagate we'll expose the child processes to locales they
>> might not expect, in circumstances where the user of the system
>> possibly intentionally chose a non-UTF8-locale to make exactly those
>> child processes happy.

> True, but that was done at the expense of running Qt in a largely
> unsupported and untested scenario. Setting the locale to C means we
> can't access any file with an 8bit file name; setting to Latin1 would
> allow that, but produce mojibake in GUI.

>> Effectively, going for "c) yes" deprives the user of a certain level
>> of freedom that is needed, "c) no" is less intrusive.
>>
>> "c) no" as default and a simple one-liner opt-in for applications
>> that want to engage in "strict parenting" might be an option, too.

> How about making the resetting opt-out, instead of opt-in?
> QT_NO_OVERRIDE_LC_CTYPE?

Possibly its value could be:
* all, 1, yes, true, .* - it applies to all child processes [*]; or
* a list of regexes for program names to which it applies, when started
  as child processes.

Or is that too hard to implement at all the places where we call exec()
and its equivalents ?

[*] I'm fairly sure the actual Unix programs yes and true don't care
about locale, so treating them as meaning .* would be harmless ...

Eddy.

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-03 Thread Thiago Macieira

On Saturday, 2 November 2019 22:35:00 PST André Pönitz wrote:
> Compiled opt-in per-application at least shifts the blame from Qt to the
> application vendor, compiled opt-in per-process environment leaves the blame
> still with the application vendor, but actually provides the possibility to
> do the right thing when it is known that the child actually _needs_ it.

When the parent process written in Qt knows that the child needs it, they must 
already be using QProcessEnvironment.

So when the user needed to do it, it's a bug in the Qt application. There are 
two scenarios:

1) when the child process needs en_US or C, because it was printing messages 
in another language, or far more commonly, it was using thousands and decimal 
separators other than those of English

2) when the child process needs a non-UTF-8 locale because it was confused by
UTF-8 multibyteness or was using that to print “fancy quotes”

The case (1) is not a problem if we override the environment to en_US.UTF-8 or 
C.UTF-8. Your scenario is restricted to case (2).
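
For case (1), what the child usually depends on is the numeric formatting,
which can be inspected with localeconv(). The sketch below uses a
hypothetical helper, decimal_point_is_dot(), and reports -1 when the named
locale is not available on the system; it is an illustration only.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `locale_name` uses "." as decimal
 * point, 0 if it uses something else, -1 if the locale cannot be set. */
int decimal_point_is_dot(const char *locale_name)
{
    assert(locale_name != NULL);

    /* Remember the current LC_NUMERIC so it can be restored afterwards. */
    char saved[128] = "C";
    const char *current = setlocale(LC_NUMERIC, NULL);
    if (current != NULL && strlen(current) < sizeof saved)
        strcpy(saved, current);

    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    int is_dot = strcmp(localeconv()->decimal_point, ".") == 0;
    setlocale(LC_NUMERIC, saved);
    return is_dot;
}
```

Calling it with "C", "C.UTF-8" and "en_US.UTF-8" on a given system shows
whether the three agree on the decimal point there.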

Do note that forcing the environment today, in Qt 5, has implications for the 
Qt application itself. It's just wrong to do so and I think the number of 
people doing that is fairly small. With this proposal, in Qt 6, the Qt 
application would run correctly. But that means that the overriding you're 
asking for is unlikely to exist *today*.

So if we're talking about the future, why is using an environment variable to 
suppress Qt's override not sufficient?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-02 Thread André Pönitz

On Sat, Nov 02, 2019 at 06:16:36PM +0100, Kevin Kofler wrote:
> A true runtime option actually belongs in an environment variable, not in a 
> method that has to be called by the compiled code. (In fact, that's what I 
> would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but 
> apparently you were thinking of a preprocessor define.)
> 
> Whether to propagate the locale to child processes is really a decision that 
> can and should be left to the user at runtime rather than compiling it 
> either into the application (as in André's proposal) or even into Qt itself 
> (as in your proposal).

I am all for not propagating Qt's UTF-8 choice to child processes at all.

Having that as opt-in on some level was an attempt to appease people who
think that's a good idea.

A configure option for Qt itself does not help as it keeps the question open
what the default setup will be. And given the circumstances that would
be "propagation".

Compiled opt-in per-application at least shifts the blame from Qt to the
application vendor, compiled opt-in per-process environment leaves the blame
still with the application vendor, but actually provides the possibility
to do the right thing when it is known that the child actually _needs_ it.

On the other hand, in those circumstances, this can already be done now
by normal fiddling with the child process environment.

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-02 Thread Kevin Kofler

Thiago Macieira wrote:
> Is your shell configured for German or for English? Try setting your
> locale to German and then see how long it will take for you to have to
> override when posting a question or an answer.

Unlike you, to get messages in English for human reading, I have been using 
en_US.UTF-8 rather than C for years (long before C.UTF-8 became a thing) 
exactly because of that.

> Except for the LC_ALL=C case for overriding the user's locale so that one
> can get messages and formatting in machine-parseable format. The normal
> case and this one probably account for over 99% of all scenarios.

For machine readability, there is probably a reason for picking C rather 
than en_US.UTF-8 or even C.UTF-8, e.g., to get ASCII quotes rather than the 
fancy Unicode quotes used under en_US.UTF-8.

>> > How about making the resetting opt-out, instead of opt-in?
>> > QT_NO_OVERRIDE_LC_CTYPE?
>> 
>> I was more thinking of a runtime option. Like
>> 
>>   QCoreApplication::setPropagateOurChoices(true)
> 
> I think a runtime option like that belongs in QProcessEnvironment.

A true runtime option actually belongs in an environment variable, not in a 
method that has to be called by the compiled code. (In fact, that's what I 
would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but 
apparently you were thinking of a preprocessor define.)

Whether to propagate the locale to child processes is really a decision that 
can and should be left to the user at runtime rather than compiling it 
either into the application (as in André's proposal) or even into Qt itself 
(as in your proposal).

Kevin Kofler


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-02 Thread Thiago Macieira

On Saturday, 2 November 2019 04:53:10 PDT André Pönitz wrote:
> > TBH, all the more reason for propagating the choice. Please remember that
> > on any modern Linux or macOS or FreeBSD, they are already running with a
> > UTF-8 locale.
> 
> With that argument we wouldn't even need to change the locale for the
> actual Qt application.
> 
> I think we are currently discussing the rare case where the Qt application
> is started with a non-UTF-8 locale, and the main question is whether this
> was some kind of accident that the Qt application should correct for their
> child processes or whether this was intentional.

Right. And the conclusion so far is that it is a mistake.

> As you said, any modern Linux or macOS or FreeBSD defaults to UTF-8, so
> chances are high that any deviation from that is actually intentional.

Except for the LC_ALL=C case for overriding the user's locale so that one can 
get messages and formatting in machine-parseable format. The normal case and 
this one probably account for over 99% of all scenarios.

> > The most common scenario of our setting something is when LC_ALL=C was
> > set in the environment, which will cause us to reset it to C.UTF-8.
> 
> I understand that, and even though I am not aware of an actual problem for
> my personal uses I am a bit reluctant to expose unsuspecting processes
> to a variable-length encoding they may not be aware of. At least there's
> a potential for buffer overruns here.

Is your shell configured for German or for English? Try setting your locale to 
German and then see how long it will take for you to have to override when 
posting a question or an answer.

$ ls á
ls: cannot access 'á': Arquivo ou diretório inexistente

$ gcc -xc /dev/null
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: /usr/
lib64/gcc/x86_64-suse-linux/9/../../../../lib64/crt1.o: na função "_start":
/home/abuild/rpmbuild/BUILD/glibc-2.30/csu/../sysdeps/x86_64/start.S:104: 
referência não definida para "main"
collect2: error: ld returned 1 exit status

$ gcc -xc /dev/null -lmain  
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: não 
foi possível localizar -lmain
collect2: error: ld returned 1 exit status


> Also, going from "C" to "C.UTF-8" might foil code checking for the
> string "C" explicitly in a child process.

True, though it's extremely unlikely anyone is doing that.

> > True, but that was done at the expense of running Qt in a largely
> > unsupported and untested scenario. Setting the locale to C means we can't
> > access any file with an 8bit file name; setting to Latin1 would allow
> > that, but produce mojibake in GUI.
> 
> Setting to "C" also "works" in practice when blobs are just read and written
> unmodified.

Except when such a blob's file name contains a character outside of the US-
ASCII subset.

$ ./lconvert á.qm  
Cannot open á.qm: No such file or directory
$ LC_ALL=C ./lconvert á.qm
Cannot open ??.qm: No such file or directory

Was this just the output or did it try to open this actual file?
$ strace -E LC_ALL=C ./lconvert á.qm |& grep -F .qm   
execve("./lconvert", ["./lconvert", "\303\241.qm"], 0x55c2ef3cc7a0 /* 118 vars 
*/) = 0
openat(AT_FDCWD, "??.qm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (Arquivo ou 
diretório inexistente)
write(2, "Cannot open ??.qm: No such file "..., 45Cannot open ??.qm: No such 
file or directory
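
The two bytes strace shows in argv, \303\241, are the UTF-8 encoding of á,
and both lie above 0x7f. A check for names that a strictly US-ASCII locale
cannot represent could be sketched as below; is_us_ascii() is a hypothetical
helper for illustration, not something lconvert or Qt contains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte of `name` is within the
 * US-ASCII range (0x00..0x7f), 0 otherwise. */
int is_us_ascii(const char *name)
{
    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *)name;
         *p != '\0'; ++p) {
        if (*p > 0x7f)
            return 0;
    }
    return 1;
}
```

Any file name for which this returns 0 is one that gets mangled in the way
shown in the strace output above.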

> > How about making the resetting opt-out, instead of opt-in?
> > QT_NO_OVERRIDE_LC_CTYPE?
> 
> I was more thinking of a runtime option. Like
> 
>   QCoreApplication::setPropagateOurChoices(true)

I think a runtime option like that belongs in QProcessEnvironment.

> Or am I missing some reason why this has to be a compile-time choice?

Yes: whether QString::fromLocal8Bit has to support anything besides UTF-8.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-02 Thread André Pönitz

On Fri, Nov 01, 2019 at 02:49:36PM -0700, Thiago Macieira wrote:
> On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote:
> > > > My personal preference is:
> > > > a) C.UTF-8
> > > > b) override it to force UTF-8 on the same locale
> > > > c) yes
> > > 
> > > I agree with all three choices.
> > 
> > a) and b) are fine with me, "c) yes" sounds like a potential problem.
> > 
> > Most of the child processes I usually call are not Qt based, rather some
> > random unrelated tools, in some cases even quite old random unrelated
> > tools.
> 
> TBH, all the more reason for propagating the choice. Please remember that on 
> any modern Linux or macOS or FreeBSD, they are already running with a UTF-8 
> locale.

With that argument we wouldn't even need to change the locale for the
actual Qt application.

I think we are currently discussing the rare case where the Qt application
is started with a non-UTF-8 locale, and the main question is whether this
was some kind of accident that the Qt application should correct for their
child processes or whether this was intentional.

As you said, any modern Linux or macOS or FreeBSD defaults to UTF-8, so
chances are high that any deviation from that is actually intentional.

> The most common scenario of our setting something is when LC_ALL=C was 
> set in the environment, which will cause us to reset it to C.UTF-8.

I understand that, and even though I am not aware of an actual problem for
my personal uses I am a bit reluctant to expose unsuspecting processes
to a variable-length encoding they may not be aware of. At least there's
a potential for buffer overruns here.

Also, going from "C" to "C.UTF-8" might foil code checking for the 
string "C" explicitly in a child process.

> > > > ones, would not make the same choices. If we do not propagate, we could
> > > > end up with a child process (often helpers) that make different choices
> > > > as to what command-line arguments or pipes or contents in files mean.
> > 
> > If we propagate we'll expose the child processes to locales they might not
> > expect, in circumstances where the user of the system possibly intentionally
> > chose a non-UTF8-locale to make exactly those child processes happy.
> 
> True, but that was done at the expense of running Qt in a largely unsupported 
> and untested scenario. Setting the locale to C means we can't access any file 
> with an 8bit file name; setting to Latin1 would allow that, but produce 
> mojibake in GUI.

Setting to "C" also "works" in practice when blobs are just read and written
unmodified.

> > Effectively, going for "c) yes" deprives the user of a certain level of
> > freedom that is needed, "c) no" is less intrusive.
> > 
> > "c) no" as default and a simple one-liner opt-in for applications that
> > want to engage in "strict parenting" might be an option, too.
> 
> How about making the resetting opt-out, instead of opt-in? 
> QT_NO_OVERRIDE_LC_CTYPE?

I was more thinking of a runtime option. Like 

  QCoreApplication::setPropagateOurChoices(true)

Or am I missing some reason why this has to be a compile-time choice?

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-01 Thread Thiago Macieira

On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote:
> > > My personal preference is:
> > > a) C.UTF-8
> > > b) override it to force UTF-8 on the same locale
> > > c) yes
> > 
> > I agree with all three choices.
> 
> a) and b) are fine with me, "c) yes" sounds like a potential problem.
> 
> Most of the child processes I usually call are not Qt based, rather some
> random unrelated tools, in some cases even quite old random unrelated
> tools.

TBH, all the more reason for propagating the choice. Please remember that on 
any modern Linux or macOS or FreeBSD, they are already running with a UTF-8 
locale. The most common scenario of our setting something is when LC_ALL=C was 
set in the environment, which will cause us to reset it to C.UTF-8.

> > > ones, would not make the same choices. If we do not propagate, we could
> > > end up with a child process (often helpers) that make different choices
> > > as to what command-line arguments or pipes or contents in files mean.
> 
> If we propagate we'll expose the child processes to locales they might not
> expect, in circumstances where the user of the system possibly intentionally
> chose a non-UTF8-locale to make exactly those child processes happy.

True, but that was done at the expense of running Qt in a largely unsupported 
and untested scenario. Setting the locale to C means we can't access any file 
with an 8bit file name; setting to Latin1 would allow that, but produce 
mojibake in GUI.

> Effectively, going for "c) yes" deprives the user of a certain level of
> freedom that is needed, "c) no" is less intrusive.
> 
> "c) no" as default and a simple one-liner opt-in for applications that
> want to engage in "strict parenting" might be an option, too.

How about making the resetting opt-out, instead of opt-in? 
QT_NO_OVERRIDE_LC_CTYPE?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-01 Thread André Pönitz

On Fri, Nov 01, 2019 at 09:21:48AM +, Lars Knoll wrote:
> > There are three questions to be decided:
> > a) What should Qt 6 assume the locale to be, if no locale is set?
> > b) In case a non-UTF-8 locale is set, what should we do?
> > c) Should we propagate our decision to child processes?
> > 
> > My personal preference is:
> > a) C.UTF-8
> > b) override it to force UTF-8 on the same locale
> > c) yes
> 
> I agree with all three choices.

a) and b) are fine with me, "c) yes" sounds like a potential problem.

Most of the child processes I usually call are not Qt based, rather some random
unrelated tools, in some cases even quite old random unrelated tools.

> > ones, would not make the same choices. If we do not propagate, we could
> > end up with a child process (often helpers) that make different choices
> > as to what command-line arguments or pipes or contents in files mean.

If we propagate we'll expose the child processes to locales they might not
expect, in circumstances where the user of the system possibly intentionally
chose a non-UTF8-locale to make exactly those child processes happy.

Effectively, going for "c) yes" deprives the user of a certain level of
freedom that is needed, "c) no" is less intrusive.

"c) no" as default and a simple one-liner opt-in for applications that
want to engage in "strict parenting" might be an option, too.

Andre'

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-01 Thread Edward Welbourne

Lars Knoll (1 November 2019 10:21)
> Thanks for the comprehensive mail.

Seconded :-)

By effectively joining forces with Python to deprecate ASCII in favour
of UTF-8, perhaps we can even put some pressure on POSIX to board the
Unicode train.

> For your bonus (d) below, I’d say we should print a warning if we
> encounter a non UTF-8 locale other than C.

On 31 Oct 2019, at 22:11, Thiago Macieira wrote:
>> Bonus d) should we print a warning when we've made a change?
>>
>> Options are:
>> - yes, for all of them
>> - yes, but only for locales other than "C"
>> - no

I note that [PEP 538] says (on the "C is C-UTF8" part of the subject
matter), under Implementation Notes:

  Attempting to implement the PEP as originally accepted showed that the
  proposal to emit locale coercion and compatibility warnings by default
  simply wasn't practical (there were too many cases where previously
  working code failed because of the warnings, rather than because of
  latent locale handling defects in the affected code).

* [PEP 538] https://www.python.org/dev/peps/pep-0538/

So I cast another vote for
>> - yes, but only for locales other than "C"
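
That rule can be written down as a small predicate. The sketch below is an
illustration only: both function names are hypothetical, and treating a
locale name without a codeset part (such as "de_DE") as non-UTF-8 is an
assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Case-insensitive comparison of the first `len` bytes of `s` with `ref`. */
static int same_ignoring_case(const char *s, size_t len, const char *ref)
{
    if (strlen(ref) != len)
        return 0;
    for (size_t i = 0; i < len; ++i) {
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)ref[i]))
            return 0;
    }
    return 1;
}

/* Returns 1 if the codeset part of `name` is spelled UTF-8 or utf8.
 * A name without a '.' is compared as a whole, so "UTF-8" counts. */
int locale_is_utf8(const char *name)
{
    assert(name != NULL);
    const char *dot = strchr(name, '.');
    const char *codeset = dot ? dot + 1 : name;
    size_t len = strcspn(codeset, "@");   /* stop before any modifier */
    return same_ignoring_case(codeset, len, "UTF-8")
        || same_ignoring_case(codeset, len, "utf8");
}

/* "yes, but only for locales other than C": warn when a non-UTF-8
 * locale other than C/POSIX is about to be overridden. */
int should_warn_on_override(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0)
        return 0;
    return !locale_is_utf8(name);
}
```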

Eddy.

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-01 Thread Lars Knoll

Hi Thiago,

Thanks for the comprehensive mail.

> On 31 Oct 2019, at 22:11, Thiago Macieira wrote:
> 
> Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move 
> QTextCodec support out of QtCore)
> See also: https://www.python.org/dev/peps/pep-0538/
>   https://www.python.org/dev/peps/pep-0540/
> 
> Summary:
> The change above, while removing QTextCodec from our API, had the side-effect 
> of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be 
> recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix 
> systems on Qt 6. This does not apply to Windows because on Windows we cannot 
> reasonably be expected to use UTF-8 for the 8-bit encoding.

I do not think we have to worry about the local 8-bit encoding on Windows
anymore these days. All our interaction with the OS goes through the 16-bit
APIs (i.e. uses UTF-16). I don’t think file content is a huge issue anymore
either, as Windows 10 seems to have added UTF-8 support to most of its tools.

Afaik, we can also use a Unicode API for console and debug output, so the only
piece that’s left might be our users interacting with legacy ANSI APIs. That
should be a rare case and it should be straightforward to port that over to use
the Unicode API instead.

> 
> There are three questions to be decided:
> a) What should Qt 6 assume the locale to be, if no locale is set?
> b) In case a non-UTF-8 locale is set, what should we do?
> c) Should we propagate our decision to child processes?
> 
> My personal preference is:
> a) C.UTF-8
> b) override it to force UTF-8 on the same locale
> c) yes

I agree with all three choices. For your bonus (d) below, I’d say we should 
print a warning if we encounter a non UTF-8 locale other than C.

Cheers,
Lars

> 
> Long explanation:
> 
> On Unix systems, traditionally, the locale is a factor of multiple
> environment variables starting with LC_ (matching macro names from
> <locale.h>), as well as the LANG and LANGUAGE variables. If none of those
> is set, the C and POSIX standards say that the default locale is "C".
> Moreover, POSIX says that the "POSIX" locale is "C" and does not have
> multibyte encodings -- that excludes its encoding from being UTF-8.
> 
> Most modern Unix-based operating systems do set a reasonable, UTF8-based
> locale for the user. They've been doing that for about 15 years -- it was in
> 2003 that this started, when I had to switch from zsh back to bash because
> zsh didn't support UTF-8 yet, but switched back in 2005 when it gained
> support. On top of that, some even more recent Unix offerings -- namely,
> macOS and Android -- enforce that the default (or only!) locale encoding is
> UTF-8.
> 
> Right now, Qt faithfully accepts the locale configuration set by the user in 
> the environment. It can do that because it has QTextCodec, which is also 
> backed by either the libiconv routines or by ICU, so it can deal with any 
> encoding. In properly-configured environments, there's no problem.
> 
> The two Python documents above (PEP-538 and 540) also discuss how Python 
> changed its strategy. I'm proposing that we follow Python and go a little 
> further. 
> 
> What's the problem?
> 
> The problem is where the locale is not set up properly or it is explicitly
> overridden. See PEP-538 for examples in containers, but as can be seen from
> it, Linux will default to "POSIX" or empty, which means Qt will interpret the
> locale as US-ASCII, which is almost never what is intended. Moreover, because
> of our use of QString for file names, any name that contains code units above
> 0x7f will be deemed a filesystem corruption and ignored on directory listing
> -- they are not representable.
> 
> Furthermore, it happens quite often that users and tools set LC_ALL to "C" in
> order to obtain messages in English, so they can be parsed by other tools or
> to be pasted in emails (every time you see me post an error message from a
> console, I've done that). There are alternative locales that can be used,
> like "C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system
> and may not be actually available.
> 
> Arguing that this is an incorrect setup, while factually correct, does not 
> change the fact that it happens.
> 
> Questions and options:
> 
> a) What should Qt 6 assume when the locale is unset or is just "C"?
> 
> This is the case of a simple environment where the variables are unset or have 
> some legacy system-wide defaults, as well as when the user explicitly sets 
> LC_ALL to "C". The options are:
> - accept them as-is
> - assume that C with UTF-8 support was intended
> 
> The first option is what we have today. And if this is our option, then 
> neither question b nor c makes sense.
> 
> The second option implies doing the check in QCoreApplication right after 
> setlocale(LC_ALL, ""):
>     if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
>         setlocale(LC_CTYPE, "C.UTF-8");
> 
> b) What should Qt 6 do i

[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-10-31 Thread Thiago Macieira

Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move 
QTextCodec support out of QtCore)
See also: https://www.python.org/dev/peps/pep-0538/
https://www.python.org/dev/peps/pep-0540/

Summary:
The change above, while removing QTextCodec from our API, had the side-effect 
of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be 
recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix 
systems on Qt 6. This does not apply to Windows because on Windows we cannot 
reasonably be expected to use UTF-8 for the 8-bit encoding.

There are three questions to be decided:
 a) What should Qt 6 assume the locale to be, if no locale is set?
 b) In case a non-UTF-8 locale is set, what should we do?
 c) Should we propagate our decision to child processes?

My personal preference is:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes

Long explanation:

On Unix systems, traditionally, the locale is a factor of multiple environment 
variables starting with LC_ (matching macro names from <locale.h>), as well as 
the LANG and LANGUAGE variables. If none of those is set, the C and POSIX 
standards say that the default locale is "C". Moreover, POSIX says that the 
"POSIX" locale is "C" and does not have multibyte encodings -- that excludes 
its encoding from being UTF-8.
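
To make the lookup order concrete, here is a minimal sketch of how the name 
for the LC_CTYPE category is resolved, assuming the usual POSIX precedence 
(LC_ALL, then LC_CTYPE, then LANG, then "C"). The function name 
effectiveCtypeLocale and the injected lookup callback are hypothetical, not 
Qt or libc API; the callback stands in for getenv() so the logic can be 
exercised without touching the real environment.

```cpp
#include <functional>
#include <initializer_list>
#include <string>

// Hypothetical helper (not a Qt API): returns the locale name that applies to
// the LC_CTYPE category, assuming the POSIX precedence
// LC_ALL > LC_CTYPE > LANG, and falling back to "C" when none is set.
// `lookup` stands in for getenv(); it returns an empty string for unset names.
// LANGUAGE is deliberately absent: it only affects message translation.
std::string effectiveCtypeLocale(
        const std::function<std::string(const std::string &)> &lookup)
{
    for (const char *name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        std::string value = lookup(name);
        if (!value.empty())   // an empty value is treated like an unset one
            return value;
    }
    return "C";
}
```

With an empty environment this yields "C", which is exactly the case 
question (a) is about.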

Most modern Unix-based operating systems do set a reasonable, UTF8-based 
locale for the user. They've been doing that for about 15 years -- it was in 
2003 that this started, when I had to switch from zsh back to bash because zsh 
didn't support UTF-8 yet, but switched back in 2005 when it gained support. On 
top of that, some even more recent Unix offerings -- namely, macOS and Android 
-- enforce that the default (or only!) locale encoding is UTF-8.

Right now, Qt faithfully accepts the locale configuration set by the user in 
the environment. It can do that because it has QTextCodec, which is also 
backed by either the libiconv routines or by ICU, so it can deal with any 
encoding. In properly-configured environments, there's no problem.

The two Python documents above (PEP-538 and 540) also discuss how Python 
changed its strategy. I'm proposing that we follow Python and go a little 
further. 

What's the problem?

The problem is where the locale is not set up properly or it is explicitly 
overridden. See PEP-538 for examples in containers, but as can be seen from it, 
Linux will default to "POSIX" or empty, which means Qt will interpret the 
locale as US-ASCII, which is almost never what is intended. Moreover, because 
of our use of QString for file names, any name that contains code units above 
0x7f will be deemed a filesystem corruption and ignored on directory listing 
-- they are not representable.
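
As an illustration only (this is not how Qt's directory listing is 
implemented), the check that decides whether a file name survives a US-ASCII 
interpretation boils down to looking for any byte above 0x7f. The helper name 
isPureAscii is hypothetical.

```cpp
#include <string>

// Hypothetical helper: true if every byte is in the US-ASCII range
// (0x00-0x7f), i.e. the name can be decoded under the plain "C" locale.
bool isPureAscii(const std::string &bytes)
{
    for (unsigned char c : bytes) {
        if (c > 0x7f)
            return false;   // not representable when the locale is US-ASCII
    }
    return true;
}
```

A name such as "report.txt" passes, while the UTF-8 bytes of "café.txt" 
(which contain 0xc3 0xa9) do not.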

Furthermore, it happens quite often that users and tools set LC_ALL to "C" in 
order to obtain messages in English, so they can be parsed by other tools or 
to be pasted in emails (every time you see me post an error message from a 
console, I've done that). There are alternative locales that can be used, like 
"C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and 
may not be actually available.

Arguing that this is an incorrect setup, while factually correct, does not 
change the fact that it happens.

Questions and options:

a) What should Qt 6 assume when the locale is unset or is just "C"?

This is the case of a simple environment where the variables are unset or have 
some legacy system-wide defaults, as well as when the user explicitly sets 
LC_ALL to "C". The options are:
 - accept them as-is
 - assume that C with UTF-8 support was intended

The first option is what we have today. And if this is our option, then 
neither question b nor c makes sense.

The second option implies doing the check in QCoreApplication right after 
setlocale(LC_ALL, ""):
    if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
        setlocale(LC_CTYPE, "C.UTF-8");
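
Since "C.UTF-8" may not be available everywhere, a slightly more defensive 
variant of the check above could try the alternative names mentioned earlier 
in turn. This is a sketch under that assumption, not the proposed 
QCoreApplication code; the function name upgradeCLocaleToUtf8 is 
hypothetical.

```cpp
#include <clocale>
#include <cstring>
#include <initializer_list>

// Sketch only: if the current locale is the plain "C" locale, try the UTF-8
// flavoured names from the text until the C library accepts one.
// Returns the name that was applied, or nullptr if LC_CTYPE was left alone.
const char *upgradeCLocaleToUtf8()
{
    const char *current = std::setlocale(LC_ALL, nullptr);   // query only
    if (!current || std::strcmp(current, "C") != 0)
        return nullptr;
    for (const char *candidate : {"C.UTF-8", "C.utf8", "UTF-8"}) {
        if (std::setlocale(LC_CTYPE, candidate))   // null: name not available
            return candidate;
    }
    return nullptr;
}
```

It would be called right after setlocale(LC_ALL, ""), in the same place as 
the two-line check. Which candidate (if any) succeeds depends on the 
operating system.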

b) What should Qt 6 do if a locale other than C is set and it is not UTF-8?

This case is not an accident, most of the time. It can happen from time to 
time that someone is simply testing different languages and forces LC_ALL to 
something non-default to see what happens. They'll very quickly try the UTF-8 
versions. But when it's not an accident, it means it was intended. This is the 
general state of Unix prior to 2003, when locales like "en_US", "en_GB", 
"fr_FR", "pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro", 
"de_DE@euro", "nl_NL@euro", etc. Options are:

 - accept them as-is (this is what Python does)
 - assume that the UTF-8 variant was intended, just not properly set

The first option is what we have today, aside from the C locale (question 
(a)). However, keeping that option working implies keeping either ICU or iconv 
working in Qt 6 and we might want to get rid of that dependency for codecs.
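
For illustration, "assume that the UTF-8 variant was intended" can be read 
as rewriting the codeset part of a name of the form 
language[_territory][.codeset][@modifier]. The sketch below does just that. 
The helper name withUtf8Codeset is hypothetical, and keeping the @modifier is 
an assumption: whether the resulting name actually exists on a given system 
would still have to be checked by passing it to setlocale().

```cpp
#include <string>

// Hypothetical helper: rewrites language[_territory][.codeset][@modifier]
// so that the codeset is UTF-8. "C" and "POSIX" are returned unchanged,
// since they are the subject of question (a).
// Assumption: any @modifier is preserved as-is.
std::string withUtf8Codeset(const std::string &locale)
{
    if (locale.empty() || locale == "C" || locale == "POSIX")
        return locale;
    const std::string::size_type at = locale.find('@');
    const std::string modifier =
            (at == std::string::npos) ? std::string() : locale.substr(at);
    std::string base = locale.substr(0, at);   // npos means "up to the end"
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos)
        base.erase(dot);                       // drop the existing codeset
    return base + ".UTF-8" + modifier;
}
```

Under these assumptions "en_US" becomes "en_US.UTF-8" and 
"pt_BR.ISO-8859-1" becomes "pt_BR.UTF-8".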

The second option implies modifying the QCoreApplication change above. Instead 
of explicitly checking for
