Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Lars Knoll (18 April 2023 09:46) replied
>> I think this should be the goal, but I’d vote for a slightly faster
>> schedule.
>>
>> (a) and (b) are things we should be able to do right now.

I (18 April 2023 14:05) commented:
> Sounds sensible to me.

... so have opened QTBUG-112954 and QTBUG-112955 for the opening move of
making it possible for the user to opt in,

	Eddy.
--
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Tuesday, 18 April 2023 00:46:26 PDT Lars Knoll wrote:
> > But anything that goes through QIODevice::read or write (QProcess, QFile,
> > Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that
> > encoding is. Usually for sockets, the protocol is binary and obviates the
> > problem. For files, some file formats help. But in particular for
> > communicating with another process, there's no reliable way.
>
> Communicating through a socket will always require that both sides agree on
> the encoding. That’s not really anything new.
>
> The question is how they encode the data when writing to the socket. If they
> use QTextStream, the data will by default get written in utf8 already today
> (since Qt 6.0). If they explicitly convert the QString to and from a
> specific encoding using QStringConverter/QTextCodec nothing bad will happen
> either.
>
> So the remaining problem comes when they use QString::to/fromLocal8Bit(), as
> that might change from some Windows locale to utf8. Not a problem when
> communicating with a socket between two Qt apps, but might be an issue when
> storing data in a file or communicating with an app that doesn’t use Qt.
>
> But we could consider that a user error, as you really shouldn’t use
> local8bit for anything else than stdin/out and interfacing with 8bit system
> APIs.

Please don't focus on sockets, as we all agree the protocol will usually
inform what the encoding is. Instead, let's focus on QProcess.

Here's a test: write an application that displays in a GUI the output of:

    QProcess proc;
    proc.start("cmd.exe", { "/c", "dir" });

This is an uncommon scenario, but it is representative of any application
that is or simulates a terminal. If you want a more realistic version of the
above, replace "dir" with "nmake" or "ninja": all three will print the names
of files.

Conversely, write the application that keeps its output unmodified so it can
be consumed by its current consumers.

> We did enforce it on Unix systems though with Qt 6. I do believe we can over
> time enforce it on Windows as well, or at least make it the default.

In time, I agree. But we are right now where Unix was in 2003-2005, and with
differences. For Unix systems, there's no UTF-16 API, so the equivalents of
the commands above could afford to be encoding-agnostic: they were a
pass-through of what the filesystem offered. In fact, it was only Qt
applications that had problems because we converted to UTF-16 back in 3.0
(since 2.0) -- that is STILL a complaint we've often heard about our FS API.

> > But I think we should:
> > a) do it for our own applications, since we do know our own code
> > b) advise users somehow that they should opt-in to this
> > c) decide if we want to change from opt-in to opt-out in the medium term
> >    (7.0 for example)
> > d) decide if we want to enforce it in the long-term
> >
> > Option (d) depends on (c). Option (c) informs whether we need a Qt CMake
> > API or whether we can simply say upstream CMake should handle it.
>
> I think this should be the goal, but I’d vote for a slightly faster
> schedule.
>
> (a) and (b) are things we should be able to do right now. All our apps work
> fine on Unix systems with a utf8 locale, so there should be relatively few
> problems doing the switch on Windows. The only thing this requires is a bit
> of CMake infrastructure work (that I believe has been mostly done already),
> and some documentation for our users.
>
> (c) is something we should also announce with a time schedule right now. I
> would go and do this either for 6.8 or 6.9 (ie with the next LTS release or
> directly afterwards). If we announce it now, it gives our users 1.5 to 2
> years to adopt (and they can always opt out afterwards).

I don't think that's realistic because I think we'll find issues.

I think we need to do the conversion of our own applications and tools first,
and figure out what the issues are for ourselves, before we make time
promises. I expect we'll need more than 1.5 years of advance notice that the
opt-in will change to opt-out.

> (d) is something I would do for Qt 7, as that’s the correct time to do those
> changes and clean up our code base

I also think it's unrealistic for the same reason. That's a 4-6 year window
for something that took Unix 17 years, and Unix had a single system-wide
encoding (Windows has three).

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering
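To make the QProcess test concrete: the bytes that come back from a child process carry no encoding tag, so a consumer can at best guess. The following is a minimal sketch of one such guess, using only the C++ standard library. The names `isValidUtf8` and `guessEncoding` are hypothetical (they are not Qt API), and the approach is a heuristic rather than a reliable detector: pure-ASCII output passes either way, and some legacy byte sequences happen to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Overlong encodings,
// surrogate code points and values above U+10FFFF are rejected.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length
        if (lead < 0x80) {
            ++i;                 // plain ASCII byte
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;        // stray continuation byte or invalid lead byte
        }
        if (n - i <= extra)
            return false;        // sequence is cut off at the end of the buffer
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

enum class GuessedEncoding { Utf8, LegacyCodePage };

// Heuristic only: well-formed UTF-8 is taken at face value, anything else is
// assumed to be in some legacy code page that the caller must pick.
GuessedEncoding guessEncoding(const std::string &childOutput)
{
    return isValidUtf8(childOutput) ? GuessedEncoding::Utf8
                                    : GuessedEncoding::LegacyCodePage;
}
```

A caller would run `guessEncoding()` on the raw bytes read from the process and only then choose a decoder. Which legacy code page to fall back to is exactly the piece of information this thread says is hard to obtain.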
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On 17 Apr 2023, at 18:16, Thiago Macieira wrote:
>> But anything that goes through QIODevice::read or write (QProcess,
>> QFile, Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on
>> what that encoding is.

And that's a cross-platform problem for anyone who has to consume data
produced by a (presumably non-Qt) source that's using legacy codecs.
At present our answer is to use Qt-with-ICU or some separate
codec-converter.

>> [snip] What has changed is that the Windows API has matured to the
>> point that this is now a viable choice (previously, it was
>> experimental and known to cause issues). But it's still an
>> application choice; we can't enforce it.

But we *can* document how to do it as part of our "how to package your
application" instructions, thereby encouraging users of Qt to do so.

>> But I think we should:
>> a) do it for our own applications, since we do know our own code
>> b) advise users somehow that they should opt-in to this
>> c) decide if we want to change from opt-in to opt-out in the medium
>>    term (7.0 for example)
>> d) decide if we want to enforce it in the long-term
>>
>> Option (d) depends on (c). Option (c) informs whether we need a Qt
>> CMake API or whether we can simply say upstream CMake should handle
>> it.

Lars Knoll (18 April 2023 09:46) replied
> I think this should be the goal, but I’d vote for a slightly faster
> schedule.
>
> (a) and (b) are things we should be able to do right now.

Sounds sensible to me.

	Eddy.
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> On 17 Apr 2023, at 18:16, Thiago Macieira wrote:
>
> On Monday, 20 March 2023 08:44:30 CDT Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>>
>> [0]
>> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>>
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>>
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>>
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>
> That only works for the file names, not the file contents and other channels.
> For QProcess, we're slightly fortunate that we have UTF-16 API, so the
> encoding that the other application uses for its command-line is irrelevant
> for us.
>
> But anything that goes through QIODevice::read or write (QProcess, QFile,
> Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that
> encoding is. Usually for sockets, the protocol is binary and obviates the
> problem. For files, some file formats help. But in particular for
> communicating with another process, there's no reliable way.

Communicating through a socket will always require that both sides agree on
the encoding. That’s not really anything new.

The question is how they encode the data when writing to the socket. If they
use QTextStream, the data will by default get written in utf8 already today
(since Qt 6.0). If they explicitly convert the QString to and from a specific
encoding using QStringConverter/QTextCodec nothing bad will happen either.

So the remaining problem comes when they use QString::to/fromLocal8Bit(), as
that might change from some Windows locale to utf8. Not a problem when
communicating with a socket between two Qt apps, but might be an issue when
storing data in a file or communicating with an app that doesn’t use Qt.

But we could consider that a user error, as you really shouldn’t use
local8bit for anything else than stdin/out and interfacing with 8bit system
APIs.

>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>>
>> [2]
>> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>
> That was already known at the time, in 2019. What has changed is that the
> Windows API has matured to the point that this is now a viable choice
> (previously, it was experimental and known to cause issues). But it's still
> an application choice; we can't enforce it.

We did enforce it on Unix systems though with Qt 6. I do believe we can over
time enforce it on Windows as well, or at least make it the default.

>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>
> Sorry, no, we can't force users to do it because we don't know if their code
> is safe.
>
> But I think we should:
> a) do it for our own applications, since we do know our own code
> b) advise users somehow that they should opt-in to this
> c) decide if we want to change from opt-in to opt-out in the medium term
>    (7.0 for example)
> d) decide if we want to enforce it in the long-term
>
> Option (d) depends on (c). Option (c) informs whether we need a Qt CMake API
> or whether we can simply say upstream CMake should handle it.

I think this should be the goal, but I’d vote for a slightly faster schedule.

(a) and (b) are things we should be able to do right now. All our apps work
fine on Unix systems with a utf8 locale, so there should be relatively few
problems doing the switch on Windows. The only thing this requires is a bit
of CMake infrastructure work (that I believe has been mostly done already),
and some documentation for our users.

(c) is something we should also announce with a time schedule right now. I
would go and do this either for 6.8 or 6.9 (ie with the next LTS release or
directly afterwards). If we announce it now, it gives our users 1.5 to 2
years to adopt (and they can always opt out afterwards).

(d) is something I would do for Qt 7, as that’s the correct time to do those
changes and clean up our code base.

Cheers,
Lars
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 20 March 2023 08:44:30 CDT Edward Welbourne wrote:
> Thiago Macieira (31 October 2019 22:11) wrote [0]:
> > This RFC (...) is meant to discuss how we'll deal with locales on Unix
> > systems on Qt 6. This does not apply to Windows because on Windows we
> > cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>
> [0]
> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>
> The GNU make mailing list currently has a thread (starts at [1]) about
> handling of encodings on Windows.
>
> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>
> The discussion there seems to indicate that setting the system code-page
> to UTF-8 can be done in a way that interoperates gracefully with other
> processes and the file system, presumably thanks to the system being
> substantially UTF-16-based, so all 8-bit encodings go via that anyway.

That only works for the file names, not the file contents and other channels.
For QProcess, we're slightly fortunate that we have UTF-16 API, so the
encoding that the other application uses for its command-line is irrelevant
for us.

But anything that goes through QIODevice::read or write (QProcess, QFile,
Q{Udp,Tcp,Local}Socket) will suffer if there's no agreement on what that
encoding is. Usually for sockets, the protocol is binary and obviates the
problem. For files, some file formats help. But in particular for
communicating with another process, there's no reliable way.

> The means to achieve this appear [2] to hinge on setting the active
> codepage for the application in a manifest file, that it gets combined
> with after it is linked.
>
> [2]
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

That was already known at the time, in 2019. What has changed is that the
Windows API has matured to the point that this is now a viable choice
(previously, it was experimental and known to cause issues). But it's still
an application choice; we can't enforce it.

> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

Sorry, no, we can't force users to do it because we don't know if their code
is safe.

But I think we should:
a) do it for our own applications, since we do know our own code
b) advise users somehow that they should opt-in to this
c) decide if we want to change from opt-in to opt-out in the medium term
   (7.0 for example)
d) decide if we want to enforce it in the long-term

Option (d) depends on (c). Option (c) informs whether we need a Qt CMake API
or whether we can simply say upstream CMake should handle it.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering
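For step (a), an application that embeds the manifest could check at startup whether the setting actually took effect. The following is a minimal sketch, assuming the Win32 `GetACP()` call and the fact that 65001 is the code page identifier for UTF-8 (`CP_UTF8`). The helper names are hypothetical, and the non-Windows branch simply assumes UTF-8, in line with what this thread says Qt 6 does on Unix.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// 65001 is the Windows code page identifier for UTF-8 (CP_UTF8).
constexpr unsigned kUtf8CodePage = 65001;

// True if the given code page identifier denotes UTF-8.
bool isUtf8CodePage(unsigned codePage)
{
    return codePage == kUtf8CodePage;
}

// Returns the active ANSI code page of the current process.
// Assumption: outside Windows there is no ACP, so UTF-8 is reported.
unsigned currentAnsiCodePage()
{
#ifdef _WIN32
    return GetACP();
#else
    return kUtf8CodePage;
#endif
}

// True if 8-bit "A" APIs in this process are expected to use UTF-8.
bool processUsesUtf8Acp()
{
    return isUtf8CodePage(currentAnsiCodePage());
}
```

A tool could log a warning when `processUsesUtf8Acp()` returns false, which would help while converting applications one at a time.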
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Wednesday, 22 March 2023 09:48:05 HST Volker Hilsheimer via Development
wrote:
> Even if one Qt 5 application and one Qt 6 application exchange data over a
> local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5
> application continues to run with the system code page, then the Qt 6
> application starting to send UTF-8 encoded data will break this.

QLocalSocket is very rare on Windows. And any decent socket code that is
prepared to work over networks has either used proper 8-bit tagging to
indicate the encoding (since 2001) or plain UTF-8 (since 2003).

The console is already a mess on Windows because it's not just the ACP for
the Win32 "A" API, but also the legacy DOS encoding (the mess that renders my
middle name JosÚ or JosΘ). Since that is already a mess, I don't find it
particularly problematic to see José now... it wouldn't be the first time.
Most Windows applications aren't console applications, so this is a limited
issue. It's also time-limited: those issues should smooth out easily with
proper terminal applications, which is how we solved it in the Unix world
too.

No, the far more likely scenario is interchange via files and via pipes to
child processes. So yes, finding out what the legacy ACP is might be a useful
piece of information. It shouldn't be the toLocal8Bit encoding, but it should
be available should the need arise.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering
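The two garbled spellings above are consistent with one byte being read through different code pages. In Windows-1252 the letter é is the single byte 0xE9, and the same byte means Ú in code page 850 and Θ in code page 437. The sketch below hard-codes just that one byte to show the effect. The function names are hypothetical, and the idea that the name was written as Windows-1252 and displayed by a console using one of the two DOS code pages is an assumption made for illustration.

```cpp
#include <string>

enum class CodePage { Windows1252, Cp850, Cp437 };

// Returns, as UTF-8, the character that the byte 0xE9 stands for in the
// given code page. Only this one byte is covered; it is not a full decoder.
std::string decodeByteE9(CodePage cp)
{
    switch (cp) {
    case CodePage::Windows1252: return "\xC3\xA9"; // U+00E9, e with acute
    case CodePage::Cp850:       return "\xC3\x9A"; // U+00DA, U with acute
    case CodePage::Cp437:       return "\xCE\x98"; // U+0398, Greek capital theta
    }
    return std::string();
}

// Shows how the bytes "Jos" + 0xE9 appear when displayed by a console
// that interprets them in the given code page.
std::string renderJose(CodePage consoleCodePage)
{
    return "Jos" + decodeByteE9(consoleCodePage);
}
```

Calling `renderJose()` with each of the three enumerators yields the correct spelling and the two garbled ones quoted in the message, all from identical input bytes.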
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On 22.03.2023 at 20:48, Volker Hilsheimer wrote:
> Indeed, the many hits in the sql code are mostly from warning output,
> thanks for checking.
>
> But that Postgres supports UTF-8 doesn’t mean that an existing server is
> also configured to use it. If a server is configured to work with e.g.
> ISO_8859_5 encoding, because all Qt clients (which are likely middleware
> servers, so fully controlled) run on Windows machines with a corresponding
> code page, then Qt deciding to encode in UTF-8 instead will break things,
> won’t it? And SQL is just one example.

No, the client encoding is completely unrelated to the encoding on the server
and the database. All three can differ. Even Informix supported this already
15 years ago, iirc. The conversion happens between the client and the server.

Christian
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> On 22 Mar 2023, at 18:58, Christian Ehrlicher wrote:
>
> On 22.03.2023 at 17:35, Volker Hilsheimer via Development wrote:
>> But we use toLocal8Bit in plenty of cases as well. For instance in our Qt
>> SQL APIs.
>
> The only plugin which really uses toLocal8Bit() is the IBase plugin.
> Postgres is using it as a fallback, but according to the docs the utf-8
> encoding is supported by at least PostgreSQL 7.3, so the non-utf-8 part
> should be removed.
>
> The other usages are for qWarning() output.
>
> Will take a look at the IBase stuff to see if we can replace it.

Indeed, the many hits in the sql code are mostly from warning output, thanks
for checking.

But that Postgres supports UTF-8 doesn’t mean that an existing server is also
configured to use it. If a server is configured to work with e.g. ISO_8859_5
encoding, because all Qt clients (which are likely middleware servers, so
fully controlled) run on Windows machines with a corresponding code page,
then Qt deciding to encode in UTF-8 instead will break things, won’t it? And
SQL is just one example.

Even if one Qt 5 application and one Qt 6 application exchange data over a
local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5
application continues to run with the system code page, then the Qt 6
application starting to send UTF-8 encoded data will break this.

Volker
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On 22.03.2023 at 18:58, Christian Ehrlicher wrote:
> On 22.03.2023 at 17:35, Volker Hilsheimer via Development wrote:
>> But we use toLocal8Bit in plenty of cases as well. For instance in our Qt
>> SQL APIs.
>
> The only plugin which really uses toLocal8Bit() is the IBase plugin.

Correction: it's only used during open() and for the event notification.

Cheers,
Christian
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On 22.03.2023 at 17:35, Volker Hilsheimer via Development wrote:
> But we use toLocal8Bit in plenty of cases as well. For instance in our Qt
> SQL APIs.

The only plugin which really uses toLocal8Bit() is the IBase plugin.
Postgres is using it as a fallback, but according to the docs the utf-8
encoding is supported by at least PostgreSQL 7.3, so the non-utf-8 part
should be removed.

The other usages are for qWarning() output.

Will take a look at the IBase stuff to see if we can replace it.

Cheers,
Christian
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Wednesday, 22 March 2023 01:07:12 HST Alvin Wong via Development wrote:
> In reality, most of the debug messages are ASCII, so this issue rarely
> affects anything and I consider it just "a mild annoyance".

And also a Not Our Bug issue.

First, the debuggers should opt in to UTF-16 support, if they can. If they
can't, they should be updated to understand CP_UTF8 manifest executables, if
they are real debuggers. That leaves debugview.exe, which is not a debugger
and therefore doesn't know where the messages are coming from.

This should reduce the annoyance level.

Question: which category does Qt Creator fall into?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering
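A small sketch of why ASCII-only debug messages survive while anything else turns into mojibake: a reader that takes UTF-8 bytes for a single-byte code page maps every byte to one character. The hypothetical helper below reinterprets each byte as ISO-8859-1 and returns the result as UTF-8. It stands in for Windows-1252 only as an approximation, since the two agree for bytes 0xA0 to 0xFF but differ in the range 0x80 to 0x9F.

```cpp
#include <string>

// Treats every input byte as one ISO-8859-1 character and re-encodes the
// result as UTF-8. This mimics a reader that wrongly assumes a single-byte
// code page for data that is really UTF-8.
std::string reinterpretAsLatin1(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII is unchanged
        } else {
            // Code points U+0080..U+00FF take two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

With this helper an ASCII message comes back unchanged, whereas the two UTF-8 bytes of é (0xC3 0xA9) come back as the two characters Ã and ©.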
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> Hi,
>
> I’ve looked into that one when we did the work for Qt 6. The console has
> its own code page that can be set independently from the app, and I believe
> also independently from the system code page. qDebug() should be mostly
> fine, as we’re using OutputDebugStringW() internally and let Windows handle
> this mess.
>
> What it does affect is writing to stdout/err and OutputDebugStringA().

It is unfortunately a bit more messy. OutputDebugString communicates with the
debugger via a debug event which contains an address, then the debugger reads
the debug message from the memory space of the debuggee process.

The documentation of OutputDebugStringW [1] states:

  "In the past, the operating system did not return Unicode strings through
  OutputDebugStringW (ASCII strings were returned instead). To force
  OutputDebugStringW to return Unicode strings, debuggers are required to
  call the WaitForDebugEventEx function to opt into the new behavior. In this
  way, the operating system knows that the debugger supports Unicode and is
  specifically opting into receiving Unicode strings."

  "OutputDebugStringW converts the specified string based on the current
  system locale information and passes it to OutputDebugStringA to be
  displayed. As a result, some Unicode characters may not be displayed
  correctly."

What happens with a debugger that does not call `WaitForDebugEventEx` (e.g.
gdb) is this: The debuggee calls OutputDebugStringW, which converts the debug
string to ACP (UTF-8 in this case) to be passed to OutputDebugStringA. Then
the debugger receives the event and tries to read the debug string from the
debuggee as ACP, but the debugger thinks ACP is the system ACP (Windows-1252,
CP950 or whatever) so it ends up displaying mojibake. The same also happens
with Sysinternals DebugView.

In reality, most of the debug messages are ASCII, so this issue rarely
affects anything and I consider it just "a mild annoyance".

[1]:
https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw

Cheers,
Alvin

On 22/3/2023 17:58, Lars Knoll wrote:
> Hi,
>
>> On 21 Mar 2023, at 17:46, Alvin Wong via Development wrote:
>>
>> Hi,
>>
>> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only
>> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
>>
>> Qt itself should work fine after the bug in QStringConverter had been
>> fixed [1] a while back. (You can also refer to the linked mail thread.
>> [2]) However, as this bug has shown, any code that uses
>> `MultiByteToWideChar` incorrectly or wrongly assumes that `CP_ACP` always
>> refers to a charset in which each character is formed by no more than two
>> bytes will break. Therefore, before switching to UTF-8 as the ACP,
>> application developers have to check their code and other libraries to
>> make sure everything will still work properly after the switch.
>>
>> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
>> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
>>
>> About the CRT, it is true that only UCRT fully supports UTF-8 locale.
>> When compiling with MSVC, you are almost always using UCRT so it should be
>> fine.
>>
>> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64
>> toolchain, the whole toolchain is already configured for a specific CRT.
>> Usually it will be the system MSVCRT. (If it's configured for UCRT, the
>> toolchain author will usually make it clear, because compiled programs
>> will not run out-of-the-box on Windows 8.1 or earlier.) I did not run
>> tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully.
>> mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64
>> toolchains that ship UCRT versions.
>>
>> [3]: https://github.com/niXman/mingw-builds-binaries/releases
>> [4]: https://github.com/mstorsjo/llvm-mingw
>>
>> There are two more problems with enabling UTF-8 ACP using the manifest
>> that I have encountered so far. When a process is running with UTF-8 ACP,
>> there seems to be no API available to get the native system ACP. This can
>> be an issue if, for example, some external tools write files using the
>> system ACP and your program wants to read those files. The other problem
>> (a mild annoyance) is that some debuggers which aren't using updated APIs
>> (gdb for example) do not capture `OutputDebugString` messages in the
>> correct encoding, which affects QDebug output.
>
> I’ve looked into that one when we did the work for Qt 6. The console has
> its own code page that can be set independently from the app, and I believe
> also independently from the system code page. qDebug() should be mostly
> fine, as we’re using OutputDebugStringW() internally and let Windows handle
> this mess.
>
> What it does affect is writing to stdout/err and OutputDebugStringA().
>
>> (Console output encoding is separate from the ACP, so one might also need
>> to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to
>> me.)
>
> Setting the code page for console output should help when writing to
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi,

> On 21 Mar 2023, at 17:46, Alvin Wong via Development wrote:
>
> Hi,
>
> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only
> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
>
> Qt itself should work fine after the bug in QStringConverter had been fixed
> [1] a while back. (You can also refer to the linked mail thread. [2])
> However, as this bug has shown, any code that uses `MultiByteToWideChar`
> incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in
> which each character is formed by no more than two bytes will break.
> Therefore, before switching to UTF-8 as the ACP, application developers have
> to check their code and other libraries to make sure everything will still
> work properly after the switch.
>
> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
>
> About the CRT, it is true that only UCRT fully supports UTF-8 locale. When
> compiling with MSVC, you are almost always using UCRT so it should be fine.
>
> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain,
> the whole toolchain is already configured for a specific CRT. Usually it will
> be the system MSVCRT. (If it's configured for UCRT, the toolchain author will
> usually make it clear, because compiled programs will not run out-of-the-box
> on Windows 8.1 or earlier.) I did not run tests myself, but I would not trust
> MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are
> some examples of mingw-w64 toolchains that ship UCRT versions.
>
> [3]: https://github.com/niXman/mingw-builds-binaries/releases
> [4]: https://github.com/mstorsjo/llvm-mingw
>
> There are two more problems with enabling UTF-8 ACP using the manifest that I
> have encountered so far. When a process is running with UTF-8 ACP, there
> seems to be no API available to get the native system ACP. This can be an
> issue if, for example, some external tools write files using the system ACP
> and your program wants to read those files. The other problem (a mild
> annoyance) is that some debuggers which aren't using updated APIs (gdb for
> example) do not capture `OutputDebugString` messages in the correct
> encoding, which affects QDebug output.

I’ve looked into that one when we did the work for Qt 6. The console has its own code page that can be set independently from the app, and I believe also independently from the system code page. qDebug() should be mostly fine, as we’re using OutputDebugStringW() internally and let Windows handle this mess. What it does affect is writing to stdout/err and OutputDebugStringA().

> (Console output encoding is separate from the ACP, so one might also need to
> call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Setting the code page for console output should help when writing to stdout/err. It’ll require a bit of testing again (it’s been a while since I looked into it), but I believe the console was mostly handling this fine independent of the code page being used by it internally (i.e. Windows would recode the string).

Cheers,
Lars

> Cheers,
> Alvin
>
> On 20/3/2023 21:44, Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>>
>> [0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>>
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>>
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>>
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>>
>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>>
>> [2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>>
>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>>
>> Eddy.

--
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.

Qt itself should work fine after the bug in QStringConverter had been fixed [1] a while back. (You can also refer to the linked mail thread. [2]) However, as this bug has shown, any code that uses `MultiByteToWideChar` incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in which each character is formed by no more than two bytes will break. Therefore, before switching to UTF-8 as the ACP, application developers have to check their code and other libraries to make sure everything will still work properly after the switch.

[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html

About the CRT, it is true that only UCRT fully supports UTF-8 locale. When compiling with MSVC, you are almost always using UCRT so it should be fine.

MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, the whole toolchain is already configured for a specific CRT. Usually it will be the system MSVCRT. (If it's configured for UCRT, the toolchain author will usually make it clear, because compiled programs will not run out-of-the-box on Windows 8.1 or earlier.) I did not run tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64 toolchains that ship UCRT versions.

[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest that I have encountered so far. When a process is running with UTF-8 ACP, there seems to be no API available to get the native system ACP. This can be an issue if, for example, some external tools write files using the system ACP and your program wants to read those files. The other problem (a mild annoyance) is that some debuggers which aren't using updated APIs (gdb for example) do not capture `OutputDebugString` messages in the correct encoding, which affects QDebug output.

(Console output encoding is separate from the ACP, so one might also need to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Cheers,
Alvin

On 20/3/2023 21:44, Edward Welbourne wrote:
> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>> systems on Qt 6. This does not apply to Windows because on Windows we
>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>
> [0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>
> The GNU make mailing list currently has a thread (starts at [1]) about
> handling of encodings on Windows.
>
> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>
> The discussion there seems to indicate that setting the system code-page
> to UTF-8 can be done in a way that interoperates gracefully with other
> processes and the file system, presumably thanks to the system being
> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>
> The means to achieve this appear [2] to hinge on setting the active
> codepage for the application in a manifest file, that it gets combined
> with after it is linked.
>
> [2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>
> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>
> Eddy.
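The warning above about code that assumes at most two bytes per character can be illustrated without any Windows API. The following is a minimal sketch in portable C++, not code from Qt or from the fix referenced in [1]; the helper names `utf8SequenceLength` and `longestSequence` are hypothetical. It only shows that a single character in UTF-8 may take three or four bytes, which is what breaks buffers sized for double-byte code pages.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` cannot start a sequence (continuation byte or invalid lead byte).
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;
}

// Length in bytes of the longest sequence announced by any lead byte in
// `bytes`. Hypothetical helper: it trusts the lead bytes and does not
// validate the continuation bytes that follow them.
std::size_t longestSequence(const std::string &bytes)
{
    std::size_t longest = 0;
    for (std::size_t i = 0; i < bytes.size(); ) {
        std::size_t n = utf8SequenceLength(static_cast<unsigned char>(bytes[i]));
        if (n == 0)
            n = 1;  // step over a stray byte
        if (n > longest)
            longest = n;
        i += n;
    }
    return longest;
}
```

For instance, `longestSequence("\xE2\x82\xAC")` (the euro sign) yields 3, so any conversion buffer sized as "two bytes per character" is too small once the ACP is UTF-8.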
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, 20 Mar 2023 13:44:30 + Edward Welbourne via Development wrote: > The means to achieve this appear [2] to hinge on setting the active > codepage for the application in a manifest file, that it gets combined > with after it is linked. > > [2] > https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page setlocale has support to set UTF-8 locale as well: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?source=recommendations&view=msvc-170#utf-8-support -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 20 March 2023 03:44:30 HST Edward Welbourne via Development wrote:
> There do appear to be some vagaries still, it may depend on UCRT and I'm
> not sure I've really understood it all, but it looks like we may, in
> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

That is indeed the long-term objective, both ours and Microsoft's. The question is only when we will be ready. Do we need to do something to our DLLs? Can we start suggesting the manifest flag for user applications with our CMake support (like windeployqt)? And can we do it now for our own applications?

--
Thiago Macieira - thiago.macieira (AT) intel.com
Cloud Software Architect - Intel DCAI Cloud Engineering
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira (31 October 2019 22:11) wrote [0]: > This RFC (...) is meant to discuss how we'll deal with locales on Unix > systems on Qt 6. This does not apply to Windows because on Windows we > cannot reasonably be expected to use UTF-8 for the 8-bit encoding. [0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html The GNU make mailing list currently has a thread (starts at [1]) about handling of encodings on Windows. [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html The discussion there seems to indicate that setting the system code-page to UTF-8 can be done in a way that interoperates gracefully with other processes and the file system, presumably thanks to the system being substantially UTF-16-based, so all 8-bit encodings go via that anyway. The means to achieve this appear [2] to hinge on setting the active codepage for the application in a manifest file, that it gets combined with after it is linked. [2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page There do appear to be some vagaries still, it may depend on UCRT and I'm not sure I've really understood it all, but it looks like we may, in time, be able to consistently use UTF-8 as 8-bit encoding on Windows. Eddy. -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi,

AFAICS (from the public coin logs that dump the entire set of environment variables), nothing specifically sets any of the locale environment variables. In fact, no variables are set. You can check for yourself, for example from the macOS log from one of the recent qtbase dev integrations:

https://testresults.qt.io/coin/integration/qt/qtbase/tasks/1588073951

The regular Apple Terminal appears to be the entity that sets LC_CTYPE before launching the shell; however, the CI system is not using the Apple Terminal.

I see two options:

(1) Assume that macOS is UTF-8.
(2) Add a script to the provisioning of macOS that always sets the LC_CTYPE environment variable to UTF-8 (or any other environment variable that you'd like).

Can you think of any other ways to resolve this?

Simon

From: Development on behalf of Thiago Macieira
Sent: Tuesday, April 28, 2020 17:42
To: development@qt-project.org
Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

On Tuesday, 28 April 2020 07:20:33 PDT Thiago Macieira wrote:
> On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> > I looked at the patch again and searched a bit around. I think nl_langinfo
> > is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> > all system APIs expect it. I think the CI is well configured and the patch
> > should treat Darwin like Android
>
> nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it
> works just fine. More importantly, setlocale() obeys the LC_ALL behaviour
> to change the locale of the POSIX functions just fine.
>
> What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI
> either unset that or was run from an environment that didn't have it set in
> the first place.

Another possibility is that some script overrode LC_ALL to "C" so as to get non-localised output. Please fix it to override to "C.UTF-8" or something that works on a Mac.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Tuesday, 28 April 2020 08:42:17 PDT Thiago Macieira wrote: > Another possibility is that some script overrode LC_ALL to "C" so as to get > non-localised output. Please fix it to override to "C.UTF-8" or something > that works on a Mac. Found it: it's the test itself. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Tuesday, 28 April 2020 07:20:33 PDT Thiago Macieira wrote:
> On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> > I looked at the patch again and searched a bit around. I think nl_langinfo
> > is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> > all system APIs expect it. I think the CI is well configured and the patch
> > should treat Darwin like Android
>
> nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it
> works just fine. More importantly, setlocale() obeys the LC_ALL behaviour
> to change the locale of the POSIX functions just fine.
>
> What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI
> either unset that or was run from an environment that didn't have it set in
> the first place.

Another possibility is that some script overrode LC_ALL to "C" so as to get non-localised output. Please fix it to override to "C.UTF-8" or something that works on a Mac.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 27 April 2020 13:54:13 PDT Simon Hausmann wrote:
> I looked at the patch again and searched a bit around. I think nl_langinfo
> is “broken” on macOS but it doesn’t matter: everything seems to be utf-8,
> all system APIs expect it. I think the CI is well configured and the patch
> should treat Darwin like Android

nl_langinfo is not broken on Mac. I tested it on 10.14 and 10.15 and it works just fine. More importantly, setlocale() obeys the LC_ALL behaviour to change the locale of the POSIX functions just fine.

What I need is that the CI set LANG or LC_ALL to "UTF-8". Somehow, the CI either unset that or was run from an environment that didn't have it set in the first place.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
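A small check of the kind discussed here could look roughly like the sketch below. It is not the code from the change under review; it only uses the POSIX calls named in the message (`setlocale()` and `nl_langinfo(CODESET)`), and the helper names `isUtf8Codeset` and `environmentCodeset` are made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX: nl_langinfo(), CODESET

// True if `codeset` names UTF-8, ignoring case and hyphens
// (so "UTF-8", "utf-8" and "utf8" all match).
bool isUtf8Codeset(const std::string &codeset)
{
    std::string normalized;
    for (unsigned char c : codeset) {
        if (c != '-')
            normalized += static_cast<char>(std::toupper(c));
    }
    return normalized == "UTF8";
}

// Adopts the locale described by LANG / LC_* in the environment and
// returns the character encoding that locale implies. If the environment
// names no usable locale, the process stays in the "C" locale.
std::string environmentCodeset()
{
    std::setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

Running `isUtf8Codeset(environmentCodeset())` in the CI environment and in an interactive terminal would show whether the two differ, which is the situation the failing self-test points at.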
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi, I looked at the patch again and searched a bit around. I think nl_langinfo is “broken” on macOS but it doesn’t matter: everything seems to be utf-8, all system APIs expect it. I think the CI is well configured and the patch should treat Darwin like Android :-) Simon Am 27.04.2020 um 19:09 schrieb Simon Hausmann : Hi, I can't really think of anything that's changed in the default macOS setup that would affect the locale encoding. The scripts that are run are here: https://code.qt.io/cgit/qt/qt5.git/tree/coin/provisioning/qtci-macos-10.14-x86_64 but I'm not even sure that it's possible to "misconfigure" a macOS installation to not use utf-8. Can you think of any setting to check? Or do you have a little test program to run to verify? Simon From: Development on behalf of Thiago Macieira Sent: Monday, April 27, 2020 18:13 To: development@qt-project.org Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote: > On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote: > > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move > > QTextCodec support out of QtCore) > > See also: https://www.python.org/dev/peps/pep-0538/ > > > > https://www.python.org/dev/peps/pep-0540/ > > Just sending to the mailing list to get more attention: > > The change above cannot integrate because the new warning breaks the QtTest > self-tests because the environment where the tests are run is not UTF-8. Can > the CI be fixed, please? Apologies, I replied thinking the link above was to my change, but that was Rainer's that has since been superseded by Lars's. The change I want to integrate is: https://codereview.qt-project.org/c/qt/qtbase/+/282359 The error from the CI is: FAIL! : tst_Selftests::runSubTest(assert lightxml + stdout junitxml) 'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII, locale "C") is not UTF-8. 
Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems, reconfigure your locale. See the locale(1) manual for more information. ) Note this warning is on a Mac, which is an UTF-8 system. Can the CI please set up the environment properly? -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi, I can't really think of anything that's changed in the default macOS setup that would affect the locale encoding. The scripts that are run are here: https://code.qt.io/cgit/qt/qt5.git/tree/coin/provisioning/qtci-macos-10.14-x86_64 but I'm not even sure that it's possible to "misconfigure" a macOS installation to not use utf-8. Can you think of any setting to check? Or do you have a little test program to run to verify? Simon From: Development on behalf of Thiago Macieira Sent: Monday, April 27, 2020 18:13 To: development@qt-project.org Subject: Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote: > On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote: > > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move > > QTextCodec support out of QtCore) > > See also: https://www.python.org/dev/peps/pep-0538/ > > > > https://www.python.org/dev/peps/pep-0540/ > > Just sending to the mailing list to get more attention: > > The change above cannot integrate because the new warning breaks the QtTest > self-tests because the environment where the tests are run is not UTF-8. Can > the CI be fixed, please? Apologies, I replied thinking the link above was to my change, but that was Rainer's that has since been superseded by Lars's. The change I want to integrate is: https://codereview.qt-project.org/c/qt/qtbase/+/282359 The error from the CI is: FAIL! : tst_Selftests::runSubTest(assert lightxml + stdout junitxml) 'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII, locale "C") is not UTF-8. Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems, reconfigure your locale. See the locale(1) manual for more information. ) Note this warning is on a Mac, which is an UTF-8 system. Can the CI please set up the environment properly? 
-- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Sunday, 26 April 2020 09:22:00 PDT Thiago Macieira wrote:
> On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote:
> > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
> > QTextCodec support out of QtCore)
> > See also: https://www.python.org/dev/peps/pep-0538/
> >
> > https://www.python.org/dev/peps/pep-0540/
>
> Just sending to the mailing list to get more attention:
>
> The change above cannot integrate because the new warning breaks the QtTest
> self-tests because the environment where the tests are run is not UTF-8. Can
> the CI be fixed, please?

Apologies, I replied thinking the link above was to my change, but that was Rainer's that has since been superseded by Lars's. The change I want to integrate is:

https://codereview.qt-project.org/c/qt/qtbase/+/282359

The error from the CI is:

FAIL! : tst_Selftests::runSubTest(assert lightxml + stdout junitxml) 'err.isEmpty()' returned FALSE. (Detected system locale encoding (US-ASCII, locale "C") is not UTF-8.
Qt shall use a UTF-8 locale ("UTF-8") instead. If this causes problems,
reconfigure your locale. See the locale(1) manual for more information.
)

Note this warning is on a Mac, which is a UTF-8 system. Can the CI please set up the environment properly?

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Thursday, 31 October 2019 14:11:05 PDT Thiago Macieira wrote: > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move > QTextCodec support out of QtCore) > See also: https://www.python.org/dev/peps/pep-0538/ > https://www.python.org/dev/peps/pep-0540/ Just sending to the mailing list to get more attention: The change above cannot integrate because the new warning breaks the QtTest self-tests because the environment where the tests are run is not UTF-8. Can the CI be fixed, please? -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Sunday, 17 November 2019 01:55:32 CET Thiago Macieira wrote: > I don't know why QTextCodec is being removed. I don't remember any decisions > in prior QtCS or this mailing list about removing it. We definitely > discussed removing the CJK codecs and their big tables and that can still > be done, with no effect in the API, since QTextCodec is backed by ICU's > ucnv. We may have discussed removing it, but I don't remember a firm > decision. And even if it is firm, after looking at the consequences of > doing so, we may want to reverse our decision. Update: after talking to Lars during QtCS, he said that he thinks the QTextCodec API is poorly designed and should be replaced. I agree. But that doesn't mean we need to remove the *functionality*, just deprecate the API. I'll bring this up during the QtCore session tomorrow to see if we want to invest time creating a new API, hopefully for 5.15, so code can begin porting before the 6.0 time. That way, we could move QTextCodec out of QtCore. > 1) QTextCodec in the API > I think we cannot do without it, it'll have to stay in one way or another. > So the question reduces to whether it should stay in QtCore or be moved to > another library. Given the QXmlStreamReader problem above, it's probably > best to keep it in QtCore, actually. > > QTextCodec has some API limitations but they can be fixed. It's not > necessary for us to remove it: it's not *that* broken. This is now TBD, depending on finding a good design and whether it can be done incrementally in QTextCodec. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> I see no reason why we can't keep the QTextCodec _interface_ in Qt Core,
> together with some interface to register new codecs, provide UTF-* directly,
> and let the "fancy" ones live on in a separate module, plugging them in
> at runtime.

My opinion is the same. Keep QTextCodec in QtCore with only UTF encodings. All others, like ICU and the conversion tables, move to a module and are only enabled when the users choose to do so.

Rainer
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Tuesday, 19 November 2019 00:23:52 CET Thiago Macieira wrote:
> I wasn't referring to QTextCodec.
>
> I was referring to these files:

Sorry, race condition. The Ctrl for Ctrl+F1 was pressed too early and matched the Enter for the next line causing Ctrl+Enter (Send).

$ ls -1 src/corelib/codecs/*~*qtextcodec*~*icu*~*utf*~*windows*
src/corelib/codecs/codecs.pri
src/corelib/codecs/codecs.qdoc
src/corelib/codecs/cp949codetbl_p.h
src/corelib/codecs/qbig5codec.cpp
src/corelib/codecs/QBIG5CODEC_LICENSE.txt
src/corelib/codecs/qbig5codec_p.h
src/corelib/codecs/QBKCODEC_LICENSE.txt
src/corelib/codecs/qeucjpcodec.cpp
src/corelib/codecs/QEUCJPCODEC_LICENSE.txt
src/corelib/codecs/qeucjpcodec_p.h
src/corelib/codecs/qeuckrcodec.cpp
src/corelib/codecs/QEUCKRCODEC_LICENSE.txt
src/corelib/codecs/qeuckrcodec_p.h
src/corelib/codecs/qgb18030codec.cpp
src/corelib/codecs/qgb18030codec_p.h
src/corelib/codecs/qiconvcodec.cpp
src/corelib/codecs/qiconvcodec_p.h
src/corelib/codecs/qisciicodec.cpp
src/corelib/codecs/qisciicodec_p.h
src/corelib/codecs/qjiscodec.cpp
src/corelib/codecs/QJISCODEC_LICENSE.txt
src/corelib/codecs/qjiscodec_p.h
src/corelib/codecs/qjpunicode.cpp
src/corelib/codecs/qjpunicode_p.h
src/corelib/codecs/qlatincodec.cpp
src/corelib/codecs/qlatincodec_p.h
src/corelib/codecs/qsimplecodec.cpp
src/corelib/codecs/qsimplecodec_p.h
src/corelib/codecs/qsjiscodec.cpp
src/corelib/codecs/QSJISCODEC_LICENSE.txt
src/corelib/codecs/qsjiscodec_p.h
src/corelib/codecs/qt_attribution.json
src/corelib/codecs/qtsciicodec.cpp
src/corelib/codecs/QTSCIICODEC_LICENSE.txt
src/corelib/codecs/qtsciicodec_p.h

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 18 November 2019 19:48:24 CET André Pönitz wrote:
> > But we should not keep our codecs (aside from the UTF ones) because of
> > that.
>
> Why not?
>
> I see no reason why we can't keep the QTextCodec _interface_ in Qt Core,
> together with some interface to register new codecs, provide UTF-* directly,
> and let the "fancy" ones live on in a separate module, plugging them in at
> runtime.

I wasn't referring to QTextCodec.

I was referring to these files:

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, Nov 18, 2019 at 07:09:30PM +0100, Thiago Macieira wrote:
> On Monday, 18 November 2019 17:05:29 CET Lars Knoll wrote:
> > > On 18 Nov 2019, at 17:00, Kevin Kofler wrote:
> > >
> > > Thiago Macieira wrote:
> > >
> > >> The codecs we want to remove are just big tables of mapping old, legacy
> > >> codecs to UTF-16. We can easily remove those.
> > >>
> > >> After that, removal of QTextCodec itself is not a big gain.
> > >
> > > So let me ask once again: Is ICU not already a hard requirement for Qt on
> > > *nix systems? So why can we not just rely on ICU's tables?
> >
> > No, it’s not a hard requirement. And especially for low end embedded
> > systems, we also want to keep it that way.
>
> But we should not keep our codecs (aside from the UTF ones) because of
> that.

Why not?

I see no reason why we can't keep the QTextCodec _interface_ in Qt Core, together with some interface to register new codecs, provide UTF-* directly, and let the "fancy" ones live on in a separate module, plugging them in at runtime.

Andre'
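The split suggested here (interface and UTF codecs in the core, legacy codecs registered at runtime) could be sketched as follows. This is an illustration in plain C++, not the QTextCodec API and not a proposal from the thread: `CodecRegistry`, `Decoder`, `decodeUtf8` and `decodeLatin1` are hypothetical names, and the UTF-8 decoder assumes well-formed input and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical decoder signature: raw bytes in, UTF-32 code points out.
using Decoder = std::function<std::u32string(const std::string &)>;

// Built-in decoder. Assumes well-formed UTF-8; performs no validation.
std::u32string decodeUtf8(const std::string &bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out += cp;
        i += extra + 1;
    }
    return out;
}

// Stand-in for a legacy codec that would live in a separate module.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out += static_cast<char32_t>(b);  // Latin-1 maps 1:1 to U+0000..U+00FF
    return out;
}

class CodecRegistry
{
public:
    // Only the UTF codec is available out of the box.
    CodecRegistry() { decoders_["UTF-8"] = decodeUtf8; }

    // Called by an add-on module to plug in further codecs at runtime.
    void registerCodec(const std::string &name, Decoder decoder)
    {
        decoders_[name] = std::move(decoder);
    }

    // Returns false and leaves `out` untouched if no codec named `name` exists.
    bool decode(const std::string &name, const std::string &bytes,
                std::u32string &out) const
    {
        const auto it = decoders_.find(name);
        if (it == decoders_.end())
            return false;
        out = it->second(bytes);
        return true;
    }

private:
    std::map<std::string, Decoder> decoders_;
};
```

With this shape, an application that never loads the add-on module pays only for the UTF decoder, while one that calls `registerCodec("ISO-8859-1", decodeLatin1)` gets the legacy codec through the same `decode()` entry point.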
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 18 November 2019 17:05:29 CET Lars Knoll wrote:
> > On 18 Nov 2019, at 17:00, Kevin Kofler wrote:
> >
> > Thiago Macieira wrote:
> >
> >> The codecs we want to remove are just big tables of mapping old, legacy
> >> codecs to UTF-16. We can easily remove those.
> >>
> >> After that, removal of QTextCodec itself is not a big gain.
> >
> > So let me ask once again: Is ICU not already a hard requirement for Qt on
> > *nix systems? So why can we not just rely on ICU's tables?
>
> No, it’s not a hard requirement. And especially for low end embedded
> systems, we also want to keep it that way.

But we should not keep our codecs (aside from the UTF ones) because of that.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> On 18 Nov 2019, at 17:00, Kevin Kofler wrote: > > Thiago Macieira wrote: >> The codecs we want to remove are just big tables of mapping old, legacy >> codecs to UTF-16. We can easily remove those. >> >> After that, removal of QTextCodec itself is not a big gain. > > So let me ask once again: Is ICU not already a hard requirement for Qt on > *nix systems? So why can we not just rely on ICU's tables? No, it’s not a hard requirement. And especially for low end embedded systems, we also want to keep it that way. Cheers, Lars ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira wrote: > The codecs we want to remove are just big tables of mapping old, legacy > codecs to UTF-16. We can easily remove those. > > After that, removal of QTextCodec itself is not a big gain. So let me ask once again: Is ICU not already a hard requirement for Qt on *nix systems? So why can we not just rely on ICU's tables? Kevin Kofler ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 18 November 2019 00:12:19 CET Giuseppe D'Angelo via Development wrote:
> I don't know either. Is it to make QtCore smaller? Wasn't the feature
> system ("Qt Lite") supposed to address that? Or is it to make it less of
> a "kitchen sink", and split it in smaller libraries? Could that mean
> having QTextCodec in its own library, and QXmlStreamReader in another
> (that depends on the former)?

The codecs we want to remove are just big tables of mapping old, legacy codecs to UTF-16. We can easily remove those.

After that, removal of QTextCodec itself is not a big gain.

> > Related to that is the discussion of whether UTF-8 is the only acceptable
> > locale on Unix systems. If we don't have QTextCodec, then we have to have
> > something fixed for QString::fromLocal8Bit and it would necessarily be
> > UTF-8. But even if we do have QTextCodec, that's still a reasonable
> > question: should we assume it is UTF-8? And should we enforce it? Those were
> > the questions in my OP.
>
> Should fromLocal8Bit be following the locale environment instead
> (LC_CTYPE, LC_MESSAGES or similar)?

That's what it does today. The question is whether we can assume those imply UTF-8, like we do when QT_LOCALE_IS_UTF8 is defined.

> > If QTextCodec is not in QtCore, then most likely you can't affect how
> > QtCore and almost all other Qt classes decode 8-bit data into QString,
> > including QTextStream.
>
> See above -- it also means QTextStream goes in some I/O lib that
> contains or depends on the codecs lib.

Or we remove the ability in QTextStream to specify the codec, which is what the proposed change would do. I don't think we can move QTextStream out of QtCore.

> Why do we bother about "saving the world"? A misconfigured system is the
> user's mistake. They should be in charge of fixing it in order to
> address the problem.

That is an option and this is what the qFatal I mentioned would do.

> > For #2, the sub-questions of the OP apply:
> > a) What should Qt 6 assume the locale to be, if no locale is set?
> > b) In case a non-UTF-8 locale is set, what should we do?
> > c) Should we propagate our decision to child processes?
> >
> > My preferences were:
> > a) C.UTF-8
> > b) override it to force UTF-8 on the same locale
> > c) yes
>
> How about
>
> a) either C / C.UTF-8, but warning the user; but I'd up the ante, and
> say: just assert/crash.
>
> b) keep the choice. Silently changing it sounds like a bad idea; we
> should never override the user choices silently.

That means keeping QTextCodec and the ability to work with an arbitrary codec.

> c) no. We shouldn't "fix" subprocesses. They have the right to make
> their own independent decisions.

This is not about fixing the subprocess, but about ensuring that it can talk to the current process. And it's only necessary if in (b) we override, selecting UTF-8. If we don't override or if we forbid running with a non-UTF-8 locale, then we don't need to set the environment.

> Or, on the other hand: what is the chance that a system comes without a
> locale set? What is more likely to conclude, that it's an accident or a
> deliberate setting? If it's an accident, why not being *very* verbose
> about it?

It's extremely unlikely that a Qt application, especially a Qt 6 one, will be run with no locale set. So if the locale isn't set to UTF-8, then it's explicit. The question is whether it was *intentional* to change the codec. As I've argued time and again, changing the locale to English is standard practice in any tool parsing another tool's output. But did they mean to change the codec too?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
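To make option (b), "override it to force UTF-8 on the same locale", concrete, here is a minimal sketch of the string rewrite it implies. The function name `force_utf8_codeset` is hypothetical and is not Qt API; the sketch assumes the usual POSIX-style name layout `language[_territory][.codeset][@modifier]` and only rewrites the name. Whether the resulting locale is actually installed on the system is a separate question.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a Qt function): rewrite a locale name of the form
 * language[_territory][.codeset][@modifier] so that its codeset is UTF-8,
 * keeping language, territory and modifier untouched.
 * Returns 0 on success, -1 if the result does not fit into `out`.
 */
int force_utf8_codeset(const char *name, char *out, size_t outsize)
{
    assert(name != NULL && out != NULL);

    const char *at = strchr(name, '@');   /* start of "@modifier", if any */
    const char *dot = strchr(name, '.');  /* start of ".codeset", if any */
    if (dot != NULL && at != NULL && at < dot)
        dot = NULL;                       /* a '.' inside the modifier is no codeset */

    /* The base name ends at the codeset if present, else at the modifier. */
    size_t baselen = dot != NULL ? (size_t)(dot - name)
                   : at != NULL  ? (size_t)(at - name)
                                 : strlen(name);

    int n = snprintf(out, outsize, "%.*s.UTF-8%s",
                     (int)baselen, name, at != NULL ? at : "");
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

With this sketch, `de_DE.ISO-8859-15@euro` becomes `de_DE.UTF-8@euro` and a bare `C` becomes `C.UTF-8`, which is the same-locale override described in the preferences above.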
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> On 18. Nov 2019, at 00:12, Giuseppe D'Angelo via Development > wrote: > > Il 17/11/19 01:55, Thiago Macieira ha scritto: >> Hi >> Sorry, it looks like this thread is not progressing in a calm and reasoned >> manner, the way it was meant to be. And I'm very much to blame. So I >> apologise >> for the strong language and passionate opinions. I'm deleting most of what I >> had written as a reply so we can start over. >> Let's start with your questions: >> On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote: >>> You have not yet answered >>> >>> - why this decision was made >> You know, I don't know. To be frank, I don't know that a decision *was* made. >> It all started with a change (see OP) about removing QTextCodec from the API >> and from QtCore. It seemed reasonable enough but it turned up quite a few >> kinks that hadn't been predicted. One of them, which may still be a >> showstopper, is QXmlStreamReader's inability to handle XML data encoded in >> anything except UTF-8, though a thorough search of all XML files in my system >> turned up exactly zero such files. >> I don't know why QTextCodec is being removed. I don't remember any decisions >> in prior QtCS or this mailing list about removing it. We definitely discussed >> removing the CJK codecs and their big tables and that can still be done, with >> no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have >> discussed removing it, but I don't remember a firm decision. And even if it >> is >> firm, after looking at the consequences of doing so, we may want to reverse >> our decision. > > I don't know either. Is it to make QtCore smaller? Wasn't the feature system > ("Qt Lite") supposed to address that? Or is it to make it less of a "kitchen > sink", and split it in smaller libraries? Could that mean having QTextCodec > in its own library, and QXmlStreamReader in another (that depends on the > former)? 
In QtCore it seems to be used by the MIME database support, and in a serialization backend. So, one would need to think about what to do with these at least.

Then, looking at qtbase, it’s also used for DBUS and in androiddeployqt / -testrunner (for e.g. the manifest file), and RCC of course.

>> Why does Qt Creator need other codecs?

Qt Creator is a generic text editor. A generic text editor is expected to be able to read and write files in different encodings.

--
Eike Ziller
Principal Software Engineer

The Qt Company GmbH
Erich-Thilo-Straße 10
D-12489 Berlin
eike.zil...@qt.io
http://qt.io
Managing directors: Mika Pälsi, Juha Varelius, Mika Harjuaho
Registered office: Berlin, Register court: Amtsgericht Charlottenburg, HRB 144331 B
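As an illustration of what "reading a file in a different encoding" involves, the sketch below converts ISO-8859-1 (Latin-1) bytes to UTF-8. It is not taken from Qt or Qt Creator, and `latin1_to_utf8` is a hypothetical name. Latin-1 is the easy case because every byte value equals its Unicode code point; the legacy codecs discussed in this thread (the CJK ones in particular) need the large mapping tables instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert `len` ISO-8859-1 bytes to UTF-8.
 * `out` must have room for 2 * len + 1 bytes (worst case plus terminator).
 * Returns the number of bytes written, not counting the terminating NUL.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                                  /* ASCII: one byte */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, the Latin-1 byte 0xE9 ("é") becomes the two UTF-8 bytes 0xC3 0xA9, while plain ASCII passes through unchanged.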
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Il 17/11/19 01:55, Thiago Macieira ha scritto: Hi Sorry, it looks like this thread is not progressing in a calm and reasoned manner, the way it was meant to be. And I'm very much to blame. So I apologise for the strong language and passionate opinions. I'm deleting most of what I had written as a reply so we can start over. Let's start with your questions: On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote: You have not yet answered - why this decision was made You know, I don't know. To be frank, I don't know that a decision *was* made. It all started with a change (see OP) about removing QTextCodec from the API and from QtCore. It seemed reasonable enough but it turned up quite a few kinks that hadn't been predicted. One of them, which may still be a showstopper, is QXmlStreamReader's inability to handle XML data encoded in anything except UTF-8, though a thorough search of all XML files in my system turned up exactly zero such files. I don't know why QTextCodec is being removed. I don't remember any decisions in prior QtCS or this mailing list about removing it. We definitely discussed removing the CJK codecs and their big tables and that can still be done, with no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have discussed removing it, but I don't remember a firm decision. And even if it is firm, after looking at the consequences of doing so, we may want to reverse our decision. I don't know either. Is it to make QtCore smaller? Wasn't the feature system ("Qt Lite") supposed to address that? Or is it to make it less of a "kitchen sink", and split it in smaller libraries? Could that mean having QTextCodec in its own library, and QXmlStreamReader in another (that depends on the former)? Related to that is the discussion of whether UTF-8 is the only acceptable locale on Unix systems. If we don't have QTextCodec, then we have to have something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8. 
But even if we do have QTextCodec, that's still a reasonable question: should assume it is UTF-8? And should we enforce it? Those were the questions in my OP. Should fromLocal8Bit be following the locale environment instead (LC_CTYPE, LC_MESSAGES or similar)? 2) QtCore size As I said above, removing the legacy codecs we have code for is not a problem. They are already disabled in Qt builds where ICU is present, so we'd additionally remove them from all other builds. Where ICU is present, there's no loss of functionality for user applications, since ICU provides far more codecs than we do. For those without ICU, it stands to reason that the user chose size so they are aware of the limitations. Plus, one can always instantiate their own QTextCodec and add to the list (at least, with today's implementation). If QTextCodec is not in QtCore, then most likely you can't affect how QtCore and almost all other Qt classes decode 8-bit data into QString, including QTextStream. See above -- it also means QTextStream goes in some I/O lib that contains or depends on the codecs lib. and 3) misconfigured locale systems and filename handling This is probably the biggest problem. As it is right now, when the locale isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode any file names with the 8th bit set. Those file names are considered filesystem corruption. And yet they are quite commonly created by the user outside of English-speaking jurisdictions. Why do we bother about "saving the world"? A misconfigured system is the user's mistake. They should be in charge of fixing it in order to address the problem. I get the impression that this thread was not started as an RFC for an open-ended discussion, but as a staged attempt to provide a figleaf for a pre-determined decision. That was not the intention. That's why I am re-starting it so we can come back to a reasoned approach. 
Anyway, the two independent (but related) decisions we need to make are: 1) do we keep QTextCodec in QtCore? 2) do we want to change we handle legacy (non-UTF8) locales? For #2, the sub-questions of the OP apply: a) What should Qt 6 assume the locale to be, if no locale is set? b) In case a non-UTF-8 locale is set, what should we do? c) Should we propagate our decision to child processes? My preferences were: a) C.UTF-8 b) override it to force UTF-8 on the same locale c) yes How about a) either C / C.UTF-8, but warning the user; but I'd up the ante, and say: just assert/crash. b) keep the choice. Silently changing it sounds like a bad idea; we should never override the user choices silently. c) no. We shouldn't "fix" subprocesses. They have the right to make their own independent decisions. But I think we should. My arguments are that UTF-8 locales are the default in all desktop Linux distributions, all BSDs and on macOS and have been for 15 years. Most embedded systems from the last 5 years at least also have i
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Sunday, 17 November 2019 13:19:27 CET Kevin Kofler wrote:
> Please be warned that C.UTF-8 is a recent introduction. (Has upstream glibc
> even accepted it yet?) So setting the locale to C.UTF-8 will produce warning
> spam or even fatal errors (depending on the application) on many older
> distributions and possibly even on some current ones. (E.g., Fedora has
> introduced this in Fedora 24 and in updates to Fedora 22 and 23. I don't
> know whether this was backported to RHEL releases up to RHEL 7. RHEL 8 has
> probably inherited it from recent Fedora, at least.)

Given that Fedora 31 is current, Fedora 24 is 3 years old. It's probably old enough.

And Python sets LANG to it if the environment is unset.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira wrote:
> 2) QtCore size
> As I said above, removing the legacy codecs we have code for is not a
> problem. They are already disabled in Qt builds where ICU is present, so
> we'd additionally remove them from all other builds. Where ICU is present,
> there's no loss of functionality for user applications, since ICU provides
> far more codecs than we do. For those without ICU, it stands to reason
> that the user chose size so they are aware of the limitations. Plus, one
> can always instantiate their own QTextCodec and add to the list (at least,
> with today's implementation).

Isn't ICU already a hard requirement on *nix? Since we are talking about locales on *nix systems only, we should be able to assume a Qt build with ICU, shouldn't we?

> Turns out, there's one locale that we can be sure that its non-UTF-8
> default is decodable under UTF-8 and that's the "C" locale. So we don't
> *have* to qputenv "C.UTF-8" if the locale is explicitly "C" (as opposed to
> being unset).
>
> But I think we should.

Please be warned that C.UTF-8 is a recent introduction. (Has upstream glibc even accepted it yet?) So setting the locale to C.UTF-8 will produce warning spam or even fatal errors (depending on the application) on many older distributions and possibly even on some current ones. (E.g., Fedora has introduced this in Fedora 24 and in updates to Fedora 22 and 23. I don't know whether this was backported to RHEL releases up to RHEL 7. RHEL 8 has probably inherited it from recent Fedora, at least.)

        Kevin Kofler
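Since the availability of C.UTF-8 is in doubt, one way to avoid the warning spam is to probe for it before exporting it anywhere. The following is a sketch only: `pick_utf8_ctype` is a hypothetical name, and the candidate list (including the `en_US.UTF-8` fallback) is an assumption for illustration, not something proposed in this thread. It relies on `setlocale()` returning NULL when a locale name cannot be honoured.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the first candidate name that setlocale()
 * accepts for LC_CTYPE, or "C" if none is available. The previous LC_CTYPE
 * setting is restored before returning, so the probe has no lasting effect.
 */
const char *pick_utf8_ctype(void)
{
    /* Assumed candidate list; the spellings vary between systems. */
    static const char *const candidates[] = { "C.UTF-8", "C.utf8", "en_US.UTF-8" };
    const char *chosen = "C";

    /* setlocale(category, NULL) only queries. Copy the result, because a
       later setlocale() call may overwrite the returned string. */
    const char *current = setlocale(LC_CTYPE, NULL);
    assert(current != NULL);
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return chosen;
    strcpy(saved, current);

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            chosen = candidates[i];   /* this locale exists on this system */
            break;
        }
    }

    setlocale(LC_CTYPE, saved);       /* undo the probe */
    free(saved);
    return chosen;
}
```

A caller would only put the returned name into the environment of child processes if it is something other than plain "C"; on a system without any of the candidates the sketch changes nothing.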
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi Sorry, it looks like this thread is not progressing in a calm and reasoned manner, the way it was meant to be. And I'm very much to blame. So I apologise for the strong language and passionate opinions. I'm deleting most of what I had written as a reply so we can start over. Let's start with your questions: On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote: > You have not yet answered > > - why this decision was made You know, I don't know. To be frank, I don't know that a decision *was* made. It all started with a change (see OP) about removing QTextCodec from the API and from QtCore. It seemed reasonable enough but it turned up quite a few kinks that hadn't been predicted. One of them, which may still be a showstopper, is QXmlStreamReader's inability to handle XML data encoded in anything except UTF-8, though a thorough search of all XML files in my system turned up exactly zero such files. I don't know why QTextCodec is being removed. I don't remember any decisions in prior QtCS or this mailing list about removing it. We definitely discussed removing the CJK codecs and their big tables and that can still be done, with no effect in the API, since QTextCodec is backed by ICU's ucnv. We may have discussed removing it, but I don't remember a firm decision. And even if it is firm, after looking at the consequences of doing so, we may want to reverse our decision. Related to that is the discussion of whether UTF-8 is the only acceptable locale on Unix systems. If we don't have QTextCodec, then we have to have something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8. But even if we do have QTextCodec, that's still a reasonable question: should assume it is UTF-8? And should we enforce it? Those were the questions in my OP. > - who did it Considering I don't know a decision *was* made, I don't think we can say who made it. 
> - what the actual problem to solve was Three things being tackled, all related: 1) QTextCodec in the API I think we cannot do without it, it'll have to stay in one way or another. So the question reduces to whether it should stay in QtCore or be moved to another library. Given the QXmlStreamReader problem above, it's probably best to keep it in QtCore, actually. QTextCodec has some API limitations but they can be fixed. It's not necessary for us to remove it: it's not *that* broken. 2) QtCore size As I said above, removing the legacy codecs we have code for is not a problem. They are already disabled in Qt builds where ICU is present, so we'd additionally remove them from all other builds. Where ICU is present, there's no loss of functionality for user applications, since ICU provides far more codecs than we do. For those without ICU, it stands to reason that the user chose size so they are aware of the limitations. Plus, one can always instantiate their own QTextCodec and add to the list (at least, with today's implementation). If QTextCodec is not in QtCore, then most likely you can't affect how QtCore and almost all other Qt classes decode 8-bit data into QString, including QTextStream. and 3) misconfigured locale systems and filename handling This is probably the biggest problem. As it is right now, when the locale isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode any file names with the 8th bit set. Those file names are considered filesystem corruption. And yet they are quite commonly created by the user outside of English-speaking jurisdictions. Your example of setting LC_ALL (or another environment variable) to force the locale to print something that either can be parsed or shared is one such problematic scenario. On one hand, you may need it to get some older tools to parse output; on the other, it makes Qt applications unable to even see some files exist. > - why LC_*ALL* comes into play Because it's the override. 
If we decide to override and LC_ALL is set, then we have no choice but to override it. If it is unset, then we can leave it unset too, but may need to override LC_CTYPE. > I get the impression that this thread was not started as an RFC for an > open-ended discussion, but as a staged attempt to provide a figleaf for > a pre-determined decision. That was not the intention. That's why I am re-starting it so we can come back to a reasoned approach. Anyway, the two independent (but related) decisions we need to make are: 1) do we keep QTextCodec in QtCore? 2) do we want to change we handle legacy (non-UTF8) locales? For #2, the sub-questions of the OP apply: a) What should Qt 6 assume the locale to be, if no locale is set? b) In case a non-UTF-8 locale is set, what should we do? c) Should we propagate our decision to child processes? My preferences were: a) C.UTF-8 b) override it to force UTF-8 on the same locale c) yes The reason for my preference in propagating to child processes is so that we have a consistent protocol between
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Fri, Nov 15, 2019 at 05:47:04PM -0800, Thiago Macieira wrote: > On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote: > > > The questions are: > > > 1) do we want to prevent another library from accidentally unsetting it? > > > 2) do we want child processes to use the same? > > > > > > Note the answers for both questions must be the same, for the solution is > > > the same. So either both yeses or both nos. > > > > This "answers for both questions must be the same" requirement is arbitrary. > > > > The fact that one known solution results in same answers to both is in > > no way proof that no other solutions exist. > > I don't see how to prevent another library doing setlocale(LC_ALL, "") from > not overriding Qt's default other than to make setlocale(LC_ALL, "") do what > we want. Since what it does is read the environment, the only solution is to > change the environment. You haven't even explained why this prevention would be needed, what exact bad would happen if you don't do that, and you cannot prevent the other library from setting an explicit locale anyway. With modifying the environment, you just catch the "" case, one out of many, and I'll continue to argue that it's not Qt's business to try even that. > > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You > > > can either deal with binary data or with UTF-8 text, there's no middle > > > ground. > > Now that's an interesting twist. > > > > The latest memo I did (not...) get was that codecs are to be moved into a > > separate module. Which is actually ok, as it allows user code using codecs > > to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss > > + win". > > Sure. But that's no different than using ICU or writing your own code to > convert from binary to text. QString will not support it on its own. > > > "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is > > definitely news to me. 
I've not seen this being discussed, neither here nor > > within the part of the company that I usually talk to. > > You just said yourself, above. I did not say that. > If QTextCodec moves to another library, we have no codecs in QtCore. Not having codecs in QtCore does not mean QtCore cannot use codecs. One could have a setup where Qt Core just has the bare minimum, with stubs for other codecs that are used when that QtCodecs lib is linked. Actually that's what I had expected something like that to be the targeted solution once I heard that text codecs move out of QtCore. > > So when and where was this decision made, by whom, and why? > > > > Did that person bother to check e.g. whether Qt Creator uses non-UTF-8 > > codecs in some cases and did that person come to the conclusion that any > > such use is bad and deserves to die? > > Probably not. Why does Qt Creator need other codecs? My guess would be to handle code bases that are not (a subset) of UTF-8. > > > you're arguing that here are broken applications that won't handle > > > C.UTF-8 correctly, without giving as single example. > > > > ... is of course not true: > > > > 1. I did not claim there were "broken" applications that won't handle > >C.UTF-8 "correctly", I claimed that there are applications that react > >differently to C.UTF-8. > > Different behaviour is *exactly* what we want. We want this: Who is 'we'? > $ LC_ALL=C.UTF-8 ls á > ls: cannot access 'á': No such file or directory > > not this: > > $ LC_ALL=C ls á > ls: cannot access ''$'\303\241': No such file or directory If you do not touch the environment, the user gets what he asked for. He will most likely want not to see ''$'\303\241, but if he explicitly asks for it in the environment he sets up, it's not Qt's job to override this. > I thought the argument would be that despite being what we wanted, Who is 'we'? > it would break certain scenarios. But I haven't seen any examples of breakage. 
> > > gcc produces different output under C and C.UTF-8: > > > > echo x | LC_CTYPE=C gcc -xc - > > :1:1: error: expected '=', ',', ';', 'asm' or '__attribute__' > > at end of input > > > > echo x | LC_CTYPE=C.UTF-8 gcc -xc - > > :1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ > > at end of input > > > > As an additional twist, this different behaviour does not require fancy > > input, input is plain ASCII in both cases. > > > > Output parsers expecting "'" e.g. to produce a set recommendations how > > to quick-fix such problems in an IDE will break. > > Any application that is parsing GCC output is already setting LC_ALL in the > child process's environment. Not necessarily, and if so, it's rather 'C', not 'C.UTF-8'. > Otherwise, they'd be getting possibly translated > messages and we all know that the order of the messages could be different. > Not to mention that instead of "" or even “” we could see «» or „“. Also the point here is not that the particular case. Each
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Friday, 15 November 2019 00:52:55 PST Eike Ziller wrote:
> - You state that as if that were a fact imposed on us from some external
> entity, and as if that patch were already in.

No, but that's the direction that started this conversation. If we're not going to do that, then the entire discussion is moot.

> - I thought QTextCodec will
> still be available, even if from a separate module. If that plan has
> changed, provide a patch for Qt Creator as well.

It will, but we'll probably need a session next week to discuss in what form. If we remove the codecs we kept and only use ICU, then QTextCodec will have negligible cost and could stay in QtCore.

If it stays in QtCore, we still have a question whether QString::fromLocal8Bit shall assume it's UTF-8 on Unix systems.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
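The open question about QString::fromLocal8Bit comes down to which locale name is in effect for character classification and whether it names UTF-8. A minimal sketch of that decision, using the POSIX precedence LC_ALL, then LC_CTYPE, then LANG (an empty value counts as unset). Both function names are hypothetical, this is not how Qt implements it, and matching the codeset by name is a simplification of asking the C library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_set(const char *s) { return s != NULL && s[0] != '\0'; }

/*
 * Hypothetical helper: pick the locale name that governs LC_CTYPE from the
 * values of the three environment variables (pass NULL for unset ones).
 * LC_ALL overrides LC_CTYPE, which overrides LANG; with none set, "C".
 */
const char *effective_ctype(const char *lc_all, const char *lc_ctype,
                            const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_ctype))
        return lc_ctype;
    if (is_set(lang))
        return lang;
    return "C";
}

/*
 * Hypothetical helper: 1 if the codeset part of a locale name spells UTF-8
 * (case-insensitive, '-' optional), 0 otherwise. Names without a codeset,
 * such as "C", count as non-UTF-8.
 */
int names_utf8(const char *locale)
{
    assert(locale != NULL);
    const char *dot = strchr(locale, '.');
    if (dot == NULL)
        return 0;

    const char *want = "utf8";
    size_t k = 0;
    for (const char *p = dot + 1; *p != '\0' && *p != '@'; ++p) {
        if (*p == '-')
            continue;                               /* "UTF-8" vs "utf8" */
        if (want[k] == '\0' || tolower((unsigned char)*p) != want[k])
            return 0;
        ++k;
    }
    return want[k] == '\0';
}
```

In a real program the three arguments would come from `getenv("LC_ALL")`, `getenv("LC_CTYPE")` and `getenv("LANG")`. The first function also shows why LC_ALL keeps coming up in the thread: if it is set, changing LC_CTYPE or LANG alone has no effect.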
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote: > > The questions are: > > 1) do we want to prevent another library from accidentally unsetting it? > > 2) do we want child processes to use the same? > > > > Note the answers for both questions must be the same, for the solution is > > the same. So either both yeses or both nos. > > This "answers for both questions must be the same" requirement is arbitrary. > > The fact that one known solution results in same answers to both is in > no way proof that no other solutions exist. I don't see how to prevent another library doing setlocale(LC_ALL, "") from not overriding Qt's default other than to make setlocale(LC_ALL, "") do what we want. Since what it does is read the environment, the only solution is to change the environment. > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You > > can either deal with binary data or with UTF-8 text, there's no middle > > ground. > Now that's an interesting twist. > > The latest memo I did (not...) get was that codecs are to be moved into a > separate module. Which is actually ok, as it allows user code using codecs > to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss > + win". Sure. But that's no different than using ICU or writing your own code to convert from binary to text. QString will not support it on its own. > "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is > definitely news to me. I've not seen this being discussed, neither here nor > within the part of the company that I usually talk to. You just said yourself, above. If QTextCodec moves to another library, we have no codecs in QtCore. That means the rest of Qt will not support other codecs. > So when and where was this decision made, by whom, and why? > > Did that person bother to check e.g. whether Qt Creator uses non-UTF-8 > codecs in some cases and did that person come to the conclusion that any > such use is bad and deserves to die? 
Probably not. Why does Qt Creator need other codecs? > > you're arguing that here are broken applications that won't handle > > C.UTF-8 correctly, without giving as single example. > > ... is of course not true: > > 1. I did not claim there were "broken" applications that won't handle >C.UTF-8 "correctly", I claimed that there are applications that react >differently to C.UTF-8. Different behaviour is *exactly* what we want. We want this: $ LC_ALL=C.UTF-8 ls á ls: cannot access 'á': No such file or directory not this: $ LC_ALL=C ls á ls: cannot access ''$'\303\241': No such file or directory I thought the argument would be that despite being what we wanted, it would break certain scenarios. But I haven't seen any examples of breakage. > gcc produces different output under C and C.UTF-8: > > echo x | LC_CTYPE=C gcc -xc - > :1:1: error: expected '=', ',', ';', 'asm' or '__attribute__' > at end of input > > echo x | LC_CTYPE=C.UTF-8 gcc -xc - > :1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ > at end of input > > As an additional twist, this different behaviour does not require fancy > input, input is plain ASCII in both cases. > > Output parsers expecting "'" e.g. to produce a set recommendations how > to quick-fix such problems in an IDE will break. Any application that is parsing GCC output is already setting LC_ALL in the child process's environment. Otherwise, they'd be getting possibly translated messages and we all know that the order of the messages could be different. Not to mention that instead of "" or even “” we could see «» or „“. Changing the environment of a child process is not going to go away. If you're telling me that you're setting the environment before the Qt application to cope with its brokenness, I will ask why that application hasn't been fixed in the 16 years since UTF-8 environments became a thing. 
And we can provide a way to force Qt not to set the environment, for those weird cases where you musts deal with broken, proprietary cr#p that won't be fixed until the heat death of the Universe. And I will ask why everyone else must pay a performance price for the sake of those old, broken applications that even the maintainer isn't fixing anymore? > #include > #include > #include > > int main() > { > if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0) > abort(); > } > > runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8". Strawman example, this doesn't happen in reality. See my exhaustive search for all such checks in an entire Linux distribution. I'm asking for *real* situations. > While contreived in this form, there _is_ code even in Creator checking > for "C" literally, raising the suspicion that this might happen in other > applications, too. Oh, checking for "C" literally does exist, there were several in my search. About half of
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Thu, Nov 14, 2019 at 11:20:08PM -0800, Thiago Macieira wrote: > On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote: > > *Within* a Qt application consisting of Qt library, other libraries, > > and actual user code it's mildly presumptous for one library to impose > > random unnecessay restrictions on user code and other libraries. > > That boat sailed 20 years ago when we started calling setlocale() from > QCoreapplication. We set the locale, period. 1. I was refering to putenv, not setlocale. 2. Even for setlocale, the point is not _whether_ it is called, but _how_. setlocale(..., 0) e.g. only queries, does not change anything. QCoreapplication currently calls setlocale(LC_ALL, ""). This is fine. This accepts the user's choice of environment as authorative. It also works well in practice. I can run something like LC_PAPER=de_LU LC_TIME=en_US.UTF-8 LC_COLLATE=C qtcreator and it will not only "just work" for the application itself, but also be properly passed on to e.g. a terminal started from within. So no boat has sailed, let alone 20 years ago. The boat _will_ sail once there when you put a non-empty string there, overriding user's choice. > The questions are: > 1) do we want to prevent another library from accidentally unsetting it? > 2) do we want child processes to use the same? > > Note the answers for both questions must be the same, for the solution is the > same. So either both yeses or both nos. This "answers for both questions must be the same" requirement is arbitrary. The fact that one known solution results in same answers to both is in no way proof that no other solutions exist. But it looks like there's no need to discuss _that_, as my answers are "no" and "no". > > Making assumptions on the controlability of content of a input stream is > > questionable. 
The proposed method of changing the environment for child > > processes is no guarantee on what the child actually produces, and the > > Qt application still has to be prepared to handle non-Utf-8 or otherwise > > "broken" input. > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can > either deal with binary data or with UTF-8 text, there's no middle ground. Now that's an interesting twist. The latest memo I did (not...) get was that codecs are to be moved into a separate module. Which is actually ok, as it allows user code using codecs to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss + win". "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is definitely news to me. I've not seen this being discussed, neither here nor within the part of the company that I usually talk to. So when and where was this decision made, by whom, and why? Did that person bother to check e.g. whether Qt Creator uses non-UTF-8 codecs in some cases and did that person come to the conclusion that any such use is bad and deserves to die? > > This discussion so far claimed the existance of a range of problems > > without giving an actual example. Then it goes on to propose a shotgut > > approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings" > > like categories that are a bit more fine grained than LC_ALL? Bear with > > me when I do not have the impression that Qt will be the right context > > to accept such "obligations". > > The same argument can be made for your statements: Sure, one could do that. But that would _my_ argument not make go away, nor compensate for the current lack of answers to the questions I asked. And ... > you're arguing that here are broken applications that won't handle > C.UTF-8 correctly, without giving as single example. ... is of course not true: 1. 
I did not claim there were "broken" applications that won't handle C.UTF-8 "correctly"; I claimed that there are applications that react differently to C.UTF-8.

2. I _did_ give two examples. I can repeat them here:

2.1) https://lists.qt-project.org/pipermail/development/2019-November/037815.html

gcc produces different output under C and C.UTF-8:

    echo x | LC_CTYPE=C gcc -xc -
    <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__' at end of input

    echo x | LC_CTYPE=C.UTF-8 gcc -xc -
    <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ at end of input

As an additional twist, this different behaviour does not require fancy input: the input is plain ASCII in both cases. Output parsers that expect "'", e.g. to produce a set of recommendations on how to quick-fix such problems in an IDE, will break.

2.2) https://lists.qt-project.org/pipermail/development/2019-November/037810.html

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
            abort();
    }

runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8". While contrived in this form, there _is_
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
> On 15. Nov 2019, at 08:20, Thiago Macieira wrote: > > On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote: >> *Within* a Qt application consisting of Qt library, other libraries, >> and actual user code it's mildly presumptous for one library to impose >> random unnecessay restrictions on user code and other libraries. > > That boat sailed 20 years ago when we started calling setlocale() from > QCoreapplication. We set the locale, period. > > The questions are: > 1) do we want to prevent another library from accidentally unsetting it? > 2) do we want child processes to use the same? > > Note the answers for both questions must be the same, for the solution is the > same. So either both yeses or both nos. > >> Making assumptions on the controlability of content of a input stream is >> questionable. The proposed method of changing the environment for child >> processes is no guarantee on what the child actually produces, and the >> Qt application still has to be prepared to handle non-Utf-8 or otherwise >> "broken" input. > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can > either deal with binary data or with UTF-8 text, there's no middle ground. - You state that as if that were a fact imposed on us from some external entity, and as if that patch were already in. - I thought QTextCodec will still be available, even if from a separate module. If that plan has changed, provide a patch for Qt Creator as well. > >> This discussion so far claimed the existance of a range of problems >> without giving an actual example. Then it goes on to propose a shotgut >> approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings" >> like categories that are a bit more fine grained than LC_ALL? Bear with >> me when I do not have the impression that Qt will be the right context >> to accept such "obligations". 
> > The same argument can be made for your statements: you're arguing that here > are broken applications that won't handle C.UTF-8 correctly, without giving > as > single example. > > I think the whole problem is that we're trying to talk about broken > applications and the way their brokenness manifests itself. I don't think > such > applications exist anymore in occurrence sufficient for us to deal with. > > Anyway, since you oppose setting the environment, let's just make a check for > assumption: > > if (locale is not UTF-8) >qFatal("Qt only supports UTF-8 locales. " > "Please configure your system properly"); > > -- > Thiago Macieira - thiago.macieira (AT) intel.com > Software Architect - Intel System Software Products > > > > ___ > Development mailing list > Development@qt-project.org > https://lists.qt-project.org/listinfo/development -- Eike Ziller Principal Software Engineer The Qt Company GmbH Erich-Thilo-Straße 10 D-12489 Berlin eike.zil...@qt.io http://qt.io Geschäftsführer: Mika Pälsi, Juha Varelius, Mika Harjuaho Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 144331 B ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Thursday, 14 November 2019 13:27:23 PST André Pönitz wrote: > *Within* a Qt application consisting of Qt library, other libraries, > and actual user code it's mildly presumptous for one library to impose > random unnecessay restrictions on user code and other libraries. That boat sailed 20 years ago when we started calling setlocale() from QCoreapplication. We set the locale, period. The questions are: 1) do we want to prevent another library from accidentally unsetting it? 2) do we want child processes to use the same? Note the answers for both questions must be the same, for the solution is the same. So either both yeses or both nos. > Making assumptions on the controlability of content of a input stream is > questionable. The proposed method of changing the environment for child > processes is no guarantee on what the child actually produces, and the > Qt application still has to be prepared to handle non-Utf-8 or otherwise > "broken" input. Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You can either deal with binary data or with UTF-8 text, there's no middle ground. > This discussion so far claimed the existance of a range of problems > without giving an actual example. Then it goes on to propose a shotgut > approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings" > like categories that are a bit more fine grained than LC_ALL? Bear with > me when I do not have the impression that Qt will be the right context > to accept such "obligations". The same argument can be made for your statements: you're arguing that here are broken applications that won't handle C.UTF-8 correctly, without giving as single example. I think the whole problem is that we're trying to talk about broken applications and the way their brokenness manifests itself. I don't think such applications exist anymore in occurrence sufficient for us to deal with. 
Anyway, since you oppose setting the environment, let's just make a check for assumption:

    if (locale is not UTF-8)
        qFatal("Qt only supports UTF-8 locales. "
               "Please configure your system properly");

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
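For illustration, the "locale is not UTF-8" test in the pseudocode above could be written on POSIX systems roughly as follows. This is only a sketch under stated assumptions: the helper names `codeset_is_utf8` and `current_locale_is_utf8` are hypothetical (they are not Qt API), and it assumes that `nl_langinfo(CODESET)` reports the encoding of the locale selected by `setlocale(LC_ALL, "")`.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: returns 1 if a codeset name denotes UTF-8.
 * Both spellings are accepted because systems differ ("UTF-8" vs "utf8"). */
int codeset_is_utf8(const char *codeset)
{
    if (codeset == NULL)
        return 0;
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "utf8") == 0;
}

/* Hypothetical helper: adopts the user's environment, then asks the
 * C library which character encoding that locale uses. */
int current_locale_is_utf8(void)
{
    setlocale(LC_ALL, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A caller could then print a fatal message when `current_locale_is_utf8()` returns 0. Whether to abort, warn, or silently continue is exactly the policy question debated in this thread; the sketch takes no position on it.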
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Thu, Nov 14, 2019 at 12:10:24PM +0100, Mathias Hasselmann wrote: > > Am 03.11.2019 um 06:35 schrieb André Pönitz: > > I am all for not propagating Qt's UTF-8 choice to child processes at all. > > "Write once, compile/run everywhere" mandates Qt enforcing a maximum level > of homogenity within our Qt applications. *Within* a Qt application consisting of Qt library, other libraries, and actual user code it's mildly presumptous for one library to impose random unnecessay restrictions on user code and other libraries. I am running firefox in parallel, currently with 174 shared object loaded. I don't think it will improve overall firefox user experience if the authors of said 174 library decide to impose their views on what is good code and what is bad code on the other 173 participants in the game. And even if people agreed on using UTF-8 inside an application - and I wouldn't disagree - this does not warrant changing the environment. > That extends to the input and output streams of the child processes > our applications deal with. Making assumptions on the controlability of content of a input stream is questionable. The proposed method of changing the environment for child processes is no guarantee on what the child actually produces, and the Qt application still has to be prepared to handle non-Utf-8 or otherwise "broken" input. So this is effectively snake oil. > Not propagating Qt's UTF-8 choices seeems like a violation of that > principle of maximum homogenity. Which you just invented. Apart from that we just broke "homogenity", as now child processes started from a Qt application behave differently then when started otherwise (see the gcc quotes example with different results on pure 7-bit input) > Hiding the complexity of obscure locale settings truely > belongs to the hearth of Qt's obligations in my opinion. This discussion so far claimed the existance of a range of problems without giving an actual example. 
Then it goes on to propose a shotgun approach (LC_ALL, "ALL") to handle ... what? "Obscure locale settings" like categories that are a bit more fine-grained than LC_ALL? Bear with me when I do not have the impression that Qt will be the right context to accept such "obligations".

Andre'

PS: Just seen: https://wiki.debian.org/Locale:

    Warning! Using LC_ALL is strongly discouraged as it overrides everything. Please use it only when testing and never set it in a startup file.
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On 03.11.2019 at 06:35, André Pönitz wrote:
> On Sat, Nov 02, 2019 at 06:16:36PM +0100, Kevin Kofler wrote:
> > A true runtime option actually belongs in an environment variable, not in
> > a method that has to be called by the compiled code. (In fact, that's what
> > I would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but
> > apparently you were thinking of a preprocessor define.)
> >
> > Whether to propagate the locale to child processes is really a decision
> > that can and should be left to the user at runtime rather than compiling
> > it either into the application (as in André's proposal) or even into Qt
> > itself (as in your proposal).
>
> I am all for not propagating Qt's UTF-8 choice to child processes at all.

"Write once, compile/run everywhere" mandates Qt enforcing a maximum level of homogeneity within our Qt applications. That extends to the input and output streams of the child processes our applications deal with. Not propagating Qt's UTF-8 choices seems like a violation of that principle of maximum homogeneity.

Hiding the complexity of obscure locale settings truly belongs to the heart of Qt's obligations in my opinion.

Ciao
Mathias
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 10:55:03 PST Thiago Macieira wrote:
> I'll do a full search on Clear Linux to see if there's any software that
> checks the return value of setlocale().

All "setlocale" calls.

First, the calls passed to strcmp: I found comparisons in gnulib and replacements for setlocale, which don't count (they're replacements for old systems Qt no longer [has never?] runs on). That left a couple of examples of exactly what you predicted:

glfw-3.3/src/x11_init.c:if (strcmp(setlocale(LC_CTYPE, NULL), "C") == 0)
https://github.com/glfw/glfw/blob/master/src/x11_init.c#L934-L942
    hack around C not supporting wide-char, which wouldn't be needed if we set the environment

firefox-60.1.0/xpcom/build/XPCOMInit.cpp: if (strcmp(setlocale(LC_ALL, nullptr), "C") == 0) {
https://searchfox.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#337
    the next line does setlocale(LC_ALL, "")

wxWidgets-3.1.2/src/common/intl.cpp:wxASSERT_MSG( strcmp(setlocale(LC_ALL, NULL), "C") == 0,
https://github.com/wxWidgets/wxWidgets/blob/master/src/common/intl.cpp#L1694
    Appears to be Windows-specific.

The assignments are much more numerous (1700 of them in my listing). A lot of them are of the form:

    old_locale = setlocale(LC_xxx, NULL);

which I assume is later followed up by a setlocale(LC_xxx, old_locale). These cases are not relevant to us.

https://github.com/GNUAspell/aspell/blob/master/common/config.cpp#L549-L561
    Needs to find the locale to know what language to apply spelling for and also how to decode the text. UTF-8 is supported.

http://git.savannah.gnu.org/cgit/bash.git/tree/locale.c
    Aside from the check *for* UTF-8 in LC_CTYPE, the assignments are only checking for null pointers.

http://git.savannah.gnu.org/cgit/bison.git/tree/src/getargs.c#n446
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/system.h
    Not relevant for us.

https://github.com/BOINC/boinc/blob/master/zip/zip/zip.c#L2214
    Null check only, and checks for UTF-8

https://github.com/BOINC/boinc/blob/master/zip/unzip/unzip.c#L773
    Not relevant, in #else for nl_langinfo

https://github.com/microsoft/cpprestsdk/blob/master/Release/src/utilities/asyncrt_utils.cpp
    Win32 only

https://github.com/apple/cups/blob/master/cups/language.c
    Handles UTF-8 just fine.

https://github.com/apple/cups/blob/master/cups/langprintf.c
    Forces .UTF-8.

https://github.com/doxygen/doxygen/blob/master/qtools/qtextcodec.cpp#L508-L529
    Trying to guess what QTextCodec to use for ru_RU.

https://git.enlightenment.org/core/efl.git/tree/src/modules/ecore_imf/xim/ecore_imf_xim.c#n832
    Null check only. The rest of EFL is save/restore.

http://git.savannah.gnu.org/cgit/emacs.git/tree/src/sysdep.c#n4049
    Null check only.

http://git.savannah.gnu.org/cgit/emacs.git/tree/src/sysdep.c#n4049
    COULD mistake, as it does strcmp(locale, "C") then locale = "en"

https://github.com/GNOME/evince/blob/mainline/cut-n-paste/synctex/synctex_parser.c#L4384-L4399
    Save/restore.

https://github.com/GNOME/evolution-data-server/blob/mainline/src/camel/camel-iconv.c#L218
    Does compare to "C", but not a problem since the failing case uses nl_langinfo

https://github.com/GNOME/evolution-data-server/blob/mainline/src/addressbook/libedata-book/e-book-sqlite.c#L2891
    Doesn't seem to be a problem.

https://github.com/GNOME/evolution/blob/mainline/src/e-util/e-xml-utils.c#L66
    Just getting defaults.

https://github.com/fish-shell/fish-shell/blob/3.0.2/src/env.cpp#L373-L396
    Comparing old to new. And no longer present in master.

https://github.com/fltk/fltk/blob/master/src/Fl_Native_File_Chooser_GTK.cxx#L445-L458
    Save/restore, not thread-safe.

https://github.com/zenotech/fox-toolkit/blob/master/src/FXTranslator.cpp#L84
    Commented out.

http://git.savannah.gnu.org/cgit/gawk.git/tree/support/dfa.c#n988
    Not a problem, just checking if the locale is ASCII-compatible.

binutils-gdb/blob/master/readline/readline/nls.c
    Seems fine too.

https://github.com/geany/geany/blob/master/src/libmain.c#L980-L987
    Only used in debug output

https://github.com/fangq/gftp/blob/master/lib/protocols.c#L382-L395
    Null-pointer check & logging

https://github.com/GNOME/glib/blob/mainline/glib/guniprop.c#L724
    Safe

https://github.com/GNOME/glib/blob/mainline/glib/gtranslit.c#L293
    Seems to be fine

https://github.com/GNOME/glib/blob/mainline/glib/gdate.c#L1057-L1065
    Checking cached results

I'm stopping here.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
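The save/restore pattern that makes up most of the hits above (`old_locale = setlocale(LC_xxx, NULL)` followed later by `setlocale(LC_xxx, old_locale)`) can be sketched as below. The function names `save_locale` and `restore_locale` are hypothetical and do not come from any of the surveyed projects. The copy matters because the string returned by `setlocale()` may be overwritten by a later call.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: query (not change) the current locale for one
 * category and return a heap copy of its name, or NULL on failure. */
char *save_locale(int category)
{
    const char *current = setlocale(category, NULL); /* NULL = query only */
    if (current == NULL)
        return NULL;
    char *copy = malloc(strlen(current) + 1);
    if (copy != NULL)
        strcpy(copy, current);
    return copy;
}

/* Hypothetical helper: reinstate a locale saved by save_locale()
 * and release the copy. */
void restore_locale(int category, char *saved)
{
    if (saved != NULL) {
        setlocale(category, saved);
        free(saved);
    }
}
```

Code of this shape only compares or reinstates whatever name it saw earlier, which is why such call sites are unaffected by whether that name is "C" or "C.UTF-8". As one entry in the listing notes, the pattern is not thread-safe, because `setlocale()` changes process-wide state.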
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 11:50:01 PST André Pönitz wrote:
> On Mon, Nov 04, 2019 at 11:38:07AM -0800, Thiago Macieira wrote:
> > On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> > > A parser accepting the output of one might or might not be able to
> > > handle the second.
> >
> > A driver would set LC_ALL in the environment when it calls gcc.
>
> Can we please take a step back and repeat for the slow thinker^H^H me
> what the benefit of forcing a UTF-8 locale on unknown child processes
> would be?

Two-fold:
1) it forces the UTF-8 locale on the *current* process, in case some other part of the same process does setlocale(LC_ALL, "") after QCoreApplication
2) it forces the child process to use the same locale as the parent Qt application

Since Qt will force itself to UTF-8, then we want the Qt application to interpret
    "Arquivo ou diretório inexistente"
instead of
    "Arquivo ou diret�rio inexistente"

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
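The garbled "diret�rio" is what a UTF-8 decoder shows when the child wrote the "ó" as the single Latin-1 byte 0xF3. Since, as argued elsewhere in this thread, setting the child's environment does not guarantee what the child emits, a reader of child output may still want to detect such input. The following is a minimal sketch of a UTF-8 validity check; the function name `is_valid_utf8` is made up for this example and the code is not taken from Qt.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first len bytes of data are
 * well-formed UTF-8 (no stray or missing continuation bytes, no overlong
 * forms, no surrogates, nothing above U+10FFFF), else 0. */
int is_valid_utf8(const char *data, size_t len)
{
    const unsigned char *s = (const unsigned char *)data;
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* code point, smallest legal value */
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */
        if (len - i <= need)
            return 0;           /* sequence is cut off at the end */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

With this check, the UTF-8 bytes of "diretório" pass, while the Latin-1 encoding of the same word is rejected because 0xF3 announces a four-byte sequence that never follows.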
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, Nov 04, 2019 at 11:38:07AM -0800, Thiago Macieira wrote:
> On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> > A parser accepting the output of one might or might not be able to
> > handle the second.
>
> A driver would set LC_ALL in the environment when it calls gcc.

Can we please take a step back and repeat for the slow thinker^H^H me what the benefit of forcing a UTF-8 locale on unknown child processes would be?

Andre'
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, Nov 04, 2019 at 10:55:03AM -0800, Thiago Macieira wrote: > On Monday, 4 November 2019 10:29:16 PST André Pönitz wrote: > > All but one do not let the UI user change the environment, i.e. the > > environment is passed through the Qt UI process (so far). The one is > > Qt Creator, but even there it is not possible to configure all child > > processes, and would not be tolerable to tell users "When you create a > > new run configuration remember to undo spurious environment changes done > > by Qt". > > It's highly unlikely you're running Qt Creator in a non-UTF-8 environment in > the first place. *shrug* > locale | grep -q '=C$' && echo oops oops > KDE has not supported such locales for 15 years. I haven't tried to run KDE in earnest for about the same time. > If we were in 2004-2006 when this was recent and other Unix environments like > Solaris and HP-UXi where non-UTF-8 could be still in use I could understand > the skepticism. > > > > > There _are_ setups that _are_ set in stone, that are not connected > > to anything and that don't give anything on updates, or do not even > > have the possibility to be "fixed" or changed in any way. > > Why are you inserting Qt 6 into them, then? Because data generation and data visualization are different tasks, that can, and perhaps should, be done in different processes, and while data visualization occasionally might need to react to user demand, data generation might not. > > Looks contrieved? [Check your hard disk before you answer.] > > I'll do a full search on Clear Linux to see if there's any software that > checks the return value of setlocale(). > > > Potentially harmful behaviour should always be opt-in, not opt-out > > (and never be non-configurable). > > I don't disagree on the statement. I just disagree on whether it's harmful. > *Not* calling qputenv could be harmful too. 
As mentioned in the second example, even "clean ASCII" 7-bit input produces different results under "C.UTF-8" and "C":

    echo x | LC_ALL=C.UTF-8 gcc -xc -
    echo x | LC_ALL=C gcc -xc -

Given that most parsers in the world are ad-hoc, chances are high that some are based on looking for certain quotes, but not for others.

And even if someone knows that the immediate child processes are ok with C.UTF-8, their children, grandchildren, ... might not be.

Andre'
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 11:18:12 PST André Pönitz wrote:
> A parser accepting the output of one might or might not be able to
> handle the second.

A driver would set LC_ALL in the environment when it calls gcc.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, Nov 04, 2019 at 09:40:00AM +, Edward Welbourne wrote:
> Indeed, what program would have problems in C.UTF-8 yet have a
> non-Unicode locale in which it works nicely ?

Other example:

    echo x | LC_ALL="C.UTF-8" gcc -xc -

and

    echo x | LC_ALL="C" gcc -xc -

produce different output.

A parser accepting the output of one might or might not be able to handle the second.

Andre'
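One way a consumer of compiler output could tolerate both forms is to fold the typographic quotes that gcc prints under a UTF-8 locale (U+2018 and U+2019) back to the ASCII apostrophe before parsing. This is only a sketch of that idea, not something proposed in the thread; the function name `normalize_quotes` is hypothetical, and it assumes the input buffer holds NUL-terminated UTF-8.

```c
#include <string.h>

/* Hypothetical helper: rewrite the buffer in place, replacing the UTF-8
 * encodings of U+2018 (E2 80 98) and U+2019 (E2 80 99) with '\''.
 * Returns the new length; the result is never longer than the input. */
size_t normalize_quotes(char *s)
{
    const unsigned char *in = (const unsigned char *)s;
    char *out = s;
    while (*in != '\0') {
        if (in[0] == 0xE2 && in[1] == 0x80 && (in[2] == 0x98 || in[2] == 0x99)) {
            *out++ = '\'';
            in += 3;            /* skip the three-byte quote */
        } else {
            *out++ = (char)*in++;
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

This only helps parsers that somebody is able to change. The concern raised above is precisely about ad-hoc parsers in existing tools that nobody is in a position to adapt.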
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 10:29:16 PST André Pönitz wrote: > All but one do not let the UI user change the environment, i.e. the > environment is passed through the Qt UI process (so far). The one is > Qt Creator, but even there it is not possible to configure all child > processes, and would not be tolerable to tell users "When you create a > new run configuration remember to undo spurious environment changes done > by Qt". It's highly unlikely you're running Qt Creator in a non-UTF-8 environment in the first place. KDE has not supported such locales for 15 years. If we were in 2004-2006 when this was recent and other Unix environments like Solaris and HP-UXi where non-UTF-8 could be still in use I could understand the skepticism. > > There _are_ setups that _are_ set in stone, that are not connected > to anything and that don't give anything on updates, or do not even > have the possibility to be "fixed" or changed in any way. Why are you inserting Qt 6 into them, then? > int main() > { > if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0) > abort(); > } > > Looks contrieved? [Check your hard disk before you answer.] I'll do a full search on Clear Linux to see if there's any software that checks the return value of setlocale(). > Potentially harmful behaviour should always be opt-in, not opt-out > (and never be non-configurable). I don't disagree on the statement. I just disagree on whether it's harmful. *Not* calling qputenv could be harmful too. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 09:29:41 PST Edward Welbourne wrote:
> On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
> > I want to do qputenv in the Qt application *itself*, inside
> > QCoreApplication. Note the most important process that this will apply
> > to: itself. It applies to all other frameworks inside the same
> > application that may inspect the environment, including an extra unknown
> > call to setlocale(LC_ALL, "").
> ... and we can do that just fine if we
> * record the prior value we're over-riding on some master object,
>   that also remembers the list of regexes;
> * call qputenv() exactly as you have in mind;
> * when about to start a sub-process, ask that master object if the
>   command name matches one of its regexes;
> * if it does, restore *for only it* (e.g. after fork()) the prior value.

That only applies to QProcess. It will not apply to third-party components that fork helper processes.

It's possible atfork() could do this, but I'm not sure. It won't catch all of them, especially those that prepare the environment before forking (like execve / execle's caller).

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Mon, Nov 04, 2019 at 09:40:00AM +, Edward Welbourne wrote: > On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote: > >> a) and b) are fine with me, "c) yes" sounds like a potential problem. > >> > >> Most of the child process I usually call are not Qt based, > > That shouldn't matter. Qt<6-based things and non-Qt things are all the > same from the point of view of the contemplated change. > > To what extent are these child programs started via a UI that lets the > user set environment variables (as I assume all IDEs do for most of the > commands they run) ? All but one do not let the UI user change the environment, i.e. the environment is passed through the Qt UI process (so far). The one is Qt Creator, but even there it is not possible to configure all child processes, and would not be tolerable to tell users "When you create a new run configuration remember to undo spurious environment changes done by Qt". > Obviously, if some antique needs a special locale, that's no problem > if it's started via a UI that lets one configure its environment, > overriding what Qt might have set. Even _if_ that UI would let the user configure the environment, that's not an excuse. > >> rather some random unrelated tools, in some cases even quite old > >> random unrelated tools. > > I read antiquity as tending to assume C locale, so unharmed by C.UTF-8, > although some may be assuming an ISO Latin or similar legacy codec. > All the same, so antique as to not grok Unicode at all is pretty old ! > You probably need to update it for security fixes, by now. "Security reason, because it is old" must be Godwin's Law in "Always Online" times. There _are_ setups that _are_ set in stone, that are not connected to anything and that don't give anything on updates, or do not even have the possibility to be "fixed" or changed in any way. 
If Qt development does not want to care for these cases _even as child processes_ that's fine in principle (even with me), but then it would help to clearly communicate that fact to prevent accidents in the selection of toolkits.

> Thiago Macieira (1 November 2019 22:49)
> > TBH, all the more reason for propagating the choice. Please remember
> > that on any modern Linux or macOS or FreeBSD, they are already running
> > with a UTF-8 locale. The most common scenario of our setting something
> > is when LC_ALL=C was set in the environment, which will cause us to
> > reset it to C.UTF-8.
>
> Indeed, what program would have problems in C.UTF-8 yet have a
> non-Unicode locale in which it works nicely ?
> An example would help us to reason about this ...

The following works on all my setups (and, btw, with LC_ALL="C" which I do _not_ use) and crashes with LC_ALL="C.UTF-8":

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
            abort();
    }

Looks contrived? [Check your hard disk before you answer.]

Shotgun-changing environment for child processes is _not_ harmless.

> ones, would not make the same choices. If we do not propagate, we
> could end up with a child process (often helpers) that make
> different choices as to what command-line arguments or pipes or
> contents in files mean.
>
> >> If we propagate we'll expose the child processes to locales they
> >> might not expect, in circumstances where the user of the system
> >> possibly intentionally chose a non-UTF8-locale to make exactly those
> >> child processes happy.
>
> > True, but that was done at the expense of running Qt in a largely
> > unsupported and untested scenario. Setting the locale to C means we
> > can't access any file with an 8bit file name; setting to Latin1 would
> > allow that, but produce mojibake in GUI.
>
> >> Effectively, going for "c) yes" deprives the user of a certain level
> >> of freedom that is needed, "c) no" is less intrusive.
> >> > >> "c) no" as default and a simple one-liner opt-in for applications > >> that want to engage in "strict parenting" might be an option, too. > > > How about making the resetting opt-out, instead of opt-in? > > QT_NO_OVERRIDE_LC_CTYPE? > > Possibly its value could be: > * all, 1, yes, true, .* - it applies to all child processes [*]; or > * a list of regexes for program names to which it applies, when started > as child processes. The syntax doesn't really matter, but the direction "opt-out" is wrong. Potentially harmful behaviour should always be opt-in, not opt-out (and never be non-configurable). Andre' ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
>> * a list of regexes for program names to which it applies, when started
>>   as child processes.
>>
>> Or is that too hard to implement at all the places where we call exec()
>> and its equivalents ?

Thiago Macieira (4 November 2019 15:48)
> That's not at all what I wanted.
>
> I want to do qputenv in the Qt application *itself*, inside QCoreApplication.
> Note the most important process that this will apply to: itself. It applies to
> all other frameworks inside the same application that may inspect the
> environment, including an extra unknown call to setlocale(LC_ALL, "").

... and we can do that just fine if we
* record the prior value we're over-riding on some master object, that also remembers the list of regexes;
* call qputenv() exactly as you have in mind;
* when about to start a sub-process, ask that master object if the command name matches one of its regexes;
* if it does, restore *for only it* (e.g. after fork()) the prior value.

The default for everything else is then to see an environment with our "correction" applied to the locale env var(s).

	Eddy.
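The decision step in the list above (ask a "master object" whether a command matches one of the regexes, and if so hand it the prior value) might look roughly like this with POSIX regular expressions. Everything here is hypothetical: `struct locale_override` and `locale_for_child` are names invented for the sketch, no such object exists in Qt, and the sketch only chooses a value. It does not fork or modify any environment.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical "master object": remembers what was overridden and
 * which commands should still see the prior value. */
struct locale_override {
    const char *prior;            /* value before the override (may be NULL) */
    const char *forced;           /* value set for everyone else, e.g. "C.UTF-8" */
    const char *const *patterns;  /* POSIX extended regexes on command names */
    size_t npatterns;
};

/* Returns the locale value a child running `command` should see:
 * the prior value if any pattern matches, the forced one otherwise. */
const char *locale_for_child(const struct locale_override *o, const char *command)
{
    for (size_t i = 0; i < o->npatterns; i++) {
        regex_t re;
        if (regcomp(&re, o->patterns[i], REG_EXTENDED | REG_NOSUB) != 0)
            continue;             /* skip patterns that fail to compile */
        int matched = regexec(&re, command, 0, NULL, 0) == 0;
        regfree(&re);
        if (matched)
            return o->prior;
    }
    return o->forced;
}
```

A real implementation would presumably compile the patterns once rather than on every lookup. As the replies in this thread point out, a hook of this kind can only cover processes started through code that consults it, not helpers forked by third-party components.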
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Monday, 4 November 2019 01:40:00 PST Edward Welbourne wrote:
> * a list of regexes for program names to which it applies, when started
>   as child processes.
>
> Or is that too hard to implement at all the places where we call exec()
> and its equivalents ?

That's not at all what I wanted.

I want to do qputenv in the Qt application *itself*, inside QCoreApplication. Note the most important process that this will apply to: itself. It applies to all other frameworks inside the same application that may inspect the environment, including an extra unknown call to setlocale(LC_ALL, "").

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Edward Welbourne wrote:
> [*] I'm fairly sure the actual Unix programs yes and true don't care
> about locale, so treating them meaning as .* would be harmless ...

GNU yes takes an optional string that it repeats instead of "y", so it does at least some string processing. I am not sure how it reacts if the characters are outside of the locale's character set.

In addition, both GNU yes and GNU true have --help and --version options that print translated strings.

        Kevin Kofler
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
>>> "c) no" as default and a simple one-liner opt-in for applications that
>>> want to engage in "strict parenting" might be an option, too.

On Fri, Nov 01, 2019 at 02:49:36PM -0700, Thiago Macieira wrote:
>> How about making the resetting opt-out, instead of opt-in?
>> QT_NO_OVERRIDE_LC_CTYPE?

André Pönitz (2 November 2019 12:53)
> I was more thinking of a runtime option. Like
>
>     QCoreApplication::setPropagateOurChoices(true)
>
> Or do I miss something why this has to be a compile time choice?

I interpreted Thiago as suggesting an environment variable to be inspected at run-time, not a compile-time option. Would an environment variable work for you ?

	Eddy.
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago: My personal preference is: a) C.UTF-8 b) override it to force UTF-8 on the same locale c) yes Lars: >>> I agree with all three choices. On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote: >> a) and b) are fine with me, "c) yes" sounds like a potential problem. >> >> Most of the child process I usually call are not Qt based, That shouldn't matter. Qt<6-based things and non-Qt things are all the same from the point of view of the contemplated change. To what extent are these child programs started via a UI that lets the user set environment variables (as I assume all IDEs do for most of the commands they run) ? Obviously, if some antique needs a special locale, that's no problem if it's started via a UI that lets one configure its environment, overriding what Qt might have set. >> rather some random unrelated tools, in some cases even quite old >> random unrelated tools. I read antiquity as tending to assume C locale, so unharmed by C.UTF-8, although some may be assuming an ISO Latin or similar legacy codec. All the same, so antique as to not grok Unicode at all is pretty old ! You probably need to update it for security fixes, by now. Thiago Macieira (1 November 2019 22:49) > TBH, all the more reason for propagating the choice. Please remember > that on any modern Linux or macOS or FreeBSD, they are already running > with a UTF-8 locale. The most common scenario of our setting something > is when LC_ALL=C was set in the environment, which will cause us to > reset it to C.UTF-8. Indeed, what program would have problems in C.UTF-8 yet have a non-Unicode locale in which it works nicely ? An example would help us to reason about this ... ones, would not make the same choices. If we do not propagate, we could end up with a child process (often helpers) that make different choices as to what command-line arguments or pipes or contents in files mean. 
>> If we propagate we'll expose the child processes to locales they >> might not expect, in circumstances where the user of the system >> possibly intentionally chose a non-UTF8-locale to make exactly those >> child processes happy. > True, but that was done at the expense of running Qt in a largely > unsupported and untested scenario. Setting the locale to C means we > can't access any file with an 8bit file name; setting to Latin1 would > allow that, but produce mojibake in GUI. >> Effectively, going for "c) yes" deprives the user of a certain level >> of freedom that is needed, "c) no" is less intrusive. >> >> "c) no" as default and a simple one-liner opt-in for applications >> that want to engage in "strict parenting" might be an option, too. > How about making the resetting opt-out, instead of opt-in? > QT_NO_OVERRIDE_LC_CTYPE? Possibly its value could be: * all, 1, yes, true, .* - it applies to all child processes [*]; or * a list of regexes for program names to which it applies, when started as child processes. Or is that too hard to implement at all the places where we call exec() and its equivalents ? [*] I'm fairly sure the actual Unix programs yes and true don't care about locale, so treating them meaning as .* would be harmless ... Eddy. ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
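For what it's worth, the value syntax floated above for QT_NO_OVERRIDE_LC_CTYPE is cheap to evaluate. The sketch below is only an illustration: the function name optOutApplies is made up, and the ':' separator for the regex list is my assumption, since no separator was proposed.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical: does the opt-out described by `value` cover `program`?
// Blanket values apply to every child; otherwise `value` is read as a
// ':'-separated list of regexes (the separator is an assumption).
bool optOutApplies(const std::string &value, const std::string &program)
{
    if (value.empty())
        return false;

    // Note that programs literally named "yes" or "true" cannot be singled
    // out, because those spellings are taken as blanket values.
    for (const char *blanket : {"all", "1", "yes", "true", ".*"}) {
        if (value == blanket)
            return true;
    }

    std::istringstream list(value);
    std::string pattern;
    while (std::getline(list, pattern, ':')) {
        if (pattern.empty())
            continue;
        try {
            if (std::regex_match(program, std::regex(pattern)))
                return true;
        } catch (const std::regex_error &) {
            // Skip malformed patterns rather than failing the launch.
        }
    }
    return false;
}
```

The matching itself is the easy part; the cost is in calling something like this at every place that ends up in exec().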
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Saturday, 2 November 2019 22:35:00 PST André Pönitz wrote:
> Compiled opt-in per-application at least shifts the blame from Qt to the
> application vendor, compiled opt-in per-process environment leaves the blame
> still with the application vendor, but actually provides the possibility to
> do the right thing when it is known that the child actually _needs_ it.

When the parent process written in Qt knows that the child needs it, they must
already be using QProcessEnvironment. So when the user needed to do it, it's a
bug in the Qt application.

There are two scenarios:
1) when the child process needs en_US or C, because it was printing messages
   in another language, or far more commonly, it was using thousands and
   decimal separators other than those of English
2) when the child process needs a non-UTF-8 locale because it was confused by
   UTF-8 multibyteness or was using that to print “fancy quotes”

Case (1) is not a problem if we override the environment to en_US.UTF-8 or
C.UTF-8. Your scenario is restricted to case (2).

Do note that forcing the environment today, in Qt 5, has implications for the
Qt application itself. It's just wrong to do so and I think the number of
people doing that is fairly small. With this proposal, in Qt 6, the Qt
application would run correctly.

But that means that the overriding you're asking for is unlikely to exist
*today*. So if we're talking about the future, why is using an environment
variable to suppress Qt's override not sufficient?

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Sat, Nov 02, 2019 at 06:16:36PM +0100, Kevin Kofler wrote:
> A true runtime option actually belongs in an environment variable, not in a
> method that has to be called by the compiled code. (In fact, that's what I
> would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but
> apparently you were thinking of a preprocessor define.)
>
> Whether to propagate the locale to child processes is really a decision that
> can and should be left to the user at runtime rather than compiling it
> either into the application (as in André's proposal) or even into Qt itself
> (as in your proposal).

I am all for not propagating Qt's UTF-8 choice to child processes at all.

Having that as opt-in on some level was an attempt to appease people who
think that's a good idea.

A configure option for Qt itself does not help as it keeps the question open
what the default setup will be. And given the circumstances that would be
"propagation".

Compiled opt-in per-application at least shifts the blame from Qt to the
application vendor, compiled opt-in per-process environment leaves the blame
still with the application vendor, but actually provides the possibility to
do the right thing when it is known that the child actually _needs_ it.
On the other hand, in those circumstances, this can already be done now
by normal fiddling with the child process environment.

Andre'
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Thiago Macieira wrote:
> Is your shell configured for German or for English? Try setting your
> locale to German and then see how long it will take for you to have to
> override when posting a question or an answer.

Unlike you, to get messages in English for human reading, I have been using
en_US.UTF-8 rather than C for years (long before C.UTF-8 became a thing)
exactly because of that.

> Except for the LC_ALL=C case for overriding the user's locale so that one
> can get messages and formatting in machine-parseable format. The normal
> case and this one probably account for over 99% of all scenarios.

For machine readability, there is probably a reason for picking C rather than
en_US.UTF-8 or even C.UTF-8, e.g., to get ASCII quotes rather than the fancy
Unicode quotes used under en_US.UTF-8.

>> > How about making the resetting opt-out, instead of opt-in?
>> > QT_NO_OVERRIDE_LC_CTYPE?
>>
>> I was more thinking of a runtime option. Like
>>
>>    QCoreApplication::setPropagateOurChoices(true)
>
> I think a runtime option like that belongs in QProcessEnvironment.

A true runtime option actually belongs in an environment variable, not in a
method that has to be called by the compiled code. (In fact, that's what I
would have expected your proposed QT_NO_OVERRIDE_LC_CTYPE to be, but
apparently you were thinking of a preprocessor define.)

Whether to propagate the locale to child processes is really a decision that
can and should be left to the user at runtime rather than compiling it
either into the application (as in André's proposal) or even into Qt itself
(as in your proposal).

        Kevin Kofler
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Saturday, 2 November 2019 04:53:10 PDT André Pönitz wrote: > > TBH, all the more reason for propagating the choice. Please remember that > > on any modern Linux or macOS or FreeBSD, they are already running with a > > UTF-8 locale. > > With that argument we wouldn't even need to change the locale for the > actual Qt application. > > I think we are currently discussing the rare case where the Qt application > is started with a non-UTF-8 locale, and the main question is whether this > was some kind of accident that the Qt application should correct for their > child processes or whether this was intentional. Right. And the conclusion so far is that it is a mistake. > As you said, any modern Linux or macOS or FreeBSD default to UTF-8, so > chances are high that any deviation from that is actually intentionally. Except for the LC_ALL=C case for overriding the user's locale so that one can get messages and formatting in machine-parseable format. The normal case and this one probably account for over 99% of all scenarios. > > The most common scenario of our setting something is when LC_ALL=C was > > set in the environment, which will cause us to reset it to C.UTF-8. > > I understand that, and even though I am not aware of an actual problem for > my personal uses I am a bit reluctant to expose unsuspecting processes > to a variable-lengths encoding they may not be aware of. At least there's > a potential for buffer overruns here. Is your shell configured for German or for English? Try setting your locale to German and then see how long it will take for you to have to override when posting a question or an answer. 
$ ls á
ls: cannot access 'á': Arquivo ou diretório inexistente
$ gcc -xc /dev/null
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: /usr/lib64/gcc/x86_64-suse-linux/9/../../../../lib64/crt1.o: na função "_start":
/home/abuild/rpmbuild/BUILD/glibc-2.30/csu/../sysdeps/x86_64/start.S:104: referência não definida para "main"
collect2: error: ld returned 1 exit status
$ gcc -xc /dev/null -lmain
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: não foi possível localizar -lmain
collect2: error: ld returned 1 exit status

> Also, going from "C" to "C.UTF-8" might foil code checking for the
> string "C" explicitly in a child process.

True, though it's extremely unlikely that anyone is doing that.

> > True, but that was done at the expense of running Qt in a largely
> > unsupported and untested scenario. Setting the locale to C means we can't
> > access any file with an 8bit file name; setting to Latin1 would allow
> > that, but produce mojibake in GUI.
>
> Setting to "C" also "works" in practice when blobs are just read and written
> unmodified.

Except when such a blob's file name contains a character outside of the
US-ASCII subset.

$ ./lconvert á.qm
Cannot open á.qm: No such file or directory
$ LC_ALL=C ./lconvert á.qm
Cannot open ??.qm: No such file or directory

Was this just the output or did it try to open this actual file?

$ strace -E LC_ALL=C ./lconvert á.qm |& grep -F .qm
execve("./lconvert", ["./lconvert", "\303\241.qm"], 0x55c2ef3cc7a0 /* 118 vars */) = 0
openat(AT_FDCWD, "??.qm", O_RDONLY|O_CLOEXEC) = -1 ENOENT (Arquivo ou diretório inexistente)
write(2, "Cannot open ??.qm: No such file "..., 45Cannot open ??.qm: No such file or directory

> > How about making the resetting opt-out, instead of opt-in?
> > QT_NO_OVERRIDE_LC_CTYPE?
>
> I was more thinking of a runtime option. Like
>
>    QCoreApplication::setPropagateOurChoices(true)

I think a runtime option like that belongs in QProcessEnvironment.
> Or do I miss something why this has to be a compile time choice?

Yes: whether QString::fromLocal8Bit has to support anything besides UTF-8.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
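The "??.qm" in the strace output is what one would expect from decoding the two UTF-8 bytes of "á" (\303\241) with a US-ASCII codec and then encoding the result back. A minimal illustration in plain C++ follows; the function name decodeAsAscii is made up, and I am assuming one '?' per unrepresentable byte, which matches the output shown but is not a description of Qt's actual codec code.

```cpp
#include <cassert>
#include <string>

// Illustration only: treat `bytes` as US-ASCII and substitute '?' for every
// byte above 0x7f, as a codec for the plain "C" locale might do.
std::string decodeAsAscii(const std::string &bytes)
{
    std::string out;
    out.reserve(bytes.size());
    for (unsigned char byte : bytes)
        out += byte < 0x80 ? static_cast<char>(byte) : '?';
    return out;
}
```

Once the name has been through such a round trip the original bytes are gone, so the openat() call can only ever see "??.qm".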
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Fri, Nov 01, 2019 at 02:49:36PM -0700, Thiago Macieira wrote: > On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote: > > > > My personal preference is: > > > > a) C.UTF-8 > > > > b) override it to force UTF-8 on the same locale > > > > c) yes > > > > > > I agree with all three choices. > > > > a) and b) are fine with me, "c) yes" sounds like a potential problem. > > > > Most of the child process I usually call are not Qt based, rather some > > random unrelated tools, in some cases even quite old random unrelated > > tools. > > TBH, all the more reason for propagating the choice. Please remember that on > any modern Linux or macOS or FreeBSD, they are already running with a UTF-8 > locale. With that argument we wouldn't even need to change the locale for the actual Qt application. I think we are currently discussing the rare case where the Qt application is started with a non-UTF-8 locale, and the main question is whether this was some kind of accident that the Qt application should correct for their child processes or whether this was intentional. As you said, any modern Linux or macOS or FreeBSD default to UTF-8, so chances are high that any deviation from that is actually intentionally. > The most common scenario of our setting something is when LC_ALL=C was > set in the environment, which will cause us to reset it to C.UTF-8. I understand that, and even though I am not aware of an actual problem for my personal uses I am a bit reluctant to expose unsuspecting processes to a variable-lengths encoding they may not be aware of. At least there's a potential for buffer overruns here. Also, going from "C" to "C.UTF-8" might foil code checking for the string "C" explicitly in a child process. > > > > ones, would not make the same choices. If we do not propagate, we could > > > > end up with a child process (often helpers) that make different choices > > > > as to what command-line arguments or pipes or contents in files mean. 
> > > > If we propagate we'll expose the child processes to locales they might not > > expect, in circumstances where the user of the system possibly intentionally > > chose a non-UTF8-locale to make exactly those child processes happy. > > True, but that was done at the expense of running Qt in a largely unsupported > and untested scenario. Setting the locale to C means we can't access any file > with an 8bit file name; setting to Latin1 would allow that, but produce > mojibake in GUI. Setting to "C" also "works" in practice when blobs are just read and written unmodified. > > Effectively, going for "c) yes" deprives the user of a certain level of > > freedom that is needed, "c) no" is less intrusive. > > > > "c) no" as default and a simple one-liner opt-in for applications that > > want to engage in "strict parenting" might be an option, too. > > How about making the resetting opt-out, instead of opt-in? > QT_NO_OVERRIDE_LC_CTYPE? I was more thinking of a runtime option. Like QCoreApplication::setPropagateOurChoices(true) Or do I miss something why this has to be a compile time choice? Andre' ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Friday, 1 November 2019 12:29:19 PDT André Pönitz wrote: > > > My personal preference is: > > > a) C.UTF-8 > > > b) override it to force UTF-8 on the same locale > > > c) yes > > > > I agree with all three choices. > > a) and b) are fine with me, "c) yes" sounds like a potential problem. > > Most of the child process I usually call are not Qt based, rather some > random unrelated tools, in some cases even quite old random unrelated > tools. TBH, all the more reason for propagating the choice. Please remember that on any modern Linux or macOS or FreeBSD, they are already running with a UTF-8 locale. The most common scenario of our setting something is when LC_ALL=C was set in the environment, which will cause us to reset it to C.UTF-8. > > > ones, would not make the same choices. If we do not propagate, we could > > > end up with a child process (often helpers) that make different choices > > > as to what command-line arguments or pipes or contents in files mean. > > If we propagate we'll expose the child processes to locales they might not > expect, in circumstances where the user of the system possibly intentionally > chose a non-UTF8-locale to make exactly those child processes happy. True, but that was done at the expense of running Qt in a largely unsupported and untested scenario. Setting the locale to C means we can't access any file with an 8bit file name; setting to Latin1 would allow that, but produce mojibake in GUI. > Effectively, going for "c) yes" deprives the user of a certain level of > freedom that is needed, "c) no" is less intrusive. > > "c) no" as default and a simple one-liner opt-in for applications that > want to engage in "strict parenting" might be an option, too. How about making the resetting opt-out, instead of opt-in? QT_NO_OVERRIDE_LC_CTYPE? 
-- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
On Fri, Nov 01, 2019 at 09:21:48AM +, Lars Knoll wrote: > > There are three questions to be decided: > > a) What should Qt 6 assume the locale to be, if no locale is set? > > b) In case a non-UTF-8 locale is set, what should we do? > > c) Should we propagate our decision to child processes? > > > > My personal preference is: > > a) C.UTF-8 > > b) override it to force UTF-8 on the same locale > > c) yes > > I agree with all three choices. a) and b) are fine with me, "c) yes" sounds like a potential problem. Most of the child process I usually call are not Qt based, rather some random unrelated tools, in some cases even quite old random unrelated tools. > > ones, would not make the same choices. If we do not propagate, we could end > > up > > with a child process (often helpers) that make different choices as to what > > command-line arguments or pipes or contents in files mean. If we propagate we'll expose the child processes to locales they might not expect, in circumstances where the user of the system possibly intentionally chose a non-UTF8-locale to make exactly those child processes happy. Effectively, going for "c) yes" deprives the user of a certain level of freedom that is needed, "c) no" is less intrusive. "c) no" as default and a simple one-liner opt-in for applications that want to engage in "strict parenting" might be an option, too. Andre' ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Lars Knoll (1 November 2019 10:21)
> Thanks for the comprehensive mail.

Seconded :-)
By effectively joining forces with Python to deprecate ASCII in favour
of UTF-8, perhaps we can even put some pressure on POSIX to board the
Unicode train.

> For your bonus (d) below, I’d say we should print a warning if we
> encounter a non UTF-8 locale other than C.

On 31 Oct 2019, at 22:11, Thiago Macieira wrote:
>> Bonus d) should we print a warning when we've made a change?
>>
>> Options are:
>> - yes, for all of them
>> - yes, but only for locales other than "C"
>> - no

I note that [PEP 538] says (on the "C is C-UTF8" part of the subject
matter), under Implementation Notes:

  Attempting to implement the PEP as originally accepted showed that the
  proposal to emit locale coercion and compatibility warnings by default
  simply wasn't practical (there were too many cases where previously
  working code failed because of the warnings, rather than because of
  latent locale handling defects in the affected code).

* [PEP 538] https://www.python.org/dev/peps/pep-0538/

So I cast another vote for
>> - yes, but only for locales other than "C"

        Eddy.
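As a rough sketch of the "warn only for locales other than C" option: all three function names below are hypothetical, and the parsing assumes the usual language[_territory][.codeset][@modifier] shape of locale names. It stays silent for C, POSIX and anything already UTF-8, and asks for a warning otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Extracts the codeset from "language[_territory][.codeset][@modifier]".
// Returns an empty string when the name carries no codeset.
std::string codesetOf(const std::string &locale)
{
    const auto dot = locale.find('.');
    if (dot == std::string::npos)
        return {};
    const auto at = locale.find('@', dot);
    return locale.substr(dot + 1,
                         at == std::string::npos ? at : at - dot - 1);
}

// Accepts the common spellings "UTF-8", "utf8", "utf-8", ...
bool isUtf8Codeset(std::string codeset)
{
    codeset.erase(std::remove(codeset.begin(), codeset.end(), '-'),
                  codeset.end());
    std::transform(codeset.begin(), codeset.end(), codeset.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return codeset == "utf8";
}

// Hypothetical policy: coerce C/POSIX silently, warn for any other
// non-UTF-8 locale that gets overridden.
bool shouldWarnAboutOverride(const std::string &locale)
{
    if (locale.empty() || locale == "C" || locale == "POSIX" || locale == "UTF-8")
        return false;
    return !isUtf8Codeset(codesetOf(locale));
}
```

Names without a codeset, such as fr_FR@euro, count as non-UTF-8 here, which I believe matches how such legacy locales are normally defined.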
Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Hi Thiago, Thanks for the comprehensive mail. > On 31 Oct 2019, at 22:11, Thiago Macieira wrote: > > Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move > QTextCodec support out of QtCore) > See also: https://www.python.org/dev/peps/pep-0538/ > https://www.python.org/dev/peps/pep-0540/ > > Summary: > The change above, while removing QTextCodec from our API, had the side-effect > of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be > recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix > systems on Qt 6. This does not apply to Windows because on Windows we cannot > reasonably be expected to use UTF-8 for the 8-bit encoding. I do not think we have to worry about the local 8 bit encoding on Windows anymore these days. All our interaction with the OS goes through the 16 bit APIs (ie. uses UTF-16). I don’t think file content is a huge issue neither anymore as Windows 10 seems to have added UTF-8 support to most of it’s tools. Afaik, we can also use a Unicode API for console and debug output, so the only piece that’s left might be our users interacting with legacy ANSI APIs. That should be a rare case and it should be straightforward to port that over to use the Unicode API instead. > > There are three questions to be decided: > a) What should Qt 6 assume the locale to be, if no locale is set? > b) In case a non-UTF-8 locale is set, what should we do? > c) Should we propagate our decision to child processes? > > My personal preference is: > a) C.UTF-8 > b) override it to force UTF-8 on the same locale > c) yes I agree with all three choices. For your bonus (d) below, I’d say we should print a warning if we encounter a non UTF-8 locale other than C. Cheers, Lars > > Long explanation: > > On Unix systems, traditionally, the locale is a factor of multiple > environment > variables starting with LC_ (matching macro names from ), as well > as > the LANG and LANGUAGES variables. 
If none of those is set, the C and POSIX > standards say that the default locale is "C". Moreover, POSIX says that the > "POSIX" locale is "C" and does not have multibyte encodings -- that excludes > its encoding from being UTF-8. > > Most modern Unix-based operating systems do set a reasonable, UTF8-based > locale for the user. They've been doing that for about 15 years -- it was in > 2003 that this started, when I had to switch from zsh back to bash because > zsh > didn't support UTF-8 yet, but switched back in 2005 when it gained support. > On > top of that, some even more recent Unix offerings -- namely, macOS and > Android > -- enforce that the default (or only!) locale encoding is UTF-8. > > Right now, Qt faithfully accepts the locale configuration set by the user in > the environment. It can do that because it has QTextCodec, which is also > backed by either the libiconv routines or by ICU, so it can deal with any > encoding. In properly-configured environments, there's no problem. > > The two Python documents above (PEP-538 and 540) also discuss how Python > changed its strategy. I'm proposing that we follow Python and go a little > further. > > What's the problem? > > The problem is where the locale is not set up properly or it is explicitly > overriden. See PEP-538 for examples in containers, but as can be seen from > it, > Linux will default to "POSIX" or empty, which means Qt will interpret the > locale as US-ASCII, which is almost never what is intended. Moreover, because > of our use of QString for file names, any name that contains code units above > 0x7f will be deemed a filesystem corruption and ignored on directory listing > -- they are not representable. > > Furthermore, it happens quite often that users and tools set LC_ALL to "C" in > order to obtain messages in English, so they can be parsed by other tools or > to be pasted in emails (every time you see me post an error message from a > console, I've done that). 
There are alternative locales that can be used, > like > "C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and > may not be actually available. > > Arguing that this is an incorrect setup, while factually correct, does not > change the fact that it happens. > > Questions and options: > > a) What should Qt 6 assume when the locale is unset or is just "C"? > > This is the case of a simple environment where the variables are unset or > have > some legacy system-wide defaults, as well as when the user explicitly sets > LC_ALL to "C". The options are: > - accept them as-is > - assume that C with UTF-8 support was intended > > The first option is what we have today. And if this is our option, then > neither question b or c make sense. > > The second option implies doing the check in QCoreApplication right after > setlocale(LC_ALL, ""): > if (strcmp(setlocale(LC_ALL, NULL), "C") == 0) > setlocale(LC_CTYPE, "C.UTF-8"); > > b) What should Qt 6 do i
[Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems
Re: https://codereview.qt-project.org/c/qt/qtbase/+/275152 (WIP: Move
QTextCodec support out of QtCore)
See also: https://www.python.org/dev/peps/pep-0538/
          https://www.python.org/dev/peps/pep-0540/

Summary:
The change above, while removing QTextCodec from our API, had the side-effect
of forcing the locale encoding on Unix to be only UTF-8. This RFC (to be
recorded as a QUIP) is meant to discuss how we'll deal with locales on Unix
systems on Qt 6. This does not apply to Windows because on Windows we cannot
reasonably be expected to use UTF-8 for the 8-bit encoding.

There are three questions to be decided:
a) What should Qt 6 assume the locale to be, if no locale is set?
b) In case a non-UTF-8 locale is set, what should we do?
c) Should we propagate our decision to child processes?

My personal preference is:
a) C.UTF-8
b) override it to force UTF-8 on the same locale
c) yes

Long explanation:

On Unix systems, traditionally, the locale is a factor of multiple environment
variables starting with LC_ (matching macro names from <locale.h>), as well as
the LANG and LANGUAGE variables. If none of those is set, the C and POSIX
standards say that the default locale is "C". Moreover, POSIX says that the
"POSIX" locale is "C" and does not have multibyte encodings -- that excludes
its encoding from being UTF-8.

Most modern Unix-based operating systems do set a reasonable, UTF8-based
locale for the user. They've been doing that for about 15 years -- it was in
2003 that this started, when I had to switch from zsh back to bash because zsh
didn't support UTF-8 yet, but switched back in 2005 when it gained support. On
top of that, some even more recent Unix offerings -- namely, macOS and Android
-- enforce that the default (or only!) locale encoding is UTF-8.

Right now, Qt faithfully accepts the locale configuration set by the user in
the environment. It can do that because it has QTextCodec, which is also
backed by either the libiconv routines or by ICU, so it can deal with any
encoding. In properly-configured environments, there's no problem.

The two Python documents above (PEP-538 and 540) also discuss how Python
changed its strategy. I'm proposing that we follow Python and go a little
further.

What's the problem?

The problem is where the locale is not set up properly or it is explicitly
overridden. See PEP-538 for examples in containers, but as can be seen from it,
Linux will default to "POSIX" or empty, which means Qt will interpret the
locale as US-ASCII, which is almost never what is intended. Moreover, because
of our use of QString for file names, any name that contains code units above
0x7f will be deemed a filesystem corruption and ignored on directory listing
-- they are not representable.

Furthermore, it happens quite often that users and tools set LC_ALL to "C" in
order to obtain messages in English, so they can be parsed by other tools or
to be pasted in emails (every time you see me post an error message from a
console, I've done that). There are alternative locales that can be used, like
"C.UTF-8", "C.utf8" or "UTF-8", but those depend on the operating system and
may not be actually available.

Arguing that this is an incorrect setup, while factually correct, does not
change the fact that it happens.

Questions and options:

a) What should Qt 6 assume when the locale is unset or is just "C"?

This is the case of a simple environment where the variables are unset or have
some legacy system-wide defaults, as well as when the user explicitly sets
LC_ALL to "C". The options are:
 - accept them as-is
 - assume that C with UTF-8 support was intended

The first option is what we have today. And if this is our option, then
neither question b nor c makes sense.

The second option implies doing the check in QCoreApplication right after
setlocale(LC_ALL, ""):
    if (strcmp(setlocale(LC_ALL, NULL), "C") == 0)
        setlocale(LC_CTYPE, "C.UTF-8");

b) What should Qt 6 do if a different locale, other than C, is non-UTF8?

This case is not an accident, most of the time. It can happen from time to
time that someone is simply testing different languages and forces LC_ALL to
something non-default to see what happens. They'll very quickly try the UTF-8
versions.

But when it's not an accident, it means it was intended. This is the general
state of Unix prior to 2003, when locales like "en_US", "en_GB", "fr_FR",
"pt_BR" existed, as well as the 2001-2003 Euro variants "fr_FR@euro",
"de_DE@euro", "nl_NL@euro", etc.

Options are:
 - accept them as-is (this is what Python does)
 - assume that the UTF-8 variant was intended, just not properly set

The first option is what we have today, aside from the C locale (question
(a)). However, keeping that option working implies keeping either ICU or iconv
working in Qt 6 and we might want to get rid of that dependency for codecs.

The second option implies modifying the QCoreApplication change above. Instead
of explicitly checking for