2015-10-13 8:36 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:
> For example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.

I have not tested the actual behavior in 64-bit versions of Windows: does the 64-bit version of the API still require two WM_CHAR messages, or has the character field of the message been widened so that a single message is enough? In that case no surrogates would be delivered, but the supplementary character directly. Most probably this has not changed, because the predefined Windows type for wide characters remains 16-bit (otherwise even the 32-bit version of the API could have delivered a single message with a 32-bit character field): the "Unicode" versions of the APIs assume a 16-bit encoding of strings everywhere, and the event messages most probably use the same code unit size.

The actual behavior is also tricky because the basic layouts built with MSKLC have their character data translated "transparently" to other "OEM" encodings according to the current input code page of the console (using one of the code page mapping tables installed separately): the transcoder also has to translate the 16-bit Unicode input from WM_CHAR messages into the 8-bit input stream used by the console, and that translation needs to read both surrogates before sending any output.

Also, I don't think this is specific to MSKLC drivers. A driver built with any other tool will use the same message format (not just keyboard layouts, which contain no code but only a data structure, but also input methods that run their own message loop to process and filter input events and deliver their own translated messages).

Anyway, those Windows drivers cannot know how the editing application will finally process the two surrogates: if the application does not detect surrogates properly and chooses to discard one but not the other, the driver is not at fault, it is a bug in the application. MSKLC drivers have no view of the input buffer; they process the input on the fly (though a more advanced input driver with its own message processing loop could send its own messages to query the application about what is in its buffer, or instruct it to perform custom substring replacements and to update its caret position or selection).

So in my view this is not a bug of the layout drivers themselves, and not even a bug of the Windows core API. The editing application (or the common interface component) has to be prepared to process both surrogates as one character, or to discard the lone surrogates it may see (after alerting the user with a beep), or to submit some custom replacement; see the sketches below. It is that application or component which has to manage its input buffer correctly: if the buffer uses 16-bit code units, deleting one position (for example when pressing Backspace or Delete) without looking at what is deleted, or selecting text in the middle of a surrogate pair (and then blindly replacing that selection), will leave lone surrogates in the input buffer.
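To illustrate the pairing side, here is a minimal sketch of how an editing window procedure could recombine the two WM_CHAR messages into a single supplementary code point (only my own illustration, not code taken from MSKLC or from any real editor; the names g_pendingHigh and InsertCodePoint are hypothetical):

  /* Sketch: recombine the high/low surrogate delivered in two WM_CHAR
   * messages into one supplementary code point.  g_pendingHigh and
   * InsertCodePoint are hypothetical, application-defined names. */
  #include <windows.h>

  static WCHAR g_pendingHigh = 0;   /* high surrogate waiting for its mate */

  void InsertCodePoint(UINT32 cp);  /* application-defined: inserts one character */

  LRESULT CALLBACK EditWndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
  {
      if (msg == WM_CHAR) {
          WCHAR cu = (WCHAR)wParam;              /* one UTF-16 code unit per message */
          if (cu >= 0xD800 && cu <= 0xDBFF) {    /* high surrogate: keep it */
              g_pendingHigh = cu;
              return 0;
          }
          if (cu >= 0xDC00 && cu <= 0xDFFF) {    /* low surrogate */
              if (g_pendingHigh) {               /* complete pair: recombine */
                  UINT32 cp = 0x10000 +
                              (((UINT32)(g_pendingHigh - 0xD800)) << 10) +
                              (cu - 0xDC00);
                  g_pendingHigh = 0;
                  InsertCodePoint(cp);
              } else {
                  MessageBeep(MB_ICONWARNING);   /* lone low surrogate: discard, alert */
              }
              return 0;
          }
          g_pendingHigh = 0;                     /* ordinary BMP character */
          InsertCodePoint(cu);
          return 0;
      }
      return DefWindowProc(hwnd, msg, wParam, lParam);
  }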
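And for the deletion side, a minimal sketch of a pair-aware Backspace over a buffer of 16-bit code units (buf, len and pos are hypothetical names for the application's own buffer and caret position; this assumes wchar_t is 16-bit, as on Windows):

  /* Sketch: Backspace must remove a whole surrogate pair, not one unit. */
  #include <wchar.h>
  #include <string.h>

  void DeleteBeforeCaret(wchar_t *buf, size_t *len, size_t *pos)
  {
      if (*pos == 0) return;
      size_t n = 1;                                   /* code units to remove */
      if (*pos >= 2 &&
          buf[*pos - 1] >= 0xDC00 && buf[*pos - 1] <= 0xDFFF &&
          buf[*pos - 2] >= 0xD800 && buf[*pos - 2] <= 0xDBFF)
          n = 2;                                      /* caret sits just after a pair */
      memmove(buf + *pos - n, buf + *pos, (*len - *pos) * sizeof(wchar_t));
      *len -= n;
      *pos -= n;
  }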
The same considerations also apply to Linux input drivers and GUI components, which use 8-bit encodings including UTF-8 (this is more difficult because the Linux kernel is blind to the encoding, which is defined only in the user's input locale environment): the same havoc can happen if the editing application breaks in the middle of a multibyte UTF-8 sequence. Applications must also be ready to accept arbitrary byte sequences, including those that are not valid UTF-8 (how they actually handle the offending bytes remains application-dependent), and the same question arises: how many code points are in an 8-bit string that is not valid UTF-8? There is no unique answer, because how each application filters those errors will vary.

You would have the same problem with console applications using the 8-bit BIOS/DOS input emulation API, or with terminal applications listening for input on a network socket delivering an 8-bit data stream: the emulation protocol also needs to filter that input and detect errors when it does not validate against the expected encoding, but how the protocol recovers after an error remains protocol-dependent, and it is not even certain that the terminal emulator notifies the user of input errors; the protocol may as well interrupt the communication with an EOF event and close the channel.

In other words: as soon as there is a single error in some input submitted to UTF validation, you cannot assert anything about the value of the whole input content.
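To make that last point concrete, here is a minimal sketch of one possible way to count "code points" in a byte string that may not be valid UTF-8 (again my own illustration, not any standard API): it counts each offending or truncated sequence as one U+FFFD, but an application that drops the offending bytes, or that resynchronizes differently, will report a different count, which is exactly why there is no unique answer.

  /* Sketch: count code points under one particular error policy.
   * Each invalid lead byte or truncated sequence counts as one U+FFFD;
   * multibyte sequences are length-checked but not fully validated. */
  #include <stddef.h>

  size_t CountCodePointsLenient(const unsigned char *s, size_t len)
  {
      size_t count = 0, i = 0;
      while (i < len) {
          unsigned char b = s[i];
          size_t need = (b < 0x80) ? 0 :                 /* ASCII        */
                        (b >= 0xC2 && b <= 0xDF) ? 1 :   /* 2-byte lead  */
                        (b >= 0xE0 && b <= 0xEF) ? 2 :   /* 3-byte lead  */
                        (b >= 0xF0 && b <= 0xF4) ? 3 :   /* 4-byte lead  */
                        (size_t)-1;                      /* invalid lead */
          if (need == (size_t)-1) {
              count++;               /* policy: count the bad byte as U+FFFD */
              i++;
              continue;
          }
          size_t j = 1;
          while (j <= need && i + j < len &&
                 (s[i + j] & 0xC0) == 0x80)              /* continuation byte? */
              j++;
          if (j == need + 1) {       /* sequence is complete */
              count++;
              i += need + 1;
          } else {                   /* truncated: one U+FFFD, then resync  */
              count++;               /* at the next non-continuation byte   */
              i += j;
          }
      }
      return count;
  }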