On Tue, Dec 15, 2015 at 2:27 PM, Tim Roberts <t...@probo.com> wrote:
>
> The Windows console shell is an 8-bit entity. That means you only have
> 256 characters available at any given time, similar to the way
> non-Unicode strings work in Python 2.

The input and screen buffers of the console (conhost.exe) are UCS-2,
which has been the case since NT 3.1 was released in 1993. There are
display limits, such as not being able to mix narrow and wide glyphs
and not handling characters composed with multiple codes (such as
UTF-16 surrogate pairs). Regardless of what's displayed, the
wide-character API preserves the underlying UTF-16 text.

That said, handling the input buffer requires special care due to how
it represents characters that aren't mapped by the current keyboard
layout.

In this case, the WindowProc of conhost.exe handles a WM_DROPFILES [1]
message as if the dropped path were pasted from the clipboard. It loops
over the string to create an INPUT_RECORD [2] array. Each character is
mapped in the current keyboard layout via VkKeyScan [3]. If this fails,
the console uses a sequence of Alt+Numpad key event records.

(At the end of this reply I'm including a commented transcript of a
session with a debugger attached to conhost.exe in Windows 10. I set a
breakpoint on s_DoStringPaste to watch how it handled pasting "À" into
the input buffer.)

A client program that calls ReadConsoleW [4] doesn't have to worry
about this. The console internally handles decoding the Alt+Numpad
sequence when it writes the input to the caller's wide-character
buffer.

Microsoft's getwch function instead calls ReadConsoleInputW [5] to be
able to read extended keys and avoid discarding non-keyboard events,
but it doesn't handle the Alt+Numpad case. Handling these sequences
requires a custom implementation of kbhit and getwch.

An example that gets this right is the PDCurses [6] library, when
compiled using the wide-character API. Christoph Gohlke has a Python
curses module [7] for Windows that uses PDCurses, but only the Python 3
version is compiled with Unicode support.

If extended key support (e.g.
arrow and function keys) and preserving mouse, window buffer, and focus
events don't matter, then just disable the console's line input and
echo mode, and call ReadConsoleW to read a character at a time. This
lets the console handle the Alt+Numpad events for you.

Here's example ctypes code for this limited implementation of kbhit
and getwch. It's not broadly tested, so caveat emptor. I did check that
it worked with file drops and pasting Unicode strings into the console,
as well as manual Alt+Numpad input.

    import msvcrt
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    STD_INPUT_HANDLE = -10
    KEY_EVENT = 1
    VK_MENU = 0x12
    ENABLE_LINE_INPUT = 2
    ENABLE_ECHO_INPUT = 4

    wintypes.CHAR = ctypes.c_char

    class INPUT_RECORD(ctypes.Structure):
        class EVENT_RECORD(ctypes.Union):
            class KEY_EVENT_RECORD(ctypes.Structure):
                class UCHAR(ctypes.Union):
                    _fields_ = (('UnicodeChar', wintypes.WCHAR),
                                ('AsciiChar', wintypes.CHAR))
                _fields_ = (('bKeyDown', wintypes.BOOL),
                            ('wRepeatCount', wintypes.WORD),
                            ('wVirtualKeyCode', wintypes.WORD),
                            ('wVirtualScanCode', wintypes.WORD),
                            ('uChar', UCHAR),
                            ('dwControlKeyState', wintypes.DWORD))
            # Only key events are inspected, so the other union
            # members are omitted.
            _fields_ = (('KeyEvent', KEY_EVENT_RECORD),)
        _fields_ = (('EventType', wintypes.WORD),
                    ('Event', EVENT_RECORD))

    def kbhit():
        handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
        npend = wintypes.DWORD()
        npeek = wintypes.DWORD()
        if (not kernel32.GetNumberOfConsoleInputEvents(
                    handle, ctypes.byref(npend)) or
                npend.value == 0):
            return False
        # Peek at the pending records without removing them.
        inbuf = (INPUT_RECORD * npend.value)()
        if (not kernel32.PeekConsoleInputW(
                    handle, inbuf, npend, ctypes.byref(npeek)) or
                npeek.value == 0):
            return False
        peek = (INPUT_RECORD * npeek.value).from_buffer(inbuf)
        for p in peek:
            if p.EventType != KEY_EVENT:
                continue
            e = p.Event.KeyEvent
            # A key press, or the Alt release that carries the
            # character of an Alt+Numpad sequence. Compare with NUL
            # explicitly: ctypes returns u'\0' for an empty WCHAR,
            # which is a non-empty (true) string.
            if (e.bKeyDown or
                    (e.wVirtualKeyCode == VK_MENU and
                     e.uChar.UnicodeChar != u'\0')):
                return True
        return False

    def getwch():
        handle = kernel32.GetStdHandle(STD_INPUT_HANDLE)
        old_mode = wintypes.DWORD()
        if not kernel32.GetConsoleMode(handle, ctypes.byref(old_mode)):
            raise ctypes.WinError(ctypes.get_last_error())
        # Disable line input and echo so that ReadConsoleW returns
        # after a single character.
        mode = old_mode.value & ~(ENABLE_LINE_INPUT | ENABLE_ECHO_INPUT)
        kernel32.SetConsoleMode(handle, mode)
        try:
            wc = wintypes.WCHAR()
            n = wintypes.DWORD()
            if not kernel32.ReadConsoleW(
                        handle, ctypes.byref(wc), 1, ctypes.byref(n), None):
                raise ctypes.WinError(ctypes.get_last_error())
            return wc.value
        finally:
            # Always restore the original console mode.
            kernel32.SetConsoleMode(handle, old_mode)

> Using msvcrt.getwch does not convert the console to a Unicode entity.
> It merely means the characters you DO get are represented in Unicode.

FYI, the CRT source code is distributed with Visual Studio. For
example, with Windows 10 and Visual Studio 2015, it should be installed
here:

    _getch, _kbhit
        %ProgramFiles(x86)%\Windows Kits\10\Source\10.0.10150.0\ucrt\conio\getch.cpp
    _getwch
        %ProgramFiles(x86)%\Windows Kits\10\Source\10.0.10150.0\ucrt\conio\getwch.cpp

So there's no mystery about what these functions do. The mystery that
requires digging into the debugger is how conhost.exe implements the
public console API. Thankfully Microsoft's symbol server publishes the
(public) conhost symbols, so it's relatively easy to find interesting
functions to break on.

> The Windows console theoretically supports a UTF-8 code page (chcp
> 65001), and it does fix many of these problems, but there are some
> console apps that won't like it.

The console itself doesn't support codepage 65001 (UTF-8) well at all.
Depending on the version of Windows, conhost.exe (or csrss.exe prior to
Win 7) has several bugs and shortcomings with this codepage. For
example:

* For reading from the console, all versions I've used fail to
  correctly encode non-ASCII characters as UTF-8 via
  WideCharToMultiByte. If you request 10 bytes, it attempts to encode
  10 characters, which fails when the text contains non-ASCII
  characters. Instead of trying to dynamically step down the number of
  characters, it returns to the client that it 'successfully' read 0
  bytes.
  This generally gets interpreted as EOF. For example, Python's REPL
  quits, and input() raises EOFError.

* A buffered writer might flush and split a 2-4 byte UTF-8 sequence
  into two separate writes. But the console doesn't maintain the state
  of partially written characters (or of partial reads, if the above
  bug weren't there). Instead you'll end up with 2-4 U+FFFD
  replacement characters written to the console.

* Prior to Windows 8, WriteFile to the console incorrectly reports the
  number of Unicode characters written instead of the number of bytes.
  So buffered writers will loop repeatedly writing what they think is
  the remainder of the UTF-8 buffer. This causes a potentially long
  trail of junk text to be printed after every buffered write that
  contains non-ASCII characters.

As mentioned above, here's the debug session with a breakpoint set on
ConhostV2!Clipboard::s_DoStringPaste. (conhostv2.dll was added in
Windows 10, as part of the update of the console interface. It seems
they're modularizing and modernizing the design using C++ classes,
perhaps to accommodate more improvements in future releases?)

To follow this it helps to have a basic understanding of Microsoft's
debugger commands [8] and x64 register usage [9].

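It may also help to know how the 20-byte record dumps near the end of
the transcript map onto the fields of a key-event INPUT_RECORD [2]. The
following is only a reading aid, a minimal Python 3 sketch that unpacks
the ten 16-bit words printed by "dw ... la". The names KeyEvent and
decode_record are hypothetical helpers made up for this illustration;
they are not part of the console API or the debugger. The second word
of each record is alignment padding, so its value can be garbage.

```python
import struct
from collections import namedtuple

# Hypothetical names, used only for this illustration.
KeyEvent = namedtuple(
    'KeyEvent',
    'event_type key_down repeat vk scan char ctrl_state')

KEY_EVENT = 1

def decode_record(dump_words):
    """Decode ten hex words (as printed by "dw ... la") into a KeyEvent."""
    words = [int(w, 16) for w in dump_words.split()]
    if len(words) != 10:
        raise ValueError('expected 10 words (20 bytes)')
    raw = struct.pack('<10H', *words)
    # Layout: WORD EventType, 2 bytes of padding, then KEY_EVENT_RECORD:
    # BOOL bKeyDown, WORD wRepeatCount, WORD wVirtualKeyCode,
    # WORD wVirtualScanCode, WCHAR UnicodeChar, DWORD dwControlKeyState.
    (event_type, key_down, repeat, vk, scan, char,
     ctrl_state) = struct.unpack('<H2xIHHHHI', raw)
    if event_type != KEY_EVENT:
        raise ValueError('not a key event record')
    return KeyEvent(event_type, bool(key_down), repeat, vk, scan,
                    chr(char), ctrl_state)

# The last record from the transcript: Alt released, carrying U+00C0.
last = decode_record('0001 0000 0000 0000 0001 0012 0038 00c0 0000 0000')
print(hex(last.vk), last.key_down, hex(ord(last.char)))
# prints: 0x12 False 0xc0
```

Decoding the first record the same way gives a VK_MENU key-down event
with a dwControlKeyState of 2 and no character, which matches the
"VK_MENU (0x12) pressed" annotation in the transcript.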
    Breakpoint 0 hit
    ConhostV2!Clipboard::s_DoStringPaste:
    00007ffb`11086120 4885c9 test rcx,rcx
    0:001> pc

Allocate memory for the INPUT_RECORD array:

    ConhostV2!Clipboard::s_DoStringPaste+0x51:
    00007ffb`11086171 ff1501470200 call qword ptr [
        ConhostV2!_imp_RtlAllocateHeap (00007ffb`110aa878)]
            ds:00007ffb`110aa878={
                ntdll!RtlAllocateHeap (00007ffb`1c6aebf0)}

VkKeyScanW returns -1 (0xffff) because the character isn't mapped in
the keyboard layout:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x122:
    00007ffb`11086242 ff1520420200 call qword ptr [
        ConhostV2!_imp_VkKeyScanW (00007ffb`110aa468)]
            ds:00007ffb`110aa468={
                USER32!VkKeyScanW (00007ffb`1a6f6dc0)}
    0:001> p; r rax
    rax=ffffffffffffffff

So convert the character to the closest OEM character to create an
Alt+Numpad sequence. Note that the OEM character is just for the
sequence. The actual Unicode character is stored in the Alt key
(VK_MENU) release event.

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x1a9:
    00007ffb`110862c9 e8cecd0000 call
        ConhostV2!ConvertToOem (00007ffb`1109309c)
    0:001> ? @rcx
    Evaluate expression: 437 = 00000000`000001b5
    0:001> du @rdx l1
    000000d3`1935f850  "À"
    0:001> r r9
    r9=000000d31935f840

The closest character to L'À' in codepage 437 is ASCII 'A':

    0:001> p; da d31935f840 l1
    000000d3`1935f840  "A"

Call _itoa_s to get the base 10 representation of the ordinal value of
'A' as the string "65":

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x1c0:
    00007ffb`110862e0 ff15d2440200 call qword ptr [
        ConhostV2!_imp__itoa_s (00007ffb`110aa7b8)]
            ds:00007ffb`110aa7b8={
                msvcrt!itoa_s (00007ffb`1c042af0)}
    0:001> ?? (char)@rcx
    char 0n65 'A'
    0:001> r rdx
    rdx=000000d31935f7c4
    0:001> p; da d31935f7c4
    000000d3`1935f7c4  "65"

Create events for entering 6 and 5 on the numeric keypad. The
corresponding wVirtualKeyCode values are VK_NUMPAD6 and VK_NUMPAD5.
Also get the keyboard scan codes by calling MapVirtualKeyW.

Call MapVirtualKeyW to get the wVirtualScanCode for VK_NUMPAD6:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x21c:
    00007ffb`1108633c ff151e410200 call qword ptr [
        ConhostV2!_imp_MapVirtualKeyW (00007ffb`110aa460)]
            ds:00007ffb`110aa460={
                USER32!MapVirtualKeyW (00007ffb`1a6f3e00)}
    0:001> r rcx
    rcx=0000000000000066

Call MapVirtualKeyW to get the wVirtualScanCode for VK_NUMPAD5:

    0:001> pc
    ConhostV2!Clipboard::s_DoStringPaste+0x21c:
    00007ffb`1108633c ff151e410200 call qword ptr [
        ConhostV2!_imp_MapVirtualKeyW (00007ffb`110aa460)]
            ds:00007ffb`110aa460={
                USER32!MapVirtualKeyW (00007ffb`1a6f3e00)}
    0:001> r rcx
    rcx=0000000000000065
    0:001> pc

Write the INPUT_RECORD array to the input buffer:

    ConhostV2!Clipboard::s_DoStringPaste+0x3e8:
    00007ffb`11086508 e87b66ffff call
        ConhostV2!WriteInputBuffer (00007ffb`1107cb88)

This writes an array with 6 records:

    0:001> r r8
    r8=0000000000000006

Each record is 20 (0x14) bytes.

VK_MENU (0x12) pressed:

    0:001> dw (@rdx + 0*14) la
    000000d3`16dc9010  0001 16df 0001 0000 0001 0012 0038 0000
    000000d3`16dc9020  0002 0000

VK_NUMPAD6 (0x66) pressed:

    0:001> dw (@rdx + 1*14) la
    000000d3`16dc9024  0001 0000 0001 0000 0001 0066 004d 0000
    000000d3`16dc9034  0002 0000

VK_NUMPAD6 (0x66) released:

    0:001> dw (@rdx + 2*14) la
    000000d3`16dc9038  0001 3474 0000 0000 0001 0066 004d 0000
    000000d3`16dc9048  0002 0000

VK_NUMPAD5 (0x65) pressed:

    0:001> dw (@rdx + 3*14) la
    000000d3`16dc904c  0001 0000 0001 0000 0001 0065 004c 0000
    000000d3`16dc905c  0002 0000

VK_NUMPAD5 (0x65) released:

    0:001> dw (@rdx + 4*14) la
    000000d3`16dc9060  0001 0000 0000 0000 0001 0065 004c 0000
    000000d3`16dc9070  0002 0000

VK_MENU (0x12) released; UnicodeChar == U+00C0:

    0:001> dw (@rdx + 5*14) la
    000000d3`16dc9074  0001 0000 0000 0000 0001 0012 0038 00c0
    000000d3`16dc9084  0000 0000

[1]: https://msdn.microsoft.com/en-us/library/bb774303
[2]: https://msdn.microsoft.com/en-us/library/ms683499
[3]: https://msdn.microsoft.com/en-us/library/ms646329
[4]: https://msdn.microsoft.com/en-us/library/ms684958
[5]: https://msdn.microsoft.com/en-us/library/ms684961
[6]: http://pdcurses.sourceforge.net
[7]: http://www.lfd.uci.edu/~gohlke/pythonlibs/#curses
[8]: https://msdn.microsoft.com/en-us/library/ff540507
[9]: https://msdn.microsoft.com/en-us/library/9z1stfyw
_______________________________________________
python-win32 mailing list
python-win32@python.org
https://mail.python.org/mailman/listinfo/python-win32