On Thu, Jul 11, 2024 at 06:34:00PM +0100, ropers wrote:
> On Thu, 11 Jul 2024 at 06:09, Crystal Kolipe <kolip...@exoticsilicon.com>
> wrote:
>
> > On Thu, Jul 11, 2024 at 04:25:33AM +0100, ropers wrote:
> > > It's long been a secret wishlist item for me to solicit/reach agreement on
> > > which 256 (possibly 512) code points might merit inclusion in a minimal
> >
> > There is already preliminary support for proper UTF-8 handling in the
> > framebuffer console on OpenBSD. It's still buggy, but work is on-going.
> >
>
> Thank you very much. That's great news.
>
> It would be really nice if agreement could be reached between all the BSDs
> (and possibly other Unix-likes) on which characters to include in a
> minimalist 256--or 512--character subset of Unicode.

Why?  This makes absolutely no sense to me.

Talking about a 512 character subset, your thinking seems to be influenced
by either VGA hardware or the Linux framebuffer console, neither of which
is relevant for our purposes.

> This would NOT mean OpenBSD's framebuffer console switching to CP1252 or
> even adopting CP1252 -- no, OpenBSD would still be adopting UTF-8 and UTF-8
> only,

There are several different issues at play here.

The wsfont and rasops subsystems already support fonts with more than 256
characters.  This is the 'graphical' side of things, support for drawing
those glyphs.

Right now, you can create a console font with 100,000 glyphs and load it
into the wscons subsystem.

Of course, displaying ASCII text will never touch those glyphs, because the
only displayable characters are 32 - 126.

The framebuffer console is most commonly configured for ISO-8859-1.  Try
running:

$ echo "\0377"	# Octal representation of 0xff.

and you get y with diaeresis.

You can switch to a different NRCS (National Replacement Character Set),
which will pull in characters from beyond codepoint 127 and put them in the
7-bit ASCII range.  On a real DEC terminal these glyphs were remapped from
8-bit positions in the proprietary DEC MCS (which is similar to, but not
identical to, ISO-8859-1).

On OpenBSD, the translation to UCS codepoints is handled by tables defined
in wsemul_vt100_chars.c.

All of the mappings for regular alphabetical characters fall within the
8-bit ISO-8859-1 range, so even at this point you don't need to go beyond
256.

Some of the line-drawing and other graphics characters found in the DEC
technical and special graphics sets are mapped to codepoints > 255 on
OpenBSD.  As a result, with the default spleen font, you won't see these
glyphs.  But if you add appropriate glyphs and re-compile the kernel, it'll
work.

For direct access to glyphs past 255, the OpenBSD wscons console provides
UTF-8 emulation.
For many years that support was broken and nobody noticed, which already
suggests to me that there is limited interest in using anything more than
ASCII, or at most ISO-8859-1, on the framebuffer console.

https://marc.info/?l=openbsd-tech&m=167734639712745

That bug has been fixed.  Others still exist.  Work is on-going.

But OpenBSD already supports ISO-8859-1 on the framebuffer console.

If you have a desire to create your own 8-bit character set to cater for a
particular niche case, it's not particularly difficult.  You could just add
a new control sequence and a new translation table, or modify an existing
one.

But the future is UTF-8.

> however on the question of which of the hundreds of thousands of
> Unicode characters might get one of the 256 limited-edition tickets to
> "supported on console" prominence,

There is no such limitation on OpenBSD.

> It is my understanding that going for 512-character framebuffer console
> charsets would require forgoing broader compatibility and the possible use
> of 16 colours (512-character VGA framebuffer consoles can only do 8
> colours.) Thus limiting the subset to 256 characters seems advisable.

This is a VGA hardware limitation.  The framebuffer console is capable of
pure 24-bit operation.

I published patches last year to make it possible to use 256 colours with
TERM=xterm-256color, and in fact the machine I am writing this email on has
this set right now.

> It would be possible to just put the C0 Control Pictures (
> enwp.org/Control_Pictures) there, which might make the plaintext column in
> (suitably patched) hex editors slightly more informative (fewer dots, more
> identifiable characters)

Hexdump in base calls isprint() to decide whether to print the actual
character or replace it with a dot.

You can see fewer dots today with the following trivial patch:

--- display.c.dist	Wed Aug 24 04:13:45 2016
+++ display.c	Sat Jul 13 09:15:03 2024
@@ -166,7 +166,7 @@
 			}
 			break;
 		case F_P:
-			(void)printf(pr->fmt, isprint(*bp) ? *bp : '.');
+			(void)printf(pr->fmt, ((*bp & 0x7f) >= 32 && (*bp != 0x7f)) ? *bp : '.');
 			break;
 		case F_STR:
 			(void)printf(pr->fmt, (char *)bp);

> One character I strongly feel should be included in a common minimalist
> Unicode subset is the U+FFFD REPLACEMENT CHARACTER

Yeah, actually we do need that.

We need to implement the REP control sequence to properly support the use
of TERM=xterm with the latest ncurses, and when REP follows a control
character the behaviour is undefined.  The most sensible thing to do really
is to display the replacement character, which we currently don't have
available by default.

> > Regarding the OP's specific question - if the files being edited only
> > contain those specific UTF-8 sequences and are otherwise plain ASCII text,
> > then a simple work-around might be a script that replaces each two-byte
> > sequence with the corresponding ISO-8859-1 character, writes that to a
> > temporary file, invokes vi for editing the temporary file, then converts
> > it back to UTF-8 afterwards.
>
> That is a pretty neat idea. For some value of "simple", I suppose. :-)

You can do it in a few lines with sed.

> Of course, this workaround might break in new and interesting ways once
> what's in the files is no longer strictly limited to two-byte characters
> also present in the ISO-8859-1 charset.

Don't do that :-).
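
For illustration only, here is a rough sketch of the byte-level conversion
that such a workaround relies on.  It is written in Python rather than sed,
and the function names are hypothetical.  It assumes the input is plain
ASCII plus two-byte UTF-8 sequences for U+0080 to U+00FF, and it refuses
anything else instead of guessing.

```python
# Sketch of the round trip described above: UTF-8 -> ISO-8859-1 for editing,
# then ISO-8859-1 -> UTF-8 afterwards.  Function names are hypothetical.


def utf8_to_latin1(data: bytes) -> bytes:
    """Collapse two-byte UTF-8 sequences (U+0080..U+00FF) into single bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(b)
            i += 1
        elif (b in (0xC2, 0xC3) and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            # 110000xx 10yyyyyy  ->  xxyyyyyy
            out.append(((b & 0x03) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            # Anything else has no ISO-8859-1 equivalent (or is malformed).
            raise ValueError(
                f"byte 0x{b:02x} at offset {i} cannot be mapped to ISO-8859-1")
    return bytes(out)


def latin1_to_utf8(data: bytes) -> bytes:
    """Expand every byte >= 0x80 back into its two-byte UTF-8 sequence."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)
        else:
            out.append(0xC0 | (b >> 6))    # lead byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))  # continuation byte
    return bytes(out)


if __name__ == "__main__":
    original = "caf\u00e9 \u00ff".encode("utf-8")
    edited = utf8_to_latin1(original)   # this is what vi would be given
    assert latin1_to_utf8(edited) == original
```

Raising an error on any other byte sequence is a deliberate choice in this
sketch: it makes the "new and interesting ways" of breaking visible before
the file is handed to the editor, rather than after it has been written
back.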