On Thu, Jul 11, 2024 at 06:34:00PM +0100, ropers wrote:
> On Thu, 11 Jul 2024 at 06:09, Crystal Kolipe <kolip...@exoticsilicon.com>
> wrote:
> 
> > On Thu, Jul 11, 2024 at 04:25:33AM +0100, ropers wrote:
> > > It's long been a secret wishlist item for me to solicit/reach agreement on
> > > which 256 (possibly 512) code points might merit inclusion in a minimal
> >
> > There is already preliminary support for proper UTF-8 handling in the
> > framebuffer console on OpenBSD.  It's still buggy, but work is on-going.
> >
> 
> Thank you very much. That's great news.
> 
> It would be really nice if agreement could be reached between all the BSDs
> (and possibly other Unix-likes) on which characters to include in a
> minimalist 256--or 512--character subset of Unicode.

Why?  This makes absolutely no sense to me.

Talking about a 512 character subset, your thinking seems to be influenced
by either VGA hardware or the linux framebuffer console, neither of which is
relevant for our purposes.

> This would NOT mean OpenBSD's framebuffer console switching to CP1252 or
> even adopting CP1252 -- no, OpenBSD would still be adopting UTF-8 and UTF-8
> only,

There are several different issues at play here.

The wsfont and rasops subsystems already support fonts with > 256 characters.
This is the 'graphical' side of things, support for drawing those glyphs.

Today, you can create a console font with 100,000 glyphs and load it into
the wscons subsystem.

Of course, displaying ASCII text will never touch those glyphs, because the
only displayable characters are 32 - 126.

The framebuffer console is most commonly configured for ISO-8859-1.  Try
running:

$ echo "\0377" # Octal representation of 0xff.

and you get y with diaeresis.
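
As a rough illustration of why that byte comes out as y with diaeresis:
ISO-8859-1 byte values are identical to the first 256 Unicode codepoints, so
0xff is simply U+00FF.  The sketch below shows the same character encoded as
UTF-8 instead.  latin1_to_utf8() is a hypothetical helper name made up for
this example, not anything in the OpenBSD tree.

```c
#include <stddef.h>

/*
 * Encode one ISO-8859-1 byte as UTF-8.
 * The byte value is the Unicode codepoint (U+0000..U+00FF).
 * Writes 1 or 2 bytes to out and returns the number written.
 * Hypothetical helper for illustration only.
 */
size_t
latin1_to_utf8(unsigned char c, unsigned char out[2])
{
	if (c < 0x80) {
		/* ASCII range: one byte, unchanged. */
		out[0] = c;
		return 1;
	}
	/* 0x80..0xff: two-byte sequence 110xxxxx 10xxxxxx. */
	out[0] = 0xc0 | (c >> 6);
	out[1] = 0x80 | (c & 0x3f);
	return 2;
}
```

So the single byte 0xff on an ISO-8859-1 console corresponds to the two bytes
0xc3 0xbf in a UTF-8 stream.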

You can switch to a different NRCS (National Replacement Character Set),
which will pull in characters from beyond codepoint 127 and put them in the
7-bit ASCII range.

On a real DEC terminal these glyphs were remapped from 8-bit positions in the
proprietary DEC MCS (which is similar to, but not identical to, ISO-8859-1).

On OpenBSD, the translation to UCS codepoints is handled by tables defined in
wsemul_vt100_chars.c.

All of the mappings for regular alphabetical characters fall within the 8-bit
ISO-8859-1 range, so even at this point you're not needing to go beyond 256.

Some of the line-drawing and other graphics characters found in the DEC
technical and special graphics sets are mapped to codepoints > 255 on OpenBSD.
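
To make the idea of such a translation table concrete, here is a minimal
sketch covering a handful of well-known positions from the DEC special
graphics set.  It is not the contents or layout of wsemul_vt100_chars.c; the
struct, table and function names are all invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry: a 7-bit character position and the UCS codepoint it selects. */
struct gfx_map {
	unsigned char pos;
	uint16_t ucs;
};

/* Illustrative subset of the DEC special graphics set. */
static const struct gfx_map decgfx_sample[] = {
	{ 'j', 0x2518 },	/* lower right corner */
	{ 'k', 0x2510 },	/* upper right corner */
	{ 'l', 0x250c },	/* upper left corner */
	{ 'm', 0x2514 },	/* lower left corner */
	{ 'q', 0x2500 },	/* horizontal line */
	{ 'x', 0x2502 },	/* vertical line */
};

/* Translate c via the sample table; unmapped positions pass through. */
uint16_t
translate_decgfx(unsigned char c)
{
	size_t i;

	for (i = 0; i < sizeof(decgfx_sample) / sizeof(decgfx_sample[0]); i++)
		if (decgfx_sample[i].pos == c)
			return decgfx_sample[i].ucs;
	return c;
}

/* Non-zero if drawing c needs a font glyph beyond position 255. */
int
needs_glyph_past_255(unsigned char c)
{
	return translate_decgfx(c) > 0xff;
}
```

Every line-drawing entry above lands in the U+2500 block, which is why a font
that stops at 255 cannot show them.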

As a result, with the default spleen font, you won't see these glyphs.  But
if you add appropriate glyphs and re-compile the kernel, it'll work.

For direct access to glyphs past 255, the OpenBSD wscons console provides
UTF-8 emulation.

For many years that support was broken and nobody noticed, which suggests to
me that there is limited interest in using anything more than ASCII, or at
most ISO-8859-1, on the framebuffer console.

https://marc.info/?l=openbsd-tech&m=167734639712745

That bug has been fixed.  Others still exist.  Work is on-going.

But OpenBSD already supports ISO-8859-1 on the framebuffer console.

If you have a desire to create your own 8-bit character set to cater for
a particular niche case, it's not particularly difficult.  You could just add
a new control sequence and a new translation table, or modify an existing one.

But the future is UTF-8.

> however on the question of which of the hundreds of thousands of
> Unicode characters might get one of the 256 limited-edition tickets to
> "supported on console" prominence,

There is no such limitation on OpenBSD.

> It is my understanding that going for 512-character framebuffer console
> charsets would require forgoing broader compatibility and the possible use
> of 16 colours (512-character VGA framebuffer consoles can only do 8
> colours.) Thus limiting the subset to 256 characters seems advisable.

This is a VGA hardware limitation.  The framebuffer console is capable of
pure 24-bit operation.  I published patches last year to make it possible to
use 256 colours with TERM=xterm-256color, and in fact the machine I am writing
this email on has this set right now.
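
For anyone wondering how a 256 colour index relates to a 24-bit framebuffer,
the sketch below expands an index into 0xRRGGBB using the conventional xterm
default palette layout.  This is an assumption for illustration, not code
taken from the patches mentioned above, and xterm256_to_rgb() is a made-up
name.

```c
#include <stdint.h>

/*
 * Map an xterm-256color palette index to 24-bit 0xRRGGBB.
 * Covers 16..231 (6x6x6 colour cube) and 232..255 (grey ramp), using the
 * conventional xterm default values.  Indices 0..15 are the basic colours,
 * whose exact RGB values vary between terminals, so they are left out.
 * Returns 0 on success, -1 if idx is not handled.
 * Hypothetical helper for illustration only.
 */
int
xterm256_to_rgb(int idx, uint32_t *rgb)
{
	static const uint8_t level[6] = { 0, 95, 135, 175, 215, 255 };
	int v;

	if (idx >= 16 && idx <= 231) {
		idx -= 16;
		*rgb = ((uint32_t)level[idx / 36] << 16) |
		    ((uint32_t)level[(idx / 6) % 6] << 8) |
		    level[idx % 6];
		return 0;
	}
	if (idx >= 232 && idx <= 255) {
		/* 24 grey steps from 8 to 238. */
		v = 8 + 10 * (idx - 232);
		*rgb = ((uint32_t)v << 16) | ((uint32_t)v << 8) | (uint32_t)v;
		return 0;
	}
	return -1;
}
```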

> It would be possible to just put the C0 Control Pictures (
> enwp.org/Control_Pictures) there, which might make the plaintext column in
> (suitably patched) hex editors slightly more informative (fewer dots, more
> identifiable characters)

Hexdump in base calls isprint() to decide whether to print the actual character
or replace it with a dot.  You can see fewer dots today with the following
trivial patch:

--- display.c.dist      Wed Aug 24 04:13:45 2016
+++ display.c   Sat Jul 13 09:15:03 2024
@@ -166,7 +166,7 @@
                }
                break;
        case F_P:
-               (void)printf(pr->fmt, isprint(*bp) ? *bp : '.');
+               (void)printf(pr->fmt, ((*bp & 0x7f) >= 32 && (*bp != 0x7f)) ? *bp : '.');
                break;
        case F_STR:
                (void)printf(pr->fmt, (char *)bp);
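
To see what the replacement test accepts, here it is pulled out into a
standalone function.  The name is_latin1_displayable() is invented for this
example, and it assumes the byte is unsigned, as bp is in hexdump.

```c
/*
 * Same condition as the patched F_P case above.
 * Masking with 0x7f folds 0x80..0xff onto 0x00..0x7f, so the >= 32 check
 * rejects both the C0 controls (0x00..0x1f) and the C1 controls
 * (0x80..0x9f).  DEL (0x7f) is rejected explicitly, while 0xff passes.
 * Returns non-zero for 0x20..0x7e and 0xa0..0xff.
 */
int
is_latin1_displayable(unsigned char c)
{
	return (c & 0x7f) >= 32 && c != 0x7f;
}
```

In other words, the patch prints everything that has a glyph in ISO-8859-1
and keeps the dot for control characters.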


> One character I strongly feel should be included in a common minimalist
> Unicode subset is the U+FFFD ??? REPLACEMENT CHARACTER

Yeah, actually we do need that.  We need to implement the REP control sequence
to properly support the use of TERM=xterm with the latest ncurses, and the
behaviour when REP follows a control character is undefined.  The most sensible
thing to do really is display the replacement character, which we currently
don't have available by default.
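
A minimal sketch of that approach, assuming the emulator remembers the last
codepoint it received and whether it was a graphic character.  None of this
is existing wscons code; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_REPLACEMENT	0xfffd

/* Hypothetical per-terminal state needed for REP. */
struct rep_state {
	uint32_t last;		/* last codepoint received */
	int last_is_graphic;	/* 0 if it was a C0/C1 control or DEL */
};

/* Record each incoming codepoint before it is processed. */
void
rep_note(struct rep_state *st, uint32_t cp)
{
	st->last = cp;
	st->last_is_graphic = !(cp < 0x20 || (cp >= 0x7f && cp < 0xa0));
}

/*
 * Expand REP with parameter n into out, writing at most outlen codepoints.
 * A parameter of 0 is treated as the default of 1.  If the preceding input
 * was a control character, the replacement character is repeated instead.
 * Returns the number of codepoints written.
 */
size_t
rep_expand(const struct rep_state *st, size_t n, uint32_t *out, size_t outlen)
{
	uint32_t cp = st->last_is_graphic ? st->last : UCS_REPLACEMENT;
	size_t i;

	if (n == 0)
		n = 1;
	if (n > outlen)
		n = outlen;
	for (i = 0; i < n; i++)
		out[i] = cp;
	return n;
}
```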

> > Regarding the OP's specific question - if the files being edited only
> > contain those specific UTF-8 sequences and are otherwise plain ASCII text,
> > then a simple work-around might be a script that replaces each two-byte
> > sequence with the corresponding ISO-8859-1 character, writes that to a
> > temporary file, invokes vi for editing the temporary file, then converts
> > it back to UTF-8 afterwards.
> 
> That is a pretty neat idea. For some value of "simple", I suppose. :-)

You can do it in a few lines with sed.
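
For comparison, the forward conversion is also only a few lines of C.  This
is a sketch rather than a tested tool: it assumes the input contains nothing
but ASCII and two-byte sequences for U+0080 to U+00FF, and it refuses
anything else instead of guessing.  utf8_to_latin1() is a made-up name.

```c
#include <stddef.h>

/*
 * Convert UTF-8 in in[0..inlen) to ISO-8859-1 in out, which must have room
 * for at least inlen bytes.
 * Accepts ASCII and two-byte sequences with a lead byte of 0xc2 or 0xc3,
 * which together cover U+0080..U+00FF.
 * Returns the number of bytes written, or (size_t)-1 on any other input.
 * Hypothetical helper for illustration only.
 */
size_t
utf8_to_latin1(const unsigned char *in, size_t inlen, unsigned char *out)
{
	size_t i = 0, o = 0;

	while (i < inlen) {
		if (in[i] < 0x80) {
			out[o++] = in[i++];
		} else if ((in[i] == 0xc2 || in[i] == 0xc3) &&
		    i + 1 < inlen && (in[i + 1] & 0xc0) == 0x80) {
			/* Two low bits of the lead byte, six of the trail. */
			out[o++] = ((in[i] & 0x03) << 6) | (in[i + 1] & 0x3f);
			i += 2;
		} else {
			/* Not representable in ISO-8859-1, or malformed. */
			return (size_t)-1;
		}
	}
	return o;
}
```

Failing on anything outside that range means a file with, say, a euro sign
is caught before editing rather than silently mangled.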

> Of course, this workaround might break in new and interesting ways once
> what's in the files is no longer strictly limited to two-byte characters
> also present in the ISO-8859-1 charset.

Don't do that :-).
