On Sat, 19 Mar 2022, 'Pascal Jasmin' via Programming wrote:

(It probably would help everyone if there was a shorter retelling of the semantics, even assuming the reader was able to skim through most of it.)

There's a bulleted list at the end. Everything preceding it is rationale.


there is a UCS-1 that is different from J's utf8?  Is UCS-1 actually an update of utf8 that has differences?

Sorry, I should have explained this. UCS-1 is what I call J's unicode encoding of one byte per code unit, one code unit per code point. (Not very 'universal', as all it can represent is ASCII + a few unicode characters, but I don't know what else to call it.) There is nothing wrong with it as an implementation strategy, but it should not be exposed to the user. No one else uses it; it is completely uninteresting from an interoperability standpoint.
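
As a rough illustration in Python rather than J: the sketch below assumes that 'UCS-1' behaves like Latin-1 (code points 0 to 255, one byte each). That mapping is an assumption made for the example, not something J documents under that name.

```python
# Assumption: "UCS-1" is modelled here by Python's latin-1 codec
# (code points 0..255, exactly one byte each). This is an analogy, not J's API.
text = "caf\u00e9"                       # 4 code points; the last is U+00E9

ucs1 = text.encode("latin-1")            # one byte per code point
utf8 = text.encode("utf-8")              # U+00E9 takes two bytes in utf8

print(len(text), len(ucs1), len(utf8))   # 4 4 5

# Same character, different bytes: 0xE9 is a whole character in the
# one-byte encoding, while utf8 spells it as the pair 0xC3 0xA9.
print(ucs1[3:] == b"\xe9")               # True
print(utf8[3:] == b"\xc3\xa9")           # True

# Anything above U+00FF cannot be represented at all.
try:
    "\u20ac".encode("latin-1")           # EURO SIGN
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)                     # False
```

The two byte strings agree only on the ASCII prefix, which is why a one-byte encoding is of little use for exchanging data with software that expects utf8.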


Is your proposal's main concern some ability to handle malformed unicode/utf8 sequences?

Random access to unicode, and no incoherent aliasing. Handling malformed sequences is gravy.
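
A small Python sketch of what random access means here. Python's str stands in for an array of code points and bytes for a raw utf8 buffer; neither is meant to describe how J stores text.

```python
# Illustration only: Python str stands in for a code-point array,
# bytes for a raw utf8 buffer.
text = "na\u00efve"                 # 5 code points, the third is U+00EF
raw = text.encode("utf-8")          # 6 bytes: U+00EF occupies two

# Indexing code points: item 2 is the whole character.
third_char = text[2]

# Indexing bytes: item 2 is only the lead byte of that character,
# and on its own it is not valid utf8.
third_byte = raw[2:3]               # b"\xc3"
try:
    third_byte.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(len(text), len(raw))          # 5 6
print(third_char == "\u00ef")       # True
print(slice_is_valid)               # False
```

The same index gives a character in one view and a fragment of a character in the other, which is one way to read the aliasing problem.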

If handling means turn that "character" into null
[...]
The main idea may instead be that if there is malformed unicode, then instead of figuring out some result, whoever sent this garbage should be notified that it is garbage.

I proposed that both mechanisms be available, and the programmer can choose from among them at will.
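
For comparison, Python's utf8 decoder already offers both behaviours through its `errors` argument. This is only an analogy for the choice being described, not the interface proposed for J.

```python
bad = b"ab\xffcd"                   # 0xFF can never appear in utf8

# Mechanism 1: refuse the input and report where it went wrong.
try:
    bad.decode("utf-8")             # errors="strict" is the default
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start        # index of the offending byte

# Mechanism 2: substitute a placeholder and carry on.
lenient = bad.decode("utf-8", errors="replace")

print(error_offset)                 # 2
print(lenient == "ab\ufffdcd")      # True
```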


If handling means turn that "character" into null, how do you guarantee the malformation wasn't a missing byte, and that the rest of the "stream" would be well formed (and the intent of message) if that missing byte could be guessed instead of consuming "the first byte of next character".

Low-level stream processing will need to do low-level encoding handling. This might entail handling the stream as a sequence of numbers rather than characters.

I will also note that my 'nulls' are typed; you get a separate one for every potentially bad source byte, so no information is thrown away. But you might not want to use that for byte-slices of a valid utf8 stream.
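
The closest existing analogue I can point to is Python's `surrogateescape` error handler, which gives each undecodable byte its own placeholder code point. The sketch below uses it purely as a stand-in; it is an assumption that it resembles the typed nulls in spirit, and it is not the proposed J mechanism.

```python
# Assumption: Python's "surrogateescape" handler is used as a stand-in for
# the typed nulls described above; it is not the proposed J mechanism.
bad = b"ab\xff\xfecd"               # two bytes that are invalid in utf8

decoded = bad.decode("utf-8", errors="surrogateescape")

# Each bad byte 0xNN becomes its own code point U+DCNN, so the two
# placeholders are distinguishable from each other.
placeholders = [hex(ord(ch)) for ch in decoded if 0xDC80 <= ord(ch) <= 0xDCFF]
print(placeholders)                 # ['0xdcff', '0xdcfe']

# Because nothing was discarded, the original bytes can be recovered.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")
print(roundtrip == bad)             # True
```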


I believe you are also saying that UCS-1 or utf8 are ubiquitous in the outside world.  I can only understand the appeal as one of space/bandwidth saving.  A better space saving encoding is lempel-ziv (zip) or better compression on unicode4.

I am not proposing that utf8 be used as an internal representation. Emphatically the opposite. But because utf8 is ubiquitous, all interoperation should default to encoding/decoding as utf8.