Re: [Jprogramming] RFC: unicode

Elijah Stone Sat, 19 Mar 2022 15:06:56 -0700

On Sat, 19 Mar 2022, 'Pascal Jasmin' via Programming wrote:

(It probably would help everyone if there was a shorter retelling of the semantics, even assuming the reader was able to skim through most of it.)

There's a bulleted list at the end. Everything preceding it is rationale.

there is a UCS-1 that is different from J's utf8? Is UCS-1 actually an update of utf8 that has differences?

Sorry, I should have explicated this. UCS-1 is what I call J's unicode encoding of one byte per code unit, one code unit per code point. (Not very 'universal', as all it can represent is ASCII + a few unicode characters, but I don't know what else to call it.) There is nothing wrong with it as an implementation strategy, but it should not be exposed to the user. No one else uses it; it is completely uninteresting from an interoperability standpoint.
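
A small Python sketch may make the 'UCS-1' description concrete. It assumes that "one byte per code point" means code points 0 through 255, which coincides with the encoding Python calls latin-1. That identification, and the helper names `fits_ucs1` and `to_ucs1`, are assumptions made for illustration; they are not taken from J.

```python
def fits_ucs1(text: str) -> bool:
    """True if every code point fits in one byte (0..255)."""
    return all(ord(ch) <= 0xFF for ch in text)


def to_ucs1(text: str) -> bytes:
    """Encode as one byte per code point; raises UnicodeEncodeError otherwise."""
    return text.encode("latin-1")


word = "caf\u00e9"  # 'e' with acute accent is U+00E9, which fits in one byte
print(fits_ucs1(word), to_ucs1(word), word.encode("utf-8"))
print(fits_ucs1("\u03bb"))  # Greek lambda, U+03BB, does not fit
```

The same four characters take four bytes in this scheme and five in utf8, which is why the two cannot be treated as interchangeable views of the same bytes.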

Is your proposal's main concern some ability to handle misformed unicode/utf8 sequences?

Random access to unicode, and no incoherent aliasing. Handling malformed sequences is gravy.
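
To illustrate why random access matters, here is a hedged Python sketch of what indexing costs when text is held as utf8 bytes. The function `nth_code_point_utf8` is a hypothetical helper written for this example, and it assumes well-formed input. It has to scan from the start of the buffer, whereas indexing a sequence of code points is a direct lookup.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of well-formed UTF-8 by linear scan."""
    count = -1
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            count += 1
            if count == n:
                j = i + 1
                # Extend over the continuation bytes of this code point.
                while j < len(data) and data[j] & 0xC0 == 0x80:
                    j += 1
                return data[i:j].decode("utf-8")
    raise IndexError(n)


text = "a\u03bbb\u20acc"      # a, lambda, b, euro sign, c
data = text.encode("utf-8")   # 8 bytes for 5 code points

print(text[3] == nth_code_point_utf8(data, 3))  # same character, different cost
print(data[3:4])                                # byte 3 is a different character
```

Byte index 3 and code point index 3 name different characters here, which is one way the two views of the same data stop agreeing with each other.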

If handling means turn that "character" into null
[...]
The main idea may instead be that if there is malformed unicode, then instead of figuring out some result, whoever sent this garbage should be notified that it is garbage.

I proposed that both mechanisms be available, and the programmer can choose from among them at will.

If handling means turn that "character" into null, how do you guarantee the malformation wasn't a missing byte, and that the rest of the "stream" would be well formed (and the intent of message) if that missing byte could be guessed instead of consuming "the first byte of next character".

Low-level stream processing will need to do low-level encoding handling. This might entail handling the stream as a sequence of numbers rather than characters.

I will also note that my 'nulls' are typed; you get a separate one for every potentially bad source byte, so no information is thrown away. But you might not want to use that for byte-slices of a valid utf8 stream.
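
As an analogy only, and not the mechanism of the proposal, Python's `surrogateescape` error handler behaves much like the typed nulls described here: each undecodable byte becomes its own distinct placeholder, so the original bytes can be recovered. The sketch below also shows the other option, rejecting the input outright. The function names are hypothetical.

```python
def decode_strict(data: bytes) -> str:
    """Reject malformed input: raises UnicodeDecodeError on the first bad byte."""
    return data.decode("utf-8")


def decode_keep_bad_bytes(data: bytes) -> str:
    """Map each bad byte 0xNN to its own placeholder, code point U+DC00 + 0xNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_restore_bad_bytes(text: str) -> bytes:
    """Inverse of decode_keep_bad_bytes: placeholders turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"ab\xffcd\xc3"  # 0xFF is never valid; the trailing 0xC3 is a truncated sequence

text = decode_keep_bad_bytes(raw)
print([hex(ord(ch)) for ch in text])         # two distinct placeholders
print(encode_restore_bad_bytes(text) == raw)  # nothing was thrown away

try:
    decode_strict(raw)
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

By contrast, decoding with `errors="replace"` maps both bad bytes to the same U+FFFD, which does discard information.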

I believe you are also saying that UCS-1 or utf8 are ubiquitous in the outside world. I can only understand the appeal as one of space/bandwidth saving. A better space saving encoding is lempel-ziv (zip) or better compression on unicode4.

I am not proposing that utf8 be used as an internal representation. Emphatically the opposite. But because utf8 is ubiquitous, all interoperation should default to encoding/decoding as utf8.
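
The split between internal representation and interchange format can be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of how J does or would do it: the internal form is taken to be a fixed-width array of code points, and `from_wire` / `to_wire` are hypothetical names for the conversions at the boundary.

```python
from array import array


def from_wire(data: bytes) -> array:
    """Decode UTF-8 received from outside into fixed-width code points."""
    return array("I", (ord(ch) for ch in data.decode("utf-8")))


def to_wire(code_points: array) -> bytes:
    """Encode internal code points as UTF-8 for anything leaving the program."""
    return "".join(map(chr, code_points)).encode("utf-8")


incoming = "a\u03bbb".encode("utf-8")  # 4 bytes on the wire
internal = from_wire(incoming)         # 3 code points, each directly indexable

print(len(incoming), len(internal), hex(internal[1]))
print(to_wire(internal) == incoming)
```

Inside the program every element is one code point; utf8 appears only in `from_wire` and `to_wire`.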

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
