An easy counter-argument to all of this runs as follows: I have proposed a
breaking change, but there is a great deal of 'break', and not much
'change' to show for it. I will attempt to head this off preemptively:
Attempting to do text-processing without considering unicode is like
trying to do math before Russell, Gödel, et al. It is necessary to
consider very carefully what we are doing, whether it is ok to do it, and
if so why. The ultimate conclusion is inevitably that 99% of what we are
doing was fine, but 1) the existence of sound underpinnings _is_
significant; and 2) that 1% does matter. J's text handling capabilities
are unsound, and this manifests in the form of inconsistencies, as
demonstrated.
-E
On Fri, 18 Mar 2022, Elijah Stone wrote:
> x=: 8 u: 97 243 98 NB. same as entering x=: 'aób'
> y=: 9 u: x
> z=: 10 u: 97 195 179 98
> x
> aób
> y
> aób
> z
> aób
>
> x-:y
> 0
> NB. ??? they look the same
>
> x-:z
> 1
> NB. ??? they look different
>
> $x
> 4
> NB. ??? it looks like 3 characters, not 4
>
> Well, this is unicode. There are good reasons why two things that look
> the same might not actually be the same. For instance:
>
> ]p=: 10 u: 97 243 98
> aób
> ]q=: 10 u: 97 111 769 98
> aób
> p-:q
> 0
>
> But in the above case, x doesn't match y for stupid reasons. And x
> matches z for stupider ones.
>
> J's default (1-byte) character representation is a weird hodge-podge of
> 'UCS-1' (I don't know what else to call it) and UTF-8, and it does not
> seem well thought through. The dictionary page for u: seems confused as
> to whether the 1-byte representation corresponds to ASCII or UTF-8, and
> similarly as to whether the 2-byte representation is coded as UCS-2 or
> UTF-16.
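>
> (A quick illustration of the 1-byte ambiguity, assuming a utf-8 session:
>
>    #'ó'          NB. one character as typed...
> 2
>    a. i. 'ó'     NB. ...but two utf-8 code units in the 1-byte type
> 195 179
>
> a. indexes the 1-byte alphabet, so this is the raw byte view.)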
>
> Most charitably, this is exposing low-level aspects of the encoding to
> users, but if so, that is unsuitable for a high-level language such as j,
> and it is inconsistent. I do not have to worry that 0 1 1 0 1 1 0 1 will
> suddenly turn into 36169536663191680, nor that 2.718 will suddenly turn
> into 4613302810693613912, but _that is exactly what is happening in the
> above code_.
>
> I give you the crowning WTF (maybe it is not so surprising at this
> point...):
>
> x;y;x,y NB. pls j
> ┌───┬───┬───────┐
> │aób│aób│aÃ³baób│
> └───┴───┴───────┘
>
> Unicode is delicate and skittish, and must be approached delicately. I
> think that there are some essential conflicts between unicode and j--as
> the above example with the combining character demonstrates--but also that
> pandora's box is open: literal data _exists_ in j. Given that that is the
> case, I think it is possible and desirable to do much better than the
> current scheme.
>
> ---
>
> Unicode text can be broken up in a number of ways. Graphemes, characters,
> code points, code units...
>
> The composition of code units into code points is the only such
> demarcation which is stable and can be counted upon. It is also a
> demarcation which is necessary for pretty much any interesting text
> processing (to the point that I would suggest any form of 'text
> processing' which does not consider code points is not actually processing
> text). Therefore, I suggest that, at a minimum, no user-exposed
> representation of text should acknowledge a delineation below that of the
> code point. If there is any primitive which deals in code units, it
> should be a foreign: scary, obscure, not for everyday use.
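>
> (To make the distinction concrete with today's primitives, recall x from
> the opening example, the utf-8 bytes for 'aób':
>
>    # x         NB. tally sees 4 code units...
> 4
>    # 7 u: x    NB. ...for 3 code points
> 3
>
> Under the proposal, only the second answer would be expressible without
> reaching for a scary foreign.)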
>
> A non-obvious but good result of the above is that all strings are
> correctly formed by construction. Not all sequences of code units are
> correctly formed and correspond to valid strings of text. But all
> sequences of code points _are_, of necessity, correctly formed, otherwise
> there would be ... problems following additions to unicode. J currently
> allows us to create malformed strings, but then complains when we use them
> in certain ways:
>
> 9 u: 1 u: 10 u: 254 255
> |domain error
> | 9 u:1 u:10 u:254 255
>
> ---
>
> It is an open question whether j should natively recognise delineations
> above the code point. It pains me to suggest that it should not.
>
> Raku (a pointer-chasing language) has the best-thought-out strings of any
> programming language I have encountered. (Unsurprising, given it was
> written by perl hackers.) In raku, operations on strings are
> grapheme-oriented. Raku also normalizes all text by default (which solves
> the problem I presented above with combining characters--but rest assured,
> it cannot solve all such problems). They even have a scheme for
> space-efficient random access to strings on this basis.
>
> But j is not raku, and it is telling that, though raku has
> multidimensional arrays, its strings are _not_ arrays, and it does not
> have characters. The principal problem is a violation of the rules of
> conformability. For instance, it is not guaranteed that, for vectors x
> and y, (#x,y) -: x +&# y. This is not _so_ terrible (though it is pretty
> bad), but from it follows an obvious problem with catenating higher-rank
> arrays. Similar concerns apply at least to i., e., E., and }. That said,
> I would support the addition of primitives to perform normalization (as
> well as casefolding etc.) and identification of grapheme boundaries.
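>
> (To see the conformability failure concretely, suppose # counted graphemes
> (hypothetical; j has no such primitive):
>
>    x =: 10 u: 97 111    NB. 'a', 'o'              -- 2 graphemes
>    y =: 10 u: 769 98    NB. combining acute, 'b'  -- 2 graphemes
>    NB. x,y renders as 'aób'                       -- 3 graphemes
>    NB. so x +&# y would be 4, while #x,y would be 3
>
> No grapheme-counting # can satisfy both.)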
>
> ---
>
> It would be wise of me to address the elephant in the room. Characters
> are not only used to represent text, but also arbitrary binary data, e.g.
> from the network or files, which may in fact be malformed as text. I
> submit that characters are clearly the wrong way to represent such data;
> the right way to represent a sequence of _octets_ is using _integers_.
> But people persist, and there are two issues: the first is compatibility,
> and the second is performance.
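>
> (In today's j, the integer view of a file is one step away; the file name
> here is hypothetical:
>
>    bytes =: a. i. 1!:1 <'file.bin'   NB. read, then map chars to 0..255
>
> This is explicit about being octets, at the cost of 8 bytes per octet,
> which is what motivates the two issues below.)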
>
> Regarding the second, an obvious solution is to add a 1-byte integer
> representation (as Marshall has suggested on at least one occasion), but
> this represents a potentially nontrivial development effort. Therefore I
> suggest an alternate solution, at least for the interim: foreigns (scary
> and obscure, per above) that will _intentionally misinterpret_ data from
> the outside world as 'UCS-1' and represent it compactly (or do the
> opposite).
>
> Regarding the issue of backwards compatibility, I propose the addition of
> 256 'meta-characters', each corresponding to an octet. Attempts to decode
> correctly formed utf-8 from the outside world will succeed and produce
> corresponding unicode; attempts to decode malformed utf-8 may map each
> incorrect code unit to the corresponding meta-character. When encoded,
> real characters will be utf-8 encoded, but each meta-character will be
> encoded as its corresponding octet. In this way, arbitrary byte streams
> may be passed through j strings; but byte streams which consist entirely
> or partly of valid utf-8 can be sensibly interpreted. This is similar to
> raku's utf8-c8, and to python's surrogateescape.
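>
> (A sketch of the intended round-trip; decode and encode are hypothetical
> stand-ins for the proposed foreigns:
>
>    bytes =: 97 255 98    NB. 255 can never appear in valid utf-8
>    s =: decode bytes     NB. 97 98 decode normally; 255 becomes the
>                          NB. meta-character for octet 255
>    encode s              NB. meta-characters emit their octet
> 97 255 98
>
> So arbitrary bytes survive a trip through a string unmolested.)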
>
> ---
>
> An implementation detail, sort of. Variable-width representations (such
> as utf-8) should not be used internally. Many fundamental array
> operations require constant-time random access (with the corresponding
> obvious caveats), which variable-width representations cannot provide; and
> even operations which are inherently sequential--like E., i., ;., #--may
> be more difficult or impossible to optimize to the same degree.
> Fixed-width representations therefore provide more predictable
> performance, better performance in nearly all cases, and better asymptotic
> performance for many interesting applications.
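>
> (With a fixed-width type, indexing is plain address arithmetic; assuming a
> utf-8 session:
>
>    u =: 7 u: 'aób'    NB. one 2-byte code unit per code point
>    1 { u              NB. element i lives at byte offset 2*i: O(1)
> ó
>
> With utf-8 internally, finding element i would mean scanning from the
> start.)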
>
> (The UCS-1 misinterpretation mentioned above is a loophole which allows
> people who really care about space to do the variable-width part
> themselves.)
>
> ---
>
> I therefore suggest the following language changes, probably to be
> deferred to version 10:
>
> - 1, 2, and 4-byte character representations are still used internally.
>   They are fixed-width, with each code unit representing one code point.
>   In the 4-byte representation, because there are more 32-bit values than
>   unicode code points, some 32-bit values may correspond to passed-through
>   bytes of misencoded utf-8. In this way, a j literal can round-trip
>   arbitrary byte sequences. The remainder of the 32-bit value space is
>   completely inaccessible.
>
> - A new primitive verb U:, to replace u:. u: is removed. U: has a
>   different name, so that old code will break loudly, rather than quietly.
>   If y is an array of integers, then U:y is an array of characters with
>   corresponding code points; and if y is an array of characters, then U:y
>   is an array of their code points; see the sketch after this list.
>   (Alternately, make a. impractically large and rely on a.i.y and x{a.
>   for everything. I disrecommend this for the same reason that we have
>   j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
>
> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3 modes of
>   operation; foreigns for writing to files, 1!:2, 1!:3, and 1!:12, permit
>   2 modes of operation. The reading modes are:
>
>   1. Throw on misencoded utf-8 (default).
>   2. Pass through misencoded bytes as meta-characters.
>   3. Intentionally misinterpret the file as being 'UCS-1' encoded rather
>      than utf-8 encoded.
>
>   The writing modes are:
>
>   1. Encode as utf-8, passing through meta-characters as the corresponding
>      octets (default).
>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only valid
>      for 1-byte characters.
>
> A recommendation: the UCS-1 misinterpretation should be removed if 1-byte
> integers are ever added.
>
> - A new foreign is provided to 'squeeze' character arrays. This is
>   largely cosmetic, but may be useful for some. If some string uses a
>   4-byte representation, but in fact all of its elements' code points are
>   below 65536, then the result will use a smaller representation. (This
>   can also do work on integers, as it can convert them to a boolean
>   representation if they are all 0 or 1; this is, again, marginal.)
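>
> (As promised above, a sketch of the proposed U:; hypothetical, since none
> of this is implemented:
>
>    U: 97 243 98    NB. integers to characters with those code points
> aób
>    U: 'aób'        NB. characters to their code points
> 97 243 98
>    # 'aób'         NB. tally now counts code points, never code units
> 3
> )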
>
> Future directions:
>
> Provide functionality for unicode normalization, casefolding, grapheme
> boundary identification, unicode character properties, and others. Maybe
> this should be done by turning U: into a trenchcoat function; or maybe it
> should be done by library code. There is the potential to reuse existing
> primitives, e.g. <.y might be a lowercased y, but I am wary of such puns.
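>
> (E.g., with a hypothetical normalization verb nfc, the combining-character
> pair p and q from the beginning would finally match:
>
>    (nfc p) -: nfc q    NB. NFC composes 'o' + combining acute into 'ó'
> 1
> )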
>
> Thoughts? Comments?
>
> -E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm