Further clarification: the J language itself knows nothing about the Unicode standard. u: is the only place where UTF-8, UTF-16, etc. are relevant.
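For example (a minimal sketch; assumes a UTF-8 console, and exact display may vary by J version):

   8 u: 233      NB. code point 233 (é), encoded as utf-8 bytes
é
   # 8 u: 233    NB. two bytes, though it displays as one character
2
   7 u: 233      NB. the same code point as a single 2-byte character
é
   # 7 u: 233
1

Everywhere else, these are just literal arrays of different widths.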
On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
> I think the current behavior of u: is correct and intended.
> First of all, J utf8 is not a Unicode datatype; it is merely an
> interpretation of 1-byte literals.
> Similarly, 2-byte and 4-byte literals are not exactly UCS-2 and
> UTF-32, and this is intended.
> Operations and comparisons between different types of literal are
> done by promotion, atom by atom. This explains the results that you
> quoted.
>
> The handling of Unicode in J is not perfect, but it is consistent
> with fundamental J concepts such as rank.
>
> On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
>
>>    x=: 8 u: 97 243 98   NB. same as entering x=: 'aób'
>>    y=: 9 u: x
>>    z=: 10 u: 97 195 179 98
>>    x
>> aób
>>    y
>> aób
>>    z
>> aób
>>
>>    x-:y
>> 0
>> NB. ??? they look the same
>>
>>    x-:z
>> 1
>> NB. ??? they look different
>>
>>    $x
>> 4
>> NB. ??? it looks like 3 characters, not 4
>>
>> Well, this is unicode. There are good reasons why two things that
>> look the same might not actually be the same. For instance:
>>
>>    ]p=: 10 u: 97 243 98
>> aób
>>    ]q=: 10 u: 97 111 769 98
>> aób
>>    p-:q
>> 0
>>
>> But in the above case, x doesn't match y for stupid reasons. And x
>> matches z for stupider ones.
>>
>> J's default (1-byte) character representation is a weird hodge-podge
>> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it does
>> not seem well thought through. The dictionary page for u: seems
>> confused as to whether the 1-byte representation corresponds to ASCII
>> or UTF-8, and similarly as to whether the 2-byte representation is
>> coded as UCS-2 or UTF-16.
>>
>> Most charitably, this is exposing low-level aspects of the encoding
>> to users; but if so, that is unsuitable for a high-level language
>> such as J, and it is inconsistent. I do not have to worry that
>> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
>> 2.718 will suddenly turn into 4613302810693613912, but _that is
>> exactly what is happening in the above code_.
>>
>> I give you the crowning WTF (maybe it is not so surprising at this
>> point...):
>>
>>    x;y;x,y   NB. pls j
>> ┌───┬───┬───────┐
>> │aób│aób│aÃ³baób│
>> └───┴───┴───────┘
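>>
>> (To see why that last box is seven atoms wide: catenation promotes
>> x's utf-8 bytes to 2-byte characters one-for-one. A sketch--verify
>> in your own session, since display can vary:)
>>
>>    3 u: x,y   NB. integer values of the atoms of the catenation
>> 97 195 179 98 97 243 98
>>
>> (97 195 179 98 are x's raw bytes, each now posing as a code point;
>> 97 243 98 is y.)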
>>
>> Unicode is delicate and skittish, and must be approached delicately.
>> I think that there are some essential conflicts between unicode and
>> J--as the above example with the combining character demonstrates--
>> but also that pandora's box is open: literal data _exists_ in J.
>> Given that that is the case, I think it is possible and desirable to
>> do much better than the current scheme.
>>
>> ---
>>
>> Unicode text can be broken up in a number of ways: graphemes,
>> characters, code points, code units...
>>
>> The composition of code units into code points is the only such
>> demarcation which is stable and can be counted upon. It is also a
>> demarcation which is necessary for pretty much any interesting text
>> processing (to the point that I would suggest any form of 'text
>> processing' which does not consider code points is not actually
>> processing text). Therefore I suggest that, at a minimum, no
>> user-exposed representation of text should acknowledge a delineation
>> below that of the code point. If there is any primitive which deals
>> in code units, it should be a foreign: scary, obscure, not for
>> everyday use.
>>
>> A non-obvious but good result of the above is that all strings are
>> correctly formed by construction. Not all sequences of code units
>> are correctly formed and correspond to valid strings of text. But
>> all sequences of code points _are_, of necessity, correctly formed;
>> otherwise there would be ... problems following additions to
>> unicode. J currently allows us to create malformed strings, but
>> then complains when we use them in certain ways:
>>
>>    9 u: 1 u: 10 u: 254 255
>> |domain error
>> |   9 u:1 u:10 u:254 255
>>
>> ---
>>
>> It is a question whether J should natively recognise delineations
>> above the code point. It pains me to suggest that it should not.
>>
>> Raku (a pointer-chasing language) has the best-thought-out strings
>> of any programming language I have encountered. (Unsurprising, given
>> it was written by perl hackers.) In raku, operations on strings are
>> grapheme-oriented. Raku also normalizes all text by default (which
>> solves the problem I presented above with combining characters--but
>> rest assured, it cannot solve all such problems). They even have a
>> scheme for space-efficient random access to strings on this basis.
>>
>> But J is not raku, and it is telling that, though raku has
>> multidimensional arrays, its strings are _not_ arrays, and it does
>> not have characters. The principal problem is a violation of the
>> rules of conformability. For instance, it is not guaranteed that,
>> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
>> (though it is pretty bad), but from it follows an obvious problem
>> with catenating higher-rank arrays. Similar concerns apply at least
>> to i., e., E., and }. That said, I would support the addition of
>> primitives to perform normalization (as well as casefolding etc.)
>> and identification of grapheme boundaries.
>>
>> ---
>>
>> It would be wise of me to address the elephant in the room.
>> Characters are not only used to represent text, but also arbitrary
>> binary data, e.g. from the network or files, which may in fact be
>> malformed as text. I submit that characters are clearly the wrong
>> way to represent such data; the right way to represent a sequence of
>> _octets_ is using _integers_. But people persist, and there are two
>> issues: the first is compatibility, and the second is performance.
>>
>> Regarding the second, an obvious solution is to add a 1-byte integer
>> representation (as Marshall has suggested on at least one occasion),
>> but this represents a potentially nontrivial development effort.
>> Therefore I suggest an alternate solution, at least for the interim:
>> foreigns (scary and obscure, per above) that will _intentionally
>> misinterpret_ data from the outside world as 'UCS-1' and represent
>> it compactly (or do the opposite).
>>
>> Regarding the issue of backwards compatibility, I propose the
>> addition of 256 'meta-characters', each corresponding to an octet.
>> Attempts to decode correctly formed utf-8 from the outside world
>> will succeed and produce corresponding unicode; attempts to decode
>> malformed utf-8 may map each incorrect code unit to the
>> corresponding meta-character. When encoded, real characters will be
>> utf-8 encoded, but each meta-character will be encoded as its
>> corresponding octet. In this way, arbitrary byte streams may be
>> passed through J strings; but byte streams which consist entirely or
>> partly of valid utf-8 can be sensibly interpreted. This is similar
>> to raku's utf8-c8, and to python's surrogateescape.
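>>
>> (For reference, the octets-as-integers view is already expressible
>> today via the alphabet a.--a sketch:)
>>
>>    octets=: a. i. 8 u: 97 243 98   NB. bytes to integers
>>    octets
>> 97 195 179 98
>>    octets { a.                     NB. and back to bytes
>> aób
>>
>> (The catch is space: these integers are full-width, which is why a
>> 1-byte integer representation, or the interim misinterpretation
>> foreigns above, would be needed.)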
>>
>> ---
>>
>> An implementation detail, sort of. Variable-width representations
>> (such as utf-8) should not be used internally. Many fundamental
>> array operations require constant-time random access (with the
>> corresponding obvious caveats), which variable-width representations
>> cannot provide; and even operations which are inherently
>> sequential--like E., i., ;., #--may be more difficult or impossible
>> to optimize to the same degree. Fixed-width representations
>> therefore provide more predictable performance, better performance
>> in nearly all cases, and better asymptotic performance for many
>> interesting applications.
>>
>> (The UCS-1 misinterpretation mentioned above is a loophole which
>> allows people who really care about space to do the variable-width
>> part themselves.)
>>
>> ---
>>
>> I therefore suggest the following language changes, probably to be
>> deferred to version 10:
>>
>> - 1, 2, and 4-byte character representations are still used
>>   internally. They are fixed-width, with each code unit representing
>>   one code point. In the 4-byte representation, because there are
>>   more 32-bit values than unicode code points, some 32-bit values
>>   may correspond to passed-through bytes of misencoded utf8. In this
>>   way, a J literal can round-trip arbitrary byte sequences. The
>>   remainder of the 32-bit value space is completely inaccessible.
>>
>> - A new primitive verb U:, to replace u:. u: is removed. U: has a
>>   different name, so that old code will break loudly, rather than
>>   quietly. If y is an array of integers, then U:y is an array of
>>   characters with corresponding codepoints; and if y is an array of
>>   characters, then U:y is an array of their code points. (A rough
>>   model in terms of today's u: is sketched below, after this list.)
>>   (Alternately, make a. impractically large and rely on a.i.y and
>>   x{a. for everything. I disrecommend this for the same reason that
>>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
>>
>> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
>>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
>>   1!:12, permit 2 modes of operation. The reading modes are:
>>
>>   1. Throw on misencoded utf-8 (default).
>>   2. Pass through misencoded bytes as meta-characters.
>>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
>>      rather than utf-8 encoded.
>>
>>   The writing modes are:
>>
>>   1. Encode as utf-8, passing through meta-characters as the
>>      corresponding octets (default).
>>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
>>      valid for 1-byte characters.
>>
>>   A recommendation: the UCS-1 misinterpretation should be removed if
>>   1-byte integers are ever added.
>>
>> - A new foreign is provided to 'sneeze' character arrays. This is
>>   largely cosmetic, but may be useful for some. If some string uses
>>   a 4-byte representation, but in fact all of its elements' code
>>   points are below 65536, then the result will use a smaller
>>   representation. (This can also do work on integers, as it can
>>   convert them to a boolean representation if they are all 0 or 1;
>>   this is, again, marginal.)
>>
>> Future directions:
>>
>> Provide functionality for unicode normalization, casefolding,
>> grapheme boundary identification, unicode character properties, and
>> others. Maybe this should be done by turning U: into a trenchcoat
>> function; or maybe it should be done by library code. There is the
>> potential to reuse existing primitives, e.g. <.y might be a
>> lowercased y, but I am wary of such puns.
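>>
>> (Here is the promised rough model of monadic U:--hypothetical, of
>> course, since U: does not exist. It follows the proposal's
>> atoms-are-code-points reading, ignores meta-characters, and only
>> handles code points below 65536:)
>>
>>    Ucolon=: monad define
>>     if. (3!:0 y) e. 2 131072 262144 do.  NB. literal of any width:
>>      3 u: y                              NB. atom values, i.e. code points
>>     else.                                NB. integers:
>>      7 u: y                              NB. characters with those points
>>     end.
>>    )
>>
>>    Ucolon 97 243 98
>> aób
>>    Ucolon Ucolon 97 243 98
>> 97 243 98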
>>
>> Thoughts? Comments?
>>
>> -E

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
