   datatype '👩‍🦰'
literal
   a.i. '👩‍🦰'
240 159 145 169 226 128 141 240 159 166 176
   9 u: '👩‍🦰'
👩‍🦰
   datatype 9 u: '👩‍🦰'
unicode4
   # 9 u: '👩‍🦰'
3
   {. 9 u: '👩‍🦰'
👩
   1{ 9 u: '👩‍🦰'   NB. the zero-width joiner U+200D, which renders invisibly
‍
   2{ 9 u: '👩‍🦰'
🦰
The emoji has 3 Unicode codepoints. Therefore the reason it can't be
represented as an atom is not a deficiency of J.
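The byte and codepoint arithmetic above can be cross-checked outside J; here is a minimal Python sketch (Python is used only because its string type exposes codepoints directly):

```python
# "Woman: red hair" is an emoji ZWJ sequence: three codepoints (woman,
# zero-width joiner, red-hair component) that renderers draw as one glyph.
s = "\U0001F469\u200D\U0001F9B0"

# Three codepoints, matching J's  # 9 u: '👩‍🦰'
assert len(s) == 3

# Eleven UTF-8 bytes, matching J's  a.i. '👩‍🦰'
assert list(s.encode("utf-8")) == [240, 159, 145, 169,   # U+1F469
                                   226, 128, 141,        # U+200D
                                   240, 159, 166, 176]   # U+1F9B0
```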
On Sun, Mar 20, 2022, 4:18 AM Don Guinn <[email protected]> wrote:
> I apologise for not looking at the changes to unicode. Now limited to 21
> bits, unicode4 covers all unicode code points in one atom.
> But then I stumbled onto this: 👩‍🦰 - girl with red hair. An emoji ZWJ
> sequence is a character that requires more than one unicode code point. I
> guess that this is a little beyond unicode4. Oh well.
>
> 3 u: '👩‍🦰'
> 240 159 145 169 226 128 141 240 159 166 176
>
> ucpcount '👩‍🦰'
> 5
>
> On Sat, Mar 19, 2022 at 9:36 AM bill lam <[email protected]> wrote:
>
> > I don't get it. Can you demo with an example?
> >
> > On Sat, Mar 19, 2022 at 11:15 PM Don Guinn <[email protected]> wrote:
> >
> > > I use UTF-16 and UTF-32 to try to get the code-point of UTF-8
> characters
> > so
> > > I can get each character onto one atom. That way I don't have to worry
> > > about how many atoms each character takes. Unfortunately UTF-16 and
> > UTF-32
> > > don't guarantee the characters are in one atom each. It would be nice
> if
> > U:
> > > had an option to give the code-points of unicode characters.
> > >
> > > On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:
> > >
> > > > Further clarification, J language itself knows nothing about unicode
> > > > standard.
> > > > u: is the only place where utf8, utf16 etc are relevant.
> > > >
> > > >
> > > > On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]>
> wrote:
> > > >
> > > > > I think the current behavior of u: is correct and intended.
> > > > > First of all, J utf8 is not a unicode datatype; it is merely an
> > > > > interpretation of 1-byte literals.
> > > > > Similarly, 2-byte and 4-byte literals aren't exactly UCS-2 and
> > > > > UTF-32, and this is intended.
> > > > > Operations and comparisons between different types of literal are
> > > > > done by promotion atom by atom. This explains the results that you
> > > > > quoted.
> > > > >
> > > > > The handling of unicode in J is not perfect, but it is consistent
> > > > > with J's fundamental concepts such as rank.
> > > > >
> > > > > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]>
> > > wrote:
> > > > >
> > > > >> x=: 8 u: 97 243 98 NB. same as entering x=: 'aób'
> > > > >> y=: 9 u: x
> > > > >> z=: 10 u: 97 195 179 98
> > > > >> x
> > > > >> aób
> > > > >> y
> > > > >> aób
> > > > >> z
> > > > >> aÃ³b
> > > > >>
> > > > >> x-:y
> > > > >> 0
> > > > >> NB. ??? they look the same
> > > > >>
> > > > >> x-:z
> > > > >> 1
> > > > >> NB. ??? they look different
> > > > >>
> > > > >> $x
> > > > >> 4
> > > > >> NB. ??? it looks like 3 characters, not 4
> > > > >>
> > > > >> Well, this is unicode. There are good reasons why two things that
> > > look
> > > > >> the same might not actually be the same. For instance:
> > > > >>
> > > > >> ]p=: 10 u: 97 243 98
> > > > >> aób
> > > > >> ]q=: 10 u: 97 111 769 98
> > > > >> aób
> > > > >> p-:q
> > > > >> 0
> > > > >>
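The composed/decomposed pair quoted above is exactly what Unicode normalization addresses; a sketch in Python, whose unicodedata module implements the standard normalization forms:

```python
import unicodedata

p = "a\u00F3b"   # precomposed o-acute, as in  10 u: 97 243 98
q = "ao\u0301b"  # o + combining acute, as in  10 u: 97 111 769 98

assert p != q    # mirrors J's  p -: q  giving 0
# NFC normalization fuses o + U+0301 into U+00F3, making them compare equal:
assert unicodedata.normalize("NFC", q) == p
```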
> > > > >> But in the above case, x doesn't match y for stupid reasons. And
> x
> > > > >> matches z for stupider ones.
> > > > >>
> > > > >> J's default (1-byte) character representation is a weird
> hodge-podge
> > > of
> > > > >> 'UCS-1' (I don't know what else to call it) and UTF-8, and it does
> > not
> > > > >> seem well thought through. The dictionary page for u: seems
> > confused
> > > as
> > > > >> to whether the 1-byte representation corresponds to ASCII or
> UTF-8,
> > > and
> > > > >> similarly as to whether the 2-byte representation is coded as
> UCS-2
> > or
> > > > >> UTF-16.
> > > > >>
> > > > >> Most charitably, this is exposing low-level aspects of the
> encoding
> > to
> > > > >> users, but if so, that is unsuitable for a high-level language
> such
> > as
> > > > j,
> > > > >> and it is inconsistent. I do not have to worry that 0 1 1 0 1 1
> 0 1
> > > > will
> > > > >> suddenly turn into 36169536663191680, nor that 2.718 will suddenly
> > > turn
> > > > >> into 4613302810693613912, but _that is exactly what is happening
> in
> > > the
> > > > >> above code_.
> > > > >>
> > > > >> I give you the crowning WTF (maybe it is not so surprising at this
> > > > >> point...):
> > > > >>
> > > > >> x;y;x,y NB. pls j
> > > > >> ┌───┬───┬───────┐
> > > > >> │aób│aób│aÃ³baób│
> > > > >> └───┴───┴───────┘
> > > > >>
> > > > >> Unicode is delicate and skittish, and must be approached
> delicately.
> > > I
> > > > >> think that there are some essential conflicts between unicode and
> > > j--as
> > > > >> the above example with the combining character demonstrates--but
> > also
> > > > >> that
> > > > >> pandora's box is open: literal data _exists_ in j. Given that
> that
> > is
> > > > >> the
> > > > >> case, I think it is possible and desirable to do much better than
> > the
> > > > >> current scheme.
> > > > >>
> > > > >> ---
> > > > >>
> > > > >> Unicode text can be broken up in a number of ways. Graphemes,
> > > > >> characters,
> > > > >> code points, code units...
> > > > >>
> > > > >> The composition of code units into code points is the only such
> > > > >> demarcation which is stable and can be counted upon. It is also a
> > > > >> demarcation which is necessary for pretty much any interesting
> text
> > > > >> processing (to the point that I would suggest any form of 'text
> > > > >> processing' which does not consider code points is not actually
> > > > >> processing
> > > > >> text). Therefore, I suggest that, at a minimum, no user-exposed
> > > > >> representation of text should acknowledge a delineation below that
> > of
> > > > the
> > > > >> code point. If there is any primitive which deals in code units,
> it
> > > > >> should be a foreign: scary, obscure, not for everyday use.
> > > > >>
> > > > >> A non-obvious but good result of the above is that all strings are
> > > > >> correctly-formed by construction. Not all sequences of code units
> > are
> > > > >> correctly formed and correspond to valid strings of text. But all
> > > > >> sequences of code points _are_, of necessity, correctly formed,
> > > > otherwise
> > > > >> there would be ... problems following additions to unicode. J
> > > currently
> > > > >> allows us to create malformed strings, but then complains when we
> > use
> > > > >> them
> > > > >> in certain ways:
> > > > >>
> > > > >> 9 u: 1 u: 10 u: 254 255
> > > > >> |domain error
> > > > >> | 9 u:1 u:10 u:254 255
> > > > >>
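The domain error just quoted fires because 254 255 (0xFE 0xFF) can never occur in well-formed UTF-8. The same strictness in Python, for comparison:

```python
# 0xFE and 0xFF are not valid anywhere in UTF-8, so a strict decoder
# must reject them -- the analogue of J's domain error above.
try:
    bytes([254, 255]).decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```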
> > > > >> ---
> > > > >>
> > > > >> It is a question whether j should natively recognise delineations
> > > above
> > > > >> the code point. It pains me to suggest that it should not.
> > > > >>
> > > > >> Raku (a pointer-chasing language) has the best-thought-out strings
> > of
> > > > any
> > > > >> programming language I have encountered. (Unsurprising, given it
> > was
> > > > >> written by perl hackers.) In raku, operations on strings are
> > > > >> grapheme-oriented. Raku also normalizes all text by default
> (which
> > > > >> solves
> > > > >> the problem I presented above with combining characters--but rest
> > > > >> assured,
> > > > >> it can not solve all such problems). They even have a scheme for
> > > > >> space-efficient random access to strings on this basis.
> > > > >>
> > > > >> But j is not raku, and it is telling that, though raku has
> > > > >> multidimensional arrays, its strings are _not_ arrays, and it does
> > not
> > > > >> have characters. The principal problem is a violation of the
> > > > >> rules of conformability. For instance, it is not guaranteed that,
> > > > >> for
> > vectors
> > > x
> > > > >> and y, (#x,y) -: x +&# y. This is not _so_ terrible (though it is
> > > > pretty
> > > > >> bad), but from it follows an obvious problem with catenating
> > > higher-rank
> > > > >> arrays. Similar concerns apply at least to i., e., E., and }.
> That
> > > > >> said,
> > > > >> I would support the addition of primitives to perform
> normalization
> > > (as
> > > > >> well as casefolding etc.) and identification of grapheme
> boundaries.
> > > > >>
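The conformability worry can be made concrete. Under codepoint semantics lengths are additive, but under normalization- or grapheme-aware semantics they are not; a Python sketch of the failure mode:

```python
import unicodedata

x = "o"
y = "\u0301"   # combining acute accent

# Codepoint lengths are additive, like J's (#x,y) -: x +&# y ...
assert len(x + y) == len(x) + len(y)
# ... but the normalized (roughly grapheme-like) length is not:
assert len(unicodedata.normalize("NFC", x + y)) == 1
```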
> > > > >> ---
> > > > >>
> > > > >> It would be wise of me to address the elephant in the room.
> > > Characters
> > > > >> are not only used to represent text, but also arbitrary binary
> data,
> > > > e.g.
> > > > >> from the network or files, which may in fact be malformed as text.
> > I
> > > > >> submit that characters are clearly the wrong way to represent such
> > > data;
> > > > >> the right way to represent a sequence of _octets_ is using
> > _integers_.
> > > > >> But people persist, and there are two issues: the first is
> > > > compatibility,
> > > > >> and the second is performance.
> > > > >>
> > > > >> Regarding the second, an obvious solution is to add a 1-byte
> integer
> > > > >> representation (as Marshall has suggested on at least one
> occasion),
> > > but
> > > > >> this represents a potentially nontrivial development effort.
> > > Therefore
> > > > I
> > > > >> suggest an alternate solution, at least for the interim: foreigns
> > > (scary
> > > > >> and obscure, per above) that will _intentionally misinterpret_
> data
> > > from
> > > > >> the outside world as 'UCS-1' and represent it compactly (or do the
> > > > >> opposite).
> > > > >>
> > > > >> Regarding the issue of backwards compatibility, I propose the
> > addition
> > > > of
> > > > >> 256 'meta-characters', each corresponding to an octet. Attempts
> to
> > > > >> decode
> > > > >> correctly formed utf-8 from the outside world will succeed and
> > produce
> > > > >> corresponding unicode; attempts to decode malformed utf-8 may map
> > each
> > > > >> incorrect code unit to the corresponding meta-character. When
> > > encoded,
> > > > >> real characters will be utf-8 encoded, but each meta-character
> will
> > be
> > > > >> encoded as its corresponding octet. In this way, arbitrary byte
> > > streams
> > > > >> may be passed through j strings; but byte streams which consist
> > > entirely
> > > > >> or partly of valid utf-8 can be sensibly interpreted. This is
> > similar
> > > > to
> > > > >> raku's utf8-c8, and to python's surrogateescape.
> > > > >>
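For reference, Python's surrogateescape (cited just above) round-trips arbitrary bytes through str much as the proposed meta-characters would:

```python
raw = b"ok\xfe\xff"   # partly valid UTF-8, partly garbage

# Misencoded bytes are smuggled through as lone surrogates U+DC80..U+DCFF...
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "ok\udcfe\udcff"

# ...and restored byte-for-byte on encode, so arbitrary streams pass through.
assert s.encode("utf-8", errors="surrogateescape") == raw
```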
> > > > >> ---
> > > > >>
> > > > >> An implementation detail, sort of. Variable-width representations
> > > (such
> > > > >> as utf-8) should not be used internally. Many fundamental array
> > > > >> operations require constant-time random access (with the
> > corresponding
> > > > >> obvious caveats), which variable-width representations cannot
> > provide;
> > > > >> and
> > > > >> even operations which are inherently sequential--like E., i., ;.,
> > > #--may
> > > > >> be more difficult or impossible to optimize to the same degree.
> > > > >> Fixed-width representations therefore provide more predictable
> > > > >> performance, better performance in nearly all cases, and better
> > > > >> asymptotic
> > > > >> performance for many interesting applications.
> > > > >>
> > > > >> (The UCS-1 misinterpretation mentioned above is a loophole which
> > > allows
> > > > >> people who really care about space to do the variable-width part
> > > > >> themselves.)
> > > > >>
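As a data point for the fixed-width argument above, CPython's str (PEP 393) already works this way: each string is stored fixed-width at 1, 2, or 4 bytes per codepoint, chosen per string, so indexing stays O(1). A rough demonstration (exact object sizes are an implementation detail):

```python
import sys

narrow = "aaaa"           # all codepoints < 128 -> 1 byte per codepoint
wide   = "a\U0001F469aa"  # one astral codepoint -> 4 bytes per codepoint

# Appending one character grows the buffer by the per-codepoint width.
assert sys.getsizeof(narrow + "a") - sys.getsizeof(narrow) == 1
assert sys.getsizeof(wide + "a") - sys.getsizeof(wide) == 4
```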
> > > > >> ---
> > > > >>
> > > > >> I therefore suggest the following language changes, probably to be
> > > > >> deferred to version 10:
> > > > >>
> > > > >> - 1, 2, and 4-byte character representations are still used
> > > internally.
> > > > >> They are fixed-width, with each code unit representing one code
> > > > point.
> > > > >> In the 4-byte representation, because there are more 32-bit
> > values
> > > > than
> > > > >> unicode code points, some 32-bit values may correspond to
> > > > >> passed-through
> > > > >> bytes of misencoded utf8. In this way, a j literal can
> > round-trip
> > > > >> arbitrary byte sequences. The remainder of the 32-bit value
> > space
> > > is
> > > > >> completely inaccessible.
> > > > >>
> > > > >> - A new primitive verb U:, to replace u:. u: is removed. U: has
> a
> > > > >> different name, so that old code will break loudly, rather than
> > > > >> quietly.
> > > > >> If y is an array of integers, then U:y is an array of
> characters
> > > with
> > > > >> corresponding codepoints; and if y is an array of characters,
> > then
> > > > U:y
> > > > >> is an array of their code points. (Alternately, make a.
> > > > impractically
> > > > >> large and rely on a.i.y and x{a. for everything. I
> disrecommend
> > > this
> > > > >> for the same reason that we have j. and r., and do not write x
> =
> > > 0j1
> > > > *
> > > > >> y
> > > > >> or x * ^ 0j1 * y.)
> > > > >>
> > > > >> - Foreigns for reading from files, like 1!:1 and 1!:11 permit 3
> > modes
> > > of
> > > > >> operation; foreigns for writing to files, 1!:2, 1!:3, and
> 1!:12,
> > > > permit
> > > > >> 2 modes of operation. The reading modes are:
> > > > >>
> > > > >> 1. Throw on misencoded utf-8 (default).
> > > > >> 2. Pass-through misencoded bytes as meta characters.
> > > > >> 3. Intentionally misinterpret the file as being 'UCS-1' encoded
> > > > rather
> > > > >> than utf-8 encoded.
> > > > >>
> > > > >> The writing modes are:
> > > > >>
> > > > >> 1. Encode as utf-8, passing through meta characters as the
> > > > >> corresponding
> > > > >> octets (default).
> > > > >> 2. Misinterpret output as 'UCS-1' and perform no encoding.
> Only
> > > > valid
> > > > >> for 1-byte characters.
> > > > >>
> > > > >> A recommendation: the UCS-1 misinterpretation should be removed if
> > > > 1-byte
> > > > >> integers are ever added.
> > > > >>
> > > > >> - A new foreign is provided to 'sneeze' character arrays. This is
> > > > >> largely cosmetic, but may be useful for some. If some string
> > uses
> > > a
> > > > >> 4-byte representation, but in fact, all of its elements' code
> > > points
> > > > >> are
> > > > >> below 65536, then the result will use a smaller representation.
> > > > (This
> > > > >> can also do work on integers, as it can convert them to a
> boolean
> > > > >> representation if they are all 0 or 1; this is, again,
> marginal.)
> > > > >>
> > > > >> Future directions:
> > > > >>
> > > > >> Provide functionality for unicode normalization, casefolding,
> > grapheme
> > > > >> boundary identification, unicode character properties, and others.
> > > > Maybe
> > > > >> this should be done by turning U: into a trenchcoat function; or
> > maybe
> > > > it
> > > > >> should be done by library code. There is the potential to reuse
> > > > existing
> > > > >> primitives, e.g. <.y might be a lowercased y, but I am wary of
> such
> > > > puns.
> > > > >>
> > > > >> Thoughts? Comments?
> > > > >>
> > > > >> -E
> > > > >>
> > > > >> ----------------------------------------------------------------------
> > > > >> For information about J forums see http://www.jsoftware.com/forums.htm