Further clarification: the J language itself knows nothing about the Unicode standard. u: is the only place where UTF-8, UTF-16, etc. are relevant.
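For example (a minimal sketch; assumes a UTF-8 console, and exact display may vary by J version):

   8 u: 233      NB. code point 233 (é), encoded as utf-8 bytes
é
   # 8 u: 233    NB. two bytes, though it displays as one character
2
   7 u: 233      NB. the same code point as a single 2-byte character
é
   # 7 u: 233
1

Everywhere else, these are just literal arrays of different widths.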
On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
> I think the current behavior of u: is correct and intended.
> First of all, J utf8 is not a Unicode datatype; it is merely an
> interpretation of 1-byte literals.
> Similarly, 2-byte and 4-byte literals are not exactly UCS-2 and
> UTF-32, and this is intended.
> Operations and comparisons between different types of literal are
> done by promotion, atom by atom. This explains the results that you
> quoted.
>
> The handling of Unicode in J is not perfect, but it is consistent
> with fundamental J concepts such as rank.
>
> On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
>
>>    x=: 8 u: 97 243 98   NB. same as entering x=: 'aób'
>>    y=: 9 u: x
>>    z=: 10 u: 97 195 179 98
>>    x
>> aób
>>    y
>> aób
>>    z
>> aób
>>
>>    x-:y
>> 0
>> NB. ??? they look the same
>>
>>    x-:z
>> 1
>> NB. ??? they look different
>>
>>    $x
>> 4
>> NB. ??? it looks like 3 characters, not 4
>>
>> Well, this is unicode. There are good reasons why two things that
>> look the same might not actually be the same. For instance:
>>
>>    ]p=: 10 u: 97 243 98
>> aób
>>    ]q=: 10 u: 97 111 769 98
>> aób
>>    p-:q
>> 0
>>
>> But in the above case, x doesn't match y for stupid reasons. And x
>> matches z for stupider ones.
>>
>> J's default (1-byte) character representation is a weird hodge-podge
>> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it does
>> not seem well thought through. The dictionary page for u: seems
>> confused as to whether the 1-byte representation corresponds to ASCII
>> or UTF-8, and similarly as to whether the 2-byte representation is
>> coded as UCS-2 or UTF-16.
>>
>> Most charitably, this is exposing low-level aspects of the encoding
>> to users; but if so, that is unsuitable for a high-level language
>> such as J, and it is inconsistent. I do not have to worry that
>> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
>> 2.718 will suddenly turn into 4613302810693613912, but _that is
>> exactly what is happening in the above code_.
>>
>> I give you the crowning WTF (maybe it is not so surprising at this
>> point...):
>>
>>    x;y;x,y   NB. pls j
>> ┌───┬───┬───────┐
>> │aób│aób│aÃ³baób│
>> └───┴───┴───────┘
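>>
>> (To see why that last box is seven atoms wide: catenation promotes
>> x's utf-8 bytes to 2-byte characters one-for-one. A sketch--verify
>> in your own session, since display can vary:)
>>
>>    3 u: x,y   NB. integer values of the atoms of the catenation
>> 97 195 179 98 97 243 98
>>
>> (97 195 179 98 are x's raw bytes, each now posing as a code point;
>> 97 243 98 is y.)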
>>
>> Unicode is delicate and skittish, and must be approached delicately.
>> I think that there are some essential conflicts between unicode and
>> J--as the above example with the combining character demonstrates--
>> but also that pandora's box is open: literal data _exists_ in J.
>> Given that that is the case, I think it is possible and desirable to
>> do much better than the current scheme.
>>
>> ---
>>
>> Unicode text can be broken up in a number of ways: graphemes,
>> characters, code points, code units...
>>
>> The composition of code units into code points is the only such
>> demarcation which is stable and can be counted upon. It is also a
>> demarcation which is necessary for pretty much any interesting text
>> processing (to the point that I would suggest any form of 'text
>> processing' which does not consider code points is not actually
>> processing text). Therefore I suggest that, at a minimum, no
>> user-exposed representation of text should acknowledge a delineation
>> below that of the code point. If there is any primitive which deals
>> in code units, it should be a foreign: scary, obscure, not for
>> everyday use.
>>
>> A non-obvious but good result of the above is that all strings are
>> correctly formed by construction. Not all sequences of code units
>> are correctly formed and correspond to valid strings of text. But
>> all sequences of code points _are_, of necessity, correctly formed;
>> otherwise there would be ... problems following additions to
>> unicode. J currently allows us to create malformed strings, but
>> then complains when we use them in certain ways:
>>
>>    9 u: 1 u: 10 u: 254 255
>> |domain error
>> |   9 u:1 u:10 u:254 255
>>
>> ---
>>
>> It is a question whether J should natively recognise delineations
>> above the code point. It pains me to suggest that it should not.
>>
>> Raku (a pointer-chasing language) has the best-thought-out strings
>> of any programming language I have encountered. (Unsurprising, given
>> it was written by perl hackers.) In raku, operations on strings are
>> grapheme-oriented. Raku also normalizes all text by default (which
>> solves the problem I presented above with combining characters--but
>> rest assured, it cannot solve all such problems). They even have a
>> scheme for space-efficient random access to strings on this basis.
>>
>> But J is not raku, and it is telling that, though raku has
>> multidimensional arrays, its strings are _not_ arrays, and it does
>> not have characters. The principal problem is a violation of the
>> rules of conformability. For instance, it is not guaranteed that,
>> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
>> (though it is pretty bad), but from it follows an obvious problem
>> with catenating higher-rank arrays. Similar concerns apply at least
>> to i., e., E., and }. That said, I would support the addition of
>> primitives to perform normalization (as well as casefolding etc.)
>> and identification of grapheme boundaries.
>>
>> ---
>>
>> It would be wise of me to address the elephant in the room.
>> Characters are not only used to represent text, but also arbitrary
>> binary data, e.g. from the network or files, which may in fact be
>> malformed as text. I submit that characters are clearly the wrong
>> way to represent such data; the right way to represent a sequence of
>> _octets_ is using _integers_. But people persist, and there are two
>> issues: the first is compatibility, and the second is performance.
>>
>> Regarding the second, an obvious solution is to add a 1-byte integer
>> representation (as Marshall has suggested on at least one occasion),
>> but this represents a potentially nontrivial development effort.
>> Therefore I suggest an alternate solution, at least for the interim:
>> foreigns (scary and obscure, per above) that will _intentionally
>> misinterpret_ data from the outside world as 'UCS-1' and represent
>> it compactly (or do the opposite).
>>
>> Regarding the issue of backwards compatibility, I propose the
>> addition of 256 'meta-characters', each corresponding to an octet.
>> Attempts to decode correctly formed utf-8 from the outside world
>> will succeed and produce corresponding unicode; attempts to decode
>> malformed utf-8 may map each incorrect code unit to the
>> corresponding meta-character. When encoded, real characters will be
>> utf-8 encoded, but each meta-character will be encoded as its
>> corresponding octet. In this way, arbitrary byte streams may be
>> passed through J strings; but byte streams which consist entirely or
>> partly of valid utf-8 can be sensibly interpreted. This is similar
>> to raku's utf8-c8, and to python's surrogateescape.
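>>
>> (For reference, the octets-as-integers view is already expressible
>> today via the alphabet a.--a sketch:)
>>
>>    octets=: a. i. 8 u: 97 243 98   NB. bytes to integers
>>    octets
>> 97 195 179 98
>>    octets { a.                     NB. and back to bytes
>> aób
>>
>> (The catch is space: these integers are full-width, which is why a
>> 1-byte integer representation, or the interim misinterpretation
>> foreigns above, would be needed.)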
>>
>> ---
>>
>> An implementation detail, sort of. Variable-width representations
>> (such as utf-8) should not be used internally. Many fundamental
>> array operations require constant-time random access (with the
>> corresponding obvious caveats), which variable-width representations
>> cannot provide; and even operations which are inherently
>> sequential--like E., i., ;., #--may be more difficult or impossible
>> to optimize to the same degree. Fixed-width representations
>> therefore provide more predictable performance, better performance
>> in nearly all cases, and better asymptotic performance for many
>> interesting applications.
>>
>> (The UCS-1 misinterpretation mentioned above is a loophole which
>> allows people who really care about space to do the variable-width
>> part themselves.)
>>
>> ---
>>
>> I therefore suggest the following language changes, probably to be
>> deferred to version 10:
>>
>> - 1, 2, and 4-byte character representations are still used
>>   internally. They are fixed-width, with each code unit representing
>>   one code point. In the 4-byte representation, because there are
>>   more 32-bit values than unicode code points, some 32-bit values
>>   may correspond to passed-through bytes of misencoded utf8. In this
>>   way, a J literal can round-trip arbitrary byte sequences. The
>>   remainder of the 32-bit value space is completely inaccessible.
>>
>> - A new primitive verb U:, to replace u:. u: is removed. U: has a
>>   different name, so that old code will break loudly, rather than
>>   quietly. If y is an array of integers, then U:y is an array of
>>   characters with corresponding codepoints; and if y is an array of
>>   characters, then U:y is an array of their code points. (A rough
>>   model in terms of today's u: is sketched below, after this list.)
>>   (Alternately, make a. impractically large and rely on a.i.y and
>>   x{a. for everything. I disrecommend this for the same reason that
>>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
>>
>> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
>>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
>>   1!:12, permit 2 modes of operation. The reading modes are:
>>
>>   1. Throw on misencoded utf-8 (default).
>>   2. Pass through misencoded bytes as meta-characters.
>>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
>>      rather than utf-8 encoded.
>>
>>   The writing modes are:
>>
>>   1. Encode as utf-8, passing through meta-characters as the
>>      corresponding octets (default).
>>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
>>      valid for 1-byte characters.
>>
>>   A recommendation: the UCS-1 misinterpretation should be removed if
>>   1-byte integers are ever added.
>>
>> - A new foreign is provided to 'sneeze' character arrays. This is
>>   largely cosmetic, but may be useful for some. If some string uses
>>   a 4-byte representation, but in fact all of its elements' code
>>   points are below 65536, then the result will use a smaller
>>   representation. (This can also do work on integers, as it can
>>   convert them to a boolean representation if they are all 0 or 1;
>>   this is, again, marginal.)
>>
>> Future directions:
>>
>> Provide functionality for unicode normalization, casefolding,
>> grapheme boundary identification, unicode character properties, and
>> others. Maybe this should be done by turning U: into a trenchcoat
>> function; or maybe it should be done by library code. There is the
>> potential to reuse existing primitives, e.g. <.y might be a
>> lowercased y, but I am wary of such puns.
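>>
>> (Here is the promised rough model of monadic U:--hypothetical, of
>> course, since U: does not exist. It follows the proposal's
>> atoms-are-code-points reading, ignores meta-characters, and only
>> handles code points below 65536:)
>>
>>    Ucolon=: monad define
>>     if. (3!:0 y) e. 2 131072 262144 do.  NB. literal of any width:
>>      3 u: y                              NB. atom values, i.e. code points
>>     else.                                NB. integers:
>>      7 u: y                              NB. characters with those points
>>     end.
>>    )
>>
>>    Ucolon 97 243 98
>> aób
>>    Ucolon Ucolon 97 243 98
>> 97 243 98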
>>
>> Thoughts? Comments?
>>
>> -E

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
