It's even more complicated than that.

The complexity we see at https://unicode.org/ucd/ is daunting.
Consider, for example, the different *kinds* of line break handling
documented at
http://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt -- what
would the implications be for a tokenizer?

Basically... tokenizing unicode is just asking for problems. In
practical terms, we almost always work with a subset of the complexity
offered by unicode (which basically means we ignore most of the
standard's documentation).

That said... if someone would be willing to construct a database
representing the character class for each unicode character (roughly
speaking, the mj values at
http://jsoftware.com/pipermail/programming/2020-November/056788.html
for 2^21 unicode characters), I would be willing to build a J model
which parses unicode according to that database.

This hypothetical unicode database would not even have to be
completely accurate -- I am just looking for a compact representation
which would allow people to go in and mark characters to be handled as
tokens, alphanumeric, whitespace, or whatever. (But, of course, if new
classifications were to be supported, that would require code
updates.)

This isn't an easy task, though, and I wouldn't expect the consequent
code to be pretty.

Thanks,

--
Raul

On Sun, Nov 22, 2020 at 12:37 AM bill lam <[email protected]> wrote:
>
> unicode contains both letters and symbols, and there are no trivial ways to
> distinguish them from codepoints.
> eg
> This is expected
>    echo (0;sj;mj) ;: 'A÷B'
> +-+-+-+
> |A|÷|B|
> +-+-+-+
>
> But this is not so useful, ideally it should be recognized as a single
> token.
>    echo (0;sj;mj) ;: 'Schrödinger'
> +----+-+------+
> |Schr|ö|dinger|
> +----+-+------+
>
>
>
> On Sun, Nov 22, 2020 at 3:07 AM Don Guinn <[email protected]> wrote:
>
> > Yes. I forgot about J=._1 . And I just looked at the UTF-8 description and
> > saw that it had been restricted to 3 continuation bytes, for a maximum of
> > 21 bits, in 2003. Previously allowed up to 7 continuation bytes.
> >
> > Hey!
> > This has been a fun exercise whether or not it is put into production.
> >
> > On Sat, Nov 21, 2020 at 11:48 AM Raul Miller <[email protected]>
> > wrote:
> >
> > > I think we need to be a little clearer about what you're trying to
> > > achieve when you say "catch the utf-8 characters".
> > >
> > > Your change to mj made all utf-8 *octets* be treated as alphabetic.
> > > That's... an approach, certainly.
> > >
> > > Meanwhile, according to https://en.wikipedia.org/wiki/UTF-8#Encoding
> > > there are three different utf-8 starts. I don't know what you are
> > > referring to when you say that there are seven possible U starts.
> > >
> > > But there is a way for ;: to report errors in the stream. See attached
> > > for a demonstration (I made quoted strings be an error.)
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Sat, Nov 21, 2020 at 1:05 PM Don Guinn <[email protected]> wrote:
> > >
> > > > I was not precise in my earlier response. I should have said that
> > > > detecting the wrong number of UTF-8 continuation bytes would be
> > > > difficult in the sequential machine as you would probably need to
> > > > detect seven possible U starts and seven U continues to properly
> > > > check. That would make a very large JS, although only checking for
> > > > up to three continuation bytes would probably be sufficient.
> > > >
> > > > And I should have said that your SJ and MJ did not catch the UTF-8
> > > > characters in my test. Each byte was still treated as "other". Your
> > > > approach is better as it allows the possibility of treating UTF-8 as
> > > > "other", but would contain more than one byte - the entire UTF-8
> > > > sequence. I haven't looked at your SJ yet to try to find out why it
> > > > doesn't catch the UTF-8.
> > > >
> > > > But what does one do if there is an error in the data?
> > > > ;: returns errors if SJ and MJ are not constructed properly, but
> > > > there is no way to report an error for bad data. And if there were,
> > > > what would a programmer do about it?
> > > > So is it necessary to detect bad UTF-8 sequences? Probably not. And
> > > > for now at least treating all UTF-8 like alp/num would probably be
> > > > what one would want. Let the display of the data show the errors.
> > > >
> > > > On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]>
> > > > wrote:
> > > >
> > > > > (That said, it's also worth noting that the state table I presented
> > > > > here doesn't give an error for unbalanced quotes, either. So if you
> > > > > want errors to be thrown, you should probably be updating the state
> > > > > table to force an error there, also.)
> > > > >
> > > > > (And, I should note that I haven't taken a look at what the resulting
> > > > > errors look like. So I don't know how informative the resulting error
> > > > > messages would be...)
> > > > >
> > > > > Thanks again,
> > > > >
> > > > > --
> > > > > Raul
> > > > >
> > > > > On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > Yes, I was surprised (and pleased) that the text file came through
> > > > > > ok.
> > > > > >
> > > > > > As for throwing errors on malformed utf-8 sequences, here's how
> > > > > > that could be implemented:
> > > > > >
> > > > > > (1) We introduce an "error row" which uses operation 2 for all
> > > > > > character classes (and, for consistency, identifies the next row as
> > > > > > itself -- this is mostly so that improper use of the error row
> > > > > > would be relatively obvious). (Operation 2 emits a token, but the
> > > > > > important thing here is that it throws an error if j=-1.)
> > > > > >
> > > > > > (2) For each row which is part of a partially complete utf-8
> > > > > > character, any appearance of any character class which is not a
> > > > > > utf-8 suffix character would use operation 3 and would identify the
> > > > > > next row as the error row. (Operation 3 emits a token, but the
> > > > > > important thing here is that it sets j=-1.)
> > > > > >
> > > > > > (3) Each of the different utf-8 prefixes would lead to a different
> > > > > > row in the state table. For example, the character class containing
> > > > > > character 224 would get a beginning of token row (which gets the
> > > > > > error row treatment and) which leads to a row that expects a utf-8
> > > > > > suffix (which gets the error row treatment) followed by a second
> > > > > > utf-8 suffix (which follows the pattern set by row 1 of the state
> > > > > > table).
> > > > > >
> > > > > > Hopefully that makes sense.
> > > > > >
> > > > > > Note that the final token in a string could not trigger an error.
> > > > > > That's a limitation of the engine and corresponds approximately to
> > > > > > how utf-8 must be treated when a low level buffer boundary splits a
> > > > > > utf-8 character.
> > > > > >
> > > > > > Anyways, the point is that the sequential machine can support the
> > > > > > sort of "count a small number of steps" which is needed here. The
> > > > > > difficulty is more that the machine stops when it reaches the end
> > > > > > of the string. If that's a sensitivity, this could be handled by
> > > > > > appending a linefeed character to the end of the string before
> > > > > > processing and then removing a final linefeed character from the
> > > > > > last token after tokenization.
> > > > > >
> > > > > > Again, I hope this makes sense...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --
> > > > > > Raul
> > > > > >
> > > > > > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > The text file worked great!
> > > > > > >
> > > > > > > As to the UTF-8 codes. What is important is to avoid splitting
> > > > > > > the start bytes from the continuation bytes. Validating the UTF-8
> > > > > > > codes is a difficult task. The start byte includes the number of
> > > > > > > continuation bytes to follow. It would be an error if the number
> > > > > > > of continuation bytes didn't agree.
> > > > > > >
> > > > > > > I tried my test on your definitions and it failed. Attached is a
> > > > > > > text file with your definitions and my test following.
> > > > > > >
> > > > > > > Well, I can't view the attachment. I don't know if it's there or
> > > > > > > not. Just in case, here is my test.
> > > > > > >
> > > > > > >
> > > > > > > NB. A noun to show the handling of UTF-8 in ;:
> > > > > > > test=:{{)n
> > > > > > > The symbol for the Euro is ₠
> > > > > > > Other symools like π show up also
> > > > > > > How about ⌹ in APL NB. ⌹
> > > > > > > Common expressions like 'H₂O' for water
> > > > > > > Common expressions like H₂O for water
> > > > > > > }}
> > > > > > >
> > > > > > > NB. How ;: this sj and mj handles it
> > > > > > > ,.<;.2(0;sj;mj);:test
> > > > > > >
> > > > > > > NB. Assigning UTF8 as character
> > > > > > > mj=: 2 (128+i.128)}mj
> > > > > > >
> > > > > > > NB. How UTF-8 is now handled
> > > > > > > ,.<;.2(0;sj;mj);:test
> > > > > > >
> > > > > > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Here's an updated version, which also retains utf-8 character
> > > > > > > > sequences within token boundaries (instead of splitting them up
> > > > > > > > into multiple tokens). I had originally posted this to the jbeta
> > > > > > > > forum, but it's really a programming topic, and probably belongs
> > > > > > > > here.
> > > > > > > >
> > > > > > > > mj=: 256$0                     NB. X other
> > > > > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > > > > mj=:10 (10)} mj                NB. LF
> > > > > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > > > > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > > > > > > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> > > > > > > >
> > > > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > > > '  X    S    A    N    B    9    .    :    Q    {   LF    }    U    V']0
> > > > > > > >  1.1  0.0  2.1  3.1  2.1  6.1  1.1  1.1  7.1 11.1 10.1 12.1 15.1 16.1  NB. 0 space
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 11.2 10.2 12.2 15.2 16.2  NB. 1 other
> > > > > > > >  1.2  0.3  2.0  2.0  2.0  2.0  1.0  1.0  7.2 11.2 10.2 12.2 15.2 16.2  NB. 2 alp/num
> > > > > > > >  1.2  0.3  2.0  2.0  4.0  2.0  1.0  1.0  7.2 11.2 10.2 12.2 15.2 16.2  NB. 3 N
> > > > > > > >  1.2  0.3  2.0  2.0  2.0  2.0  5.0  1.0  7.2 11.2 10.2 12.2 15.2 16.2  NB. 4 NB
> > > > > > > >  9.0  9.0  9.0  9.0  9.0  9.0  1.0  1.0  9.0  9.0 10.2  9.0  9.0  9.0  NB. 5 NB.
> > > > > > > >  1.4  0.5  6.0  6.0  6.0  6.0  6.0  1.0  7.4 11.4 10.2 12.4 15.2 16.2  NB. 6 num
> > > > > > > >  7.0  7.0  7.0  7.0  7.0  7.0  7.0  7.0  8.0  7.0  7.0  7.0  7.0  7.0  NB. 7 '
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.2  1.2  7.0 11.2 10.2 12.2 15.2 16.2  NB. 8 ''
> > > > > > > >  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0 10.2  9.0  9.0  9.0  NB. 9 comment
> > > > > > > >  1.2  0.2  2.2  3.2  2.2  6.2  1.2  1.2  7.2 11.2 10.2 12.2 15.2 16.2  NB. 10 LF
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 13.0 10.2  1.2 15.2 16.2  NB. 11 {
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  1.2 10.2 14.0 15.2 16.2  NB. 12 }
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2 10.2  1.2 15.2 16.2  NB. 13 {{
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2 10.2  1.2 15.2 16.2  NB. 14 }}
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 11.2 10.2 12.2 15.0 16.0  NB. 15 partial
> > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 11.2 10.2 12.2 15.2 16.0  NB. 16 utf-8
> > > > > > > > )
> > > > > > > >
> > > > > > > > As I noted in the beta forum -- increasing the complexity of the
> > > > > > > > state table adds both rows and columns (size grows proportional
> > > > > > > > to or faster than the square of the number of distinct character
> > > > > > > > types when each character type requires a distinct handling). So
> > > > > > > > it's good to keep this thing as simple as possible.
> > > > > > > >
> > > > > > > > Also, I've not tested this extensively, and it's possible I'll
> > > > > > > > need to make further changes (let me know if you spot any
> > > > > > > > problems).
> > > > > > > >
> > > > > > > > That said...
> > > > > > > > note also that I have *not* implemented the unicode guideline
> > > > > > > > which might suggest that the tokenizer should throw an error on
> > > > > > > > malformed utf-8 sequences. That would require several more rows
> > > > > > > > and columns to achieve the recommended inconvenience. This would
> > > > > > > > also introduce email line wrap, because the state table would
> > > > > > > > become that fat. (I'll attach a copy here as a .txt file, to see
> > > > > > > > if an earlier suggestion -- that the forum would preserve .txt
> > > > > > > > attachments -- might be a way of working around that issue. I
> > > > > > > > suspect not, but it's easy enough to test...)
> > > > > > > >
> > > > > > > > This has *not* been implemented in the current jbeta as the ;:
> > > > > > > > monad. I am not sure if it should, since J the language is based
> > > > > > > > on ascii, not unicode -- it's just convenient that unicode
> > > > > > > > supports an ascii subset.
> > > > > > > >
> > > > > > > > Still... we often do have reason to work with utf-8.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Raul
> > > > > > > >
> > > > > > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Oh, oops, I should have spotted that. Thanks.
> > > > > > > > >
> > > > > > > > > Updated state table:
> > > > > > > > >
> > > > > > > > > mj=: 256$0                     NB. X other
> > > > > > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > > > > > mj=:10 (10)} mj                NB. LF
> > > > > > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > > > > > >
> > > > > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > > > > '  X    S    A    N    B    9    .    :    Q    {   LF    }']0
> > > > > > > > >  1.1  0.0  2.1  3.1  2.1  6.1  1.1  1.1  7.1 11.1 10.1 12.1  NB. 0 space
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 11.2 10.2 12.2  NB. 1 other
> > > > > > > > >  1.2  0.3  2.0  2.0  2.0  2.0  1.0  1.0  7.2 11.2 10.2 12.2  NB. 2 alp/num
> > > > > > > > >  1.2  0.3  2.0  2.0  4.0  2.0  1.0  1.0  7.2 11.2 10.2 12.2  NB. 3 N
> > > > > > > > >  1.2  0.3  2.0  2.0  2.0  2.0  5.0  1.0  7.2 11.2 10.2 12.2  NB. 4 NB
> > > > > > > > >  9.0  9.0  9.0  9.0  9.0  9.0  1.0  1.0  9.0  9.0 10.2  9.0  NB. 5 NB.
> > > > > > > > >  1.4  0.5  6.0  6.0  6.0  6.0  6.0  1.0  7.4 11.4 10.2 12.4  NB. 6 num
> > > > > > > > >  7.0  7.0  7.0  7.0  7.0  7.0  7.0  7.0  8.0  7.0  7.0  7.0  NB. 7 '
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.2  1.2  7.0 11.2 10.2 12.2  NB. 8 ''
> > > > > > > > >  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0 10.2  9.0  NB. 9 comment
> > > > > > > > >  1.2  0.2  2.2  3.2  2.2  6.2  1.2  1.2  7.2 11.2 10.2 12.2  NB. 10 LF
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2 13.0 10.2  1.2  NB. 11 {
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  1.2 10.2 14.0  NB. 12 }
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2 10.2  1.2  NB. 13 {{
> > > > > > > > >  1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2 10.2  1.2  NB. 14 }}
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > > (Note that I haven't coerced this state table to integer form --
> > > > > > > > > floats and integers occupy the same space on 64 bit systems, and
> > > > > > > > > the model doesn't really care about representation.)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Raul
> > > > > > > > >
> > > > > > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > I was talking about
> > > > > > > > > >
> > > > > > > > > >    ;: LF,'.'
> > > > > > > > > > +-+-+
> > > > > > > > > > | |.|
> > > > > > > > > > +-+-+
> > > > > > > > > >
> > > > > > > > > > Henry Rich
> > > > > > > > > >
> > > > > > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > > > > > I tested for that case:
> > > > > > > > > > >
> > > > > > > > > > >    #;:'NB.',LF,LF
> > > > > > > > > > > 3
> > > > > > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > > > > > 3
> > > > > > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > > > > > 4
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
