Hmm... First off, it looks like you're working from the buggy version of utf-8 support, where I was not properly handling utf-8 sequences longer than two octets.
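(For reference, here is a minimal sketch of what a longer sequence looks like to the machine. It assumes the U and V class assignments from the mj definitions quoted further down in this thread; `classes` is a hypothetical helper name used only for illustration, not part of the posted code. It maps each octet of a string to its character class.)

```j
NB. Sketch only: rebuild just the two utf-8 classes from the posted mj
mj=: 256$0                NB. X other (all remaining classes omitted here)
mj=: 12 (192+i.64)}mj     NB. U utf-8 octet prefix
mj=: 13 (128+i.64)}mj     NB. V utf-8 octet suffix

NB. hypothetical helper: character class of each octet in y
classes=: 3 : 'mj {~ a. i. y'

classes 'π'     NB. two octets (207 128): one U then one V
classes '₂'     NB. three octets (226 130 130): one U then two V
```

With these definitions the first call should give 12 13 and the second 12 13 13, so the subscript 2 in H₂O is one of the "longer than two" sequences in question.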
Please use the implementation which has lines ending like this:

15.0 16.0 NB. 15 partial
15.2 16.2 NB. 16 utf-8

Line 15 is the line which we land on after recognizing the beginning of a utf-8 character. This should always be followed by one or more utf-8 suffix octets. So operation 0 is the right operation here. (Operation zero means: just keep on treating further characters as part of this same sequence of characters that we've been working on up to this point. (Either part of the same token, or part of a non-token which will not be emitted -- which is which depends on how the previous characters were handled.))

Line 16 is the line which we land on after recognizing a utf-8 suffix octet. We can have an arbitrary number of these in the character (1, 2 or 3), and the character ends when we encounter something which is not one of these suffix octets. So operation 2 is the right operation here. (Operation 2 means: finish this token and start a new token.)

Now... if you want unicode characters in this mechanism to be treated as alphanumeric (which probably wouldn't be the right thing for a variety of unicode characters, but that's getting into the need to have a database classifying the purpose of each unicode character), what you could do is make the entries which transition from alphanumeric characters accept unicode sequences as a part of the current token, and similarly you would not end the token when recognizing the end of the unicode character. Basically, you'd be switching a bunch of operations from 2 (or 3, in the case of numeric characters) to zeros.

Thanks,

--
Raul

On Sat, Nov 21, 2020 at 1:43 PM Don Guinn <[email protected]> wrote:
>
> Okay. I changed your lines for UTF-8 to:
>
> 1.1 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> 1.0 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
>
> Notice that this treats each UTF-8 sequence as "other" instead of alph/num
> but the box contains the entire UTF-8 sequence.
> Notice that in your
> definition the subscript 2 in water shows up as a separate box as an
> "other" in my test. Where in my case the subscript 2 is considered an
> alph/num.
>
> NB. A noun to show the handling of UTF-8 in ;:
> test=:{{)n
> The symbol for the Euro is ₠
> Other symols like π show up also
> How about ⌹ in APL NB. ⌹
> Common expressions like 'H₂O' for water
> Common expressions like H₂O for water
> }}
>
> NB. How ;: in beta handles it
> ,.<;.2(0;sj;mj);:test
> +-------------------------------------------+
> |+---+------+---+---+----+--+-+-+           |
> ||The|symbol|for|the|Euro|is|₠| |           |
> |+---+------+---+---+----+--+-+-+           |
> +-------------------------------------------+
> |+-----+------+----+-+----+--+----+-+       |
> ||Other|symols|like|π|show|up|also| |       |
> |+-----+------+----+-+----+--+----+-+       |
> +-------------------------------------------+
> |+---+-----+-+--+---+-----+-+               |
> ||How|about|⌹|in|APL|NB. ⌹| |               |
> |+---+-----+-+--+---+-----+-+               |
> +-------------------------------------------+
> |+------+-----------+----+-----+---+-----+-+|
> ||Common|expressions|like|'H₂O'|for|water| ||
> |+------+-----------+----+-----+---+-----+-+|
> +-------------------------------------------+
> |+------+-----------+----+-+-+-+---+-----+-+|
> ||Common|expressions|like|H|₂|O|for|water| ||
> |+------+-----------+----+-+-+-+---+-----+-+|
> +-------------------------------------------+
>
> NB. Assigning UTF8 as character
> mj=: 2 (128+i.128)}mj
>
> NB. How UTF-8 is now handled
> ,.<;.2(0;sj;mj);:test
> +-------------------------------------------+
> |+---+------+---+---+----+--+-+-+           |
> ||The|symbol|for|the|Euro|is|₠| |           |
> |+---+------+---+---+----+--+-+-+           |
> +-------------------------------------------+
> |+-----+------+----+-+----+--+----+-+       |
> ||Other|symols|like|π|show|up|also| |       |
> |+-----+------+----+-+----+--+----+-+       |
> +-------------------------------------------+
> |+---+-----+-+--+---+-----+-+               |
> ||How|about|⌹|in|APL|NB. ⌹| |               |
> |+---+-----+-+--+---+-----+-+               |
> +-------------------------------------------+
> |+------+-----------+----+-----+---+-----+-+|
> ||Common|expressions|like|'H₂O'|for|water| ||
> |+------+-----------+----+-----+---+-----+-+|
> +-------------------------------------------+
> |+------+-----------+----+---+---+-----+-+  |
> ||Common|expressions|like|H₂O|for|water| |  |
> |+------+-----------+----+---+---+-----+-+  |
> +-------------------------------------------+
>
> On Sat, Nov 21, 2020 at 11:05 AM Don Guinn <[email protected]> wrote:
>
> > I was not precise in my earlier response. I should have said that
> > detecting the wrong number of UTF-8 continuation bytes would be difficult
> > in the sequential machine as you would probably need to detect seven
> > possible U starts and seven U continues to properly check. That would make
> > a very large JS, although only checking for up to three continuation bytes
> > would probably be sufficient.
> >
> > And I should have said that your SJ and MJ did not catch the UTF-8
> > characters in my test. Each byte was still treated as "other". Your
> > approach is better as it allows the possibility of treating UTF-8 as
> > "other", but would contain more than one byte - the entire UTF-8 sequence.
> > I haven't looked at your SJ yet to try to find out why it doesn't catch the
> > UTF-8.
> >
> > But what does one do if there is an error in the data?
> > ;: returns errors
> > if SJ and MJ are not constructed properly, but there is no way to report an
> > error for bad data. And if there were, what would a programmer do about it?
> > So is it necessary to detect bad UTF-8 sequences? Probably not. And for now
> > at least treating all UTF-8 like alp/num would probably be what one would
> > want. Let the display of the data show the errors.
> >
> > On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
> >
> >> (That said, it's also worth noting that the state table I presented
> >> here doesn't give an error for unbalanced quotes, either. So if you
> >> want errors to be thrown, you should probably be updating the state
> >> table to force an error there, also.)
> >>
> >> (And, I should note that I haven't taken a look at what the resulting
> >> errors look like. So I don't know how informative the resulting error
> >> messages would be...)
> >>
> >> Thanks again,
> >>
> >> --
> >> Raul
> >>
> >> On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
> >> >
> >> > Yes, I was surprised (and pleased) that the text file came through ok.
> >> >
> >> > As for throwing errors on malformed utf-8 sequences, here's how that
> >> > could be implemented:
> >> >
> >> > (1) We introduce an "error row" which uses operation 2 for all
> >> > character classes (and, for consistency, identifies the next row as
> >> > itself -- this is mostly so that improper use of the error row would
> >> > be relatively obvious). (Operation 2 emits a token, but the important
> >> > thing here is that it throws an error if j=-1.)
> >> >
> >> > (2) For each row which is part of a partially complete utf-8
> >> > character, any appearance of any character class which is not a utf-8
> >> > suffix character would use operation 3 and would identify the next row
> >> > as the error row. (Operation 3 emits a token, but the important thing
> >> > here is that it sets j=-1.)
> >> >
> >> > (3) Each of the different utf-8 prefixes would lead to a different row
> >> > in the state table. For example, the character class containing
> >> > character 224 would get a beginning of token row (which gets the error
> >> > row treatment and) which leads to a row that expects a utf-8 suffix
> >> > (which gets the error row treatment) followed by a second utf-8 suffix
> >> > (which follows the pattern set by row 1 of the state table).
> >> >
> >> > Hopefully that makes sense.
> >> >
> >> > Note that the final token in a string could not trigger an error.
> >> > That's a limitation of the engine and corresponds approximately to how
> >> > utf-8 must be treated when a low level buffer boundary splits a utf-8
> >> > character.
> >> >
> >> > Anyways, the point is that the sequential machine can support the sort
> >> > of "count a small number of steps" which is needed here. The
> >> > difficulty is more that the machine stops when it reaches the end of
> >> > the string. If that's a sensitivity, this could be handled by
> >> > appending a linefeed character to the end of the string before
> >> > processing and then removing a final linefeed character from the last
> >> > token after tokenization.
> >> >
> >> > Again, I hope this makes sense...
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Raul
> >> >
> >> > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> >> > >
> >> > > The text file worked great!
> >> > >
> >> > > As to the UTF-8 codes. What is important is to avoid splitting the start
> >> > > bytes from the continuation bytes. Validating the UTF-8 codes is a
> >> > > difficult task. The start byte includes the number of continuation bytes to
> >> > > follow. It would be an error if the number of continuation bytes didn't
> >> > > agree.
> >> > >
> >> > > I tried my test on your definitions and it failed. Attached is a text file
> >> > > with your definitions and my test following.
> >> > >
> >> > > Well, I can't view the attachment. I don't know if it's there or not. Just
> >> > > in case, here is my test.
> >> > >
> >> > > NB. A noun to show the handling of UTF-8 in ;:
> >> > > test=:{{)n
> >> > > The symbol for the Euro is ₠
> >> > > Other symools like π show up also
> >> > > How about ⌹ in APL NB. ⌹
> >> > > Common expressions like 'H₂O' for water
> >> > > Common expressions like H₂O for water
> >> > > }}
> >> > >
> >> > > NB. How ;: this sj and mj handles it
> >> > > ,.<;.2(0;sj;mj);:test
> >> > >
> >> > > NB. Assigning UTF8 as character
> >> > > mj=: 2 (128+i.128)}mj
> >> > >
> >> > > NB. How UTF-8 is now handled
> >> > > ,.<;.2(0;sj;mj);:test
> >> > >
> >> > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> >> > >
> >> > > > Here's an updated version, which also retains utf-8 character
> >> > > > sequences within token boundaries (instead of splitting them up into
> >> > > > multiple tokens). I had originally posted this to the jbeta forum, but
> >> > > > it's really a programming topic, and probably belongs here.
> >> > > >
> >> > > > mj=: 256$0 NB. X other
> >> > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> >> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> >> > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
> >> > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
> >> > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> >> > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
> >> > > > mj=: 7 (a.i.':')}mj NB. : the colon
> >> > > > mj=: 8 (a.i.'''')}mj NB. Q quote
> >> > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> >> > > > mj=:10 (10)} mj NB. LF
> >> > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
> >> > > > mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
> >> > > > mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
> >> > > >
> >> > > > sj=: 0 10#:10*}.".;._2(0 :0)
> >> > > > ' X S A N B 9 . : Q { LF } U V']0
> >> > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> >> > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> >> > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> >> > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> >> > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
> >> > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> >> > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> >> > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
> >> > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> >> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> >> > > > )
> >> > > >
> >> > > > As I noted in the beta forum -- increasing the complexity of the state
> >> > > > table adds both rows and columns (size grows proportional to or faster
> >> > > > than the square of the number of distinct character types when each
> >> > > > character type requires a distinct handling). So it's good to keep
> >> > > > this thing as simple as possible.
> >> > > >
> >> > > > Also, I've not tested this extensively, and it's possible I'll need to
> >> > > > make further changes (let me know if you spot any problems).
> >> > > >
> >> > > > That said... note also that I have *not* implemented the unicode
> >> > > > guideline which might suggest that the tokenizer should throw an error
> >> > > > on malformed utf-8 sequences. That would require several more rows and
> >> > > > columns to achieve the recommended inconvenience. This would also
> >> > > > introduce email line wrap, because the state table would become that
> >> > > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> >> > > > suggestion -- that the forum would preserve .txt attachments -- might
> >> > > > be a way of working around that issue. I suspect not, but it's easy
> >> > > > enough to test...)
> >> > > >
> >> > > > This has *not* been implemented in the current jbeta as the ;: monad.
> >> > > > I am not sure if it should, since J the language is based on ascii,
> >> > > > not unicode -- it's just convenient that unicode supports an ascii
> >> > > > subset.
> >> > > >
> >> > > > Still... we often do have reason to work with utf-8.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > --
> >> > > > Raul
> >> > > >
> >> > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> >> > > > >
> >> > > > > Oh, oops, I should have spotted that. Thanks.
> >> > > > >
> >> > > > > Updated state table:
> >> > > > >
> >> > > > > mj=: 256$0 NB. X other
> >> > > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> >> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> >> > > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
> >> > > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
> >> > > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> >> > > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
> >> > > > > mj=: 7 (a.i.':')}mj NB. : the colon
> >> > > > > mj=: 8 (a.i.'''')}mj NB. Q quote
> >> > > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> >> > > > > mj=:10 (10)} mj NB. LF
> >> > > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
> >> > > > >
> >> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> >> > > > > ' X S A N B 9 . : Q { LF }']0
> >> > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> >> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> >> > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> >> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> >> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> >> > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> >> > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> >> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> >> > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> >> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> >> > > > > )
> >> > > > >
> >> > > > > (Note that I haven't coerced this state table to integer form --
> >> > > > > floats and integers occupy the same space on 64 bit systems, and the
> >> > > > > model doesn't really care about representation.)
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > --
> >> > > > > Raul
> >> > > > >
> >> > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> >> > > > > >
> >> > > > > > I was talking about
> >> > > > > >
> >> > > > > > ;: LF,'.'
> >> > > > > > +-+-+
> >> > > > > > | |.|
> >> > > > > > +-+-+
> >> > > > > >
> >> > > > > > Henry Rich
> >> > > > > >
> >> > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> >> > > > > > > I tested for that case:
> >> > > > > > >
> >> > > > > > > #;:'NB.',LF,LF
> >> > > > > > > 3
> >> > > > > > > #(0;sj;mj) sq 'NB.',LF,LF
> >> > > > > > > 3
> >> > > > > > > #(0;sj;mj) sq 'NB.',LF,LF,LF
> >> > > > > > > 4
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > ----------------------------------------------------------------------
> >> > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
