Yes, I was surprised (and pleased) that the text file came through ok.

As for throwing errors on malformed utf-8 sequences, here's how that
could be implemented:

(1) We introduce an "error row" which uses operation 2 for all
character classes (and, for consistency, identifies the next row as
itself -- this is mostly so that improper use of the error row would
be relatively obvious). (Operation 2 emits a token, but the important
thing here is that it throws an error if j=-1.)

(2) For each row which is part of a partially complete utf-8
character, any appearance of any character class which is not a utf-8
suffix character would use operation 3 and would identify the next row
as the error row. (Operation 3 emits a token, but the important thing
here is that it sets j=-1.)

(3) Each of the different utf-8 prefixes would lead to a different row
in the state table. For example, the character class containing
character 224 (the start of a three-byte sequence) would get a
beginning of token row (which gets the error row treatment) and which
leads to a row that expects a utf-8 suffix (which gets the error row
treatment) followed by a second utf-8 suffix (which follows the
pattern set by row 1 of the state table).
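
As a rough illustration of steps (1) through (3), here is a minimal
sketch in Python rather than J. It is a simplified model of a
sequential machine, not the actual ;: engine, and every name in it
(tokenize, build_table, the state and class names) is made up for the
example. It assumes each complete utf-8 character is its own token,
and it only counts suffix bytes: it does not reject overlong
encodings or surrogates.

```python
# Simplified model of a sequential machine with an "error row".
# Hypothetical names throughout; this is NOT J's ;: implementation.
START, DONE, NEED1, NEED2, NEED3, ERROR = range(6)   # rows (states)
ASCII, P2, P3, P4, SUFFIX, BAD = range(6)            # character classes


def char_class(b):
    """Map a byte to its character class."""
    if b < 128:
        return ASCII
    if b < 192:
        return SUFFIX      # 128..191: utf-8 suffix (continuation) byte
    if b < 224:
        return P2          # prefix of a 2-byte sequence
    if b < 240:
        return P3          # prefix of a 3-byte sequence (includes 224)
    if b < 248:
        return P4          # prefix of a 4-byte sequence
    return BAD             # 248..255 never appear in utf-8


def build_table():
    """Return {(row, class): (next_row, op)}.

    Ops modeled here: 0 = nothing, 1 = j=i, 2 = emit then j=i,
    3 = emit then j=-1. Emitting while j is -1 is an error.
    """
    begin = {ASCII: DONE, P2: NEED1, P3: NEED2, P4: NEED3}
    table = {}
    for row in (START, DONE):
        op = 1 if row == START else 2
        for cls, nxt in begin.items():
            table[row, cls] = (nxt, op)
    # At START j is still -1, so op 2 raises immediately.
    table[START, SUFFIX] = table[START, BAD] = (ERROR, 2)
    table[DONE, SUFFIX] = table[DONE, BAD] = (ERROR, 3)
    # Step (2): rows inside a partial character send every
    # non-suffix class to the error row using op 3.
    for row, after in ((NEED1, DONE), (NEED2, NEED1), (NEED3, NEED2)):
        for cls in range(6):
            table[row, cls] = (ERROR, 3)
        table[row, SUFFIX] = (after, 0)
    # Step (1): the error row uses op 2 everywhere and points at itself.
    for cls in range(6):
        table[ERROR, cls] = (ERROR, 2)
    return table


def tokenize(data: bytes):
    table = build_table()
    row, j, out = START, -1, []
    for i, b in enumerate(data):
        row, op = table[row, char_class(b)]
        if op in (2, 3):
            if j == -1:
                raise ValueError(f"malformed utf-8 near byte {i}")
            out.append(data[j:i])
        if op in (1, 2):
            j = i
        elif op == 3:
            j = -1
    if j != -1:            # end of input: emit whatever is pending
        out.append(data[j:])
    return out
```

In this model, tokenize(b"\xe2ab") raises ValueError on the byte after
the one that broke the sequence, because the error only surfaces when
the error row tries to emit with j=-1. By the same logic
tokenize(b"\xe2a") returns without an error, since the input ends
before the error row is ever consulted.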

Hopefully that makes sense.

Note that the final token in a string could not trigger an error.
That's a limitation of the engine and corresponds approximately to how
utf-8 must be treated when a low level buffer boundary splits a utf-8
character.

Anyway, the point is that the sequential machine can support the sort
of "count a small number of steps" logic which is needed here. The
difficulty is more that the machine stops when it reaches the end of
the string. If that's a concern, it could be handled by
appending a linefeed character to the end of the string before
processing and then removing a final linefeed character from the last
token after tokenization.
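
The linefeed workaround can be sketched as a small wrapper, again in
Python rather than J. Both function names are hypothetical, and
toy_tokenize is only a stand-in that mimics the limitation described
above (it validates every token except the last one); it is not a
model of the sj/mj tables.

```python
import re


def toy_tokenize(data: bytes):
    """Hypothetical stand-in tokenizer.

    Words are runs of non-space bytes, a linefeed is its own token,
    and every token except the final one is checked as utf-8.
    """
    tokens = re.findall(rb"\n|[^ \n]+", data)
    for tok in tokens[:-1]:
        tok.decode("utf-8")    # raises UnicodeDecodeError if malformed
    return tokens


def tokenize_with_sentinel(tokenize, data: bytes):
    """Append a linefeed, tokenize, then strip that linefeed again."""
    tokens = tokenize(data + b"\n")
    if tokens and tokens[-1].endswith(b"\n"):
        tokens[-1] = tokens[-1][:-1]
        if not tokens[-1]:     # the sentinel was a token of its own
            tokens.pop()
    return tokens
```

With this stand-in, toy_tokenize(b"ok \xe2") returns quietly, while
tokenize_with_sentinel(toy_tokenize, b"ok \xe2") raises, because the
truncated character is no longer part of the final token.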

Again, I hope this makes sense...

Thanks,

-- 
Raul




On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
>
> The text file worked great!
>
> As to the UTF-8 codes: what is important is to avoid splitting the start
> bytes from the continuation bytes. Validating the UTF-8 codes is a
> difficult task. The start byte encodes the number of continuation bytes
> that follow, so it would be an error if the actual number of continuation
> bytes didn't agree.
>
> I tried my test on your definitions and it failed. Attached is a text file
> with your definitions and my test following.
>
> Well, I can't view the attachment. I don't know if it's there or not. Just
> in case, here is my test.
>
>
> NB. A noun to show the handling of UTF-8 in ;:
> test=:{{)n
> The symbol for the Euro is ₠
> Other symbols like π show up also
> How about ⌹ in APL NB. ⌹
> Common expressions like 'H₂O' for water
> Common expressions like H₂O for water
> }}
>
> NB. How ;: with this sj and mj handles it
> ,.<;.2(0;sj;mj);:test
>
> NB. Assigning UTF8 as character
> mj=: 2 (128+i.128)}mj
>
> NB. How UTF-8 is now handled
> ,.<;.2(0;sj;mj);:test
>
> On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
>
> > Here's an updated version, which also retains utf-8 character
> > sequences within token boundaries (instead of splitting them up into
> > multiple tokens). I had originally posted this to the jbeta forum, but
> > it's really a programming topic, and probably belongs here.
> >
> > mj=: 256$0                     NB. X other
> > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > mj=: 7 (a.i.':')}mj            NB. : the colon
> > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > mj=:10 (10)} mj                NB. LF
> > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> >
> > sj=: 0 10#:10*}.".;._2(0 :0)
> > ' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
> >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 5 NB.
> >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0 NB. 7 '
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 9 comment
> >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2 NB. 11 {
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2 NB. 12 }
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 13 {{
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 14 }}
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > )
> >
> > As I noted in the beta forum -- increasing the complexity of the state
> > table adds both rows and columns (size grows proportional to or faster
> > than the square of the number of distinct character types when each
> > character type requires a distinct handling). So it's good to keep
> > this thing as simple as possible.
> >
> > Also, I've not tested this extensively, and it's possible I'll need to
> > make further changes (let me know if you spot any problems).
> >
> > That said... note also that I have *not* implemented the unicode
> > guideline which might suggest that the tokenizer should throw an error
> > on malformed utf-8 sequences. That would require several more rows and
> > columns to achieve the recommended inconvenience. This would also
> > introduce email line wrap, because the state table would become that
> > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> > suggestion -- that the forum would preserve .txt attachments -- might
> > be a way of working around that issue. I suspect not, but it's easy
> > enough to test...)
> >
> > This has *not* been implemented in the current jbeta as the ;: monad.
> > I am not sure if it should be, since J the language is based on ascii,
> > not unicode -- it's just convenient that unicode supports an ascii
> > subset.
> >
> > Still... we often do have reason to work with utf-8.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > >
> > > Oh, oops, I should have spotted that. Thanks.
> > >
> > > Updated state table:
> > >
> > > mj=: 256$0                     NB. X other
> > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > mj=:10 (10)} mj                NB. LF
> > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > >
> > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > ' X   S   A   N   B   9   .   :   Q    {    LF   }']0
> > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
> > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9 comment
> > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> > > )
> > >
> > > (Note that I haven't coerced this state table to integer form --
> > > floats and integers occupy the same space on 64 bit systems, and the
> > > model doesn't really care about representation.)
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > >
> > > > I was talking about
> > > >
> > > >     ;: LF,'.'
> > > > +-+-+
> > > > | |.|
> > > > +-+-+
> > > >
> > > > Henry Rich
> > > >
> > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > I tested for that case:
> > > > >
> > > > >     #;:'NB.',LF,LF
> > > > > 3
> > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > 3
> > > > >     #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > 4
> > > > >
> > > > > Thanks,
> > > > >
> > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm