The text file worked great!
As to the UTF-8 codes: what is important is to avoid splitting a start
byte from its continuation bytes. Validating UTF-8 sequences is a more
difficult task. The start byte encodes the number of continuation bytes
that follow, and it is an error if the number of continuation bytes
actually present doesn't agree.
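To make that rule concrete, here is a minimal sketch in J. It is not part of the sj/mj tokenizer, and the name utf8ok is made up for this illustration. It only checks that each start byte is followed by the number of continuation bytes it announces; it does not reject overlong forms, surrogates, or start bytes above 244.

```j
NB. utf8ok (hypothetical name): 1 if every start byte in y is followed by
NB. exactly the number of continuation bytes it announces, else 0.
utf8ok=: 3 : 0
 need=. 0                              NB. continuation bytes still expected
 for_b. a. i. y do.
  if. (128 <: b) *. 192 > b do.        NB. continuation byte 10xxxxxx
   if. need = 0 do. 0 return. end.     NB. continuation with no start byte
   need=. need - 1
  else.                                NB. ASCII byte or start byte
   if. need > 0 do. 0 return. end.     NB. sequence cut short
   need=. +/ 192 224 240 <: b          NB. 0 for ASCII, else 1, 2 or 3
  end.
 end.
 need = 0                              NB. text must not end mid-sequence
)

utf8ok 'H', (226 130 130 { a.), 'O'    NB. well formed: the bytes of H₂O
utf8ok 'H', (226 130 { a.), 'O'        NB. one continuation byte missing
utf8ok 'H', (130 { a.), 'O'            NB. stray continuation byte
```

The three calls should give 1, 0 and 0. The byte values are spelled out with { a. so the example does not depend on how the script file itself is encoded.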
I tried my test on your definitions and it failed. Attached is a text file
with your definitions, followed by my test.
Well, I can't view the attachment. I don't know if it's there or not. Just
in case, here is my test.
NB. A noun to show the handling of UTF-8 in ;:
test=:{{)n
The symbol for the Euro is ₠
Other symbols like π show up also
How about ⌹ in APL NB. ⌹
Common expressions like 'H₂O' for water
Common expressions like H₂O for water
}}
NB. How ;: with this sj and mj handles it
,.<;.2(0;sj;mj);:test
NB. Assign all UTF-8 bytes (128-255) to the letter class
mj=: 2 (128+i.128)}mj
NB. How UTF-8 is now handled
,.<;.2(0;sj;mj);:test
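To see what that reassignment changes, it can help to look at the character classes the bytes of one multi-byte character receive. A minimal sketch, assuming the mj definition quoted below (only the lines needed here are repeated); the names pibytes and mjA are made up for this illustration.

```j
mj=: 256$0                       NB. X other
mj=: 2 (,(a.i.'Aa')+/i.26)}mj    NB. A letters
mj=:12 (192+i.64)}mj             NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj             NB. V utf-8 octet suffix

pibytes=: 207 128 { a.           NB. the two bytes of π (U+03C0)
mj {~ a. i. pibytes              NB. 12 13: prefix class, then suffix class

mjA=: 2 (128+i.128)}mj           NB. every byte above 127 treated as a letter
mjA {~ a. i. pibytes             NB. 2 2: same class as A-Z a-z
```

With the modified map the U and V columns of sj are never reached, so the bytes of a multi-byte character are handled by the same rows and columns as ordinary letters.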
On Fri, Nov 20, 2020 at 2:46 PM Raul Miller wrote:
> Here's an updated version, which also retains utf-8 character
> sequences within token boundaries (instead of splitting them up into
> multiple tokens). I had originally posted this to the jbeta forum, but
> it's really a programming topic, and probably belongs here.
>
> mj=: 256$0 NB. X other
> mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> mj=: 3 (a.i.'N')}mj NB. N the letter N
> mj=: 4 (a.i.'B')}mj NB. B the letter B
> mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> mj=: 6 (a.i.'.')}mj NB. . the decimal point
> mj=: 7 (a.i.':')}mj NB. : the colon
> mj=: 8 (a.i.'''')}mj NB. Q quote
> mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> mj=:10 (10)} mj NB. LF
> mj=:11 (a.i.'}')}mj NB. } the right curly brace
> mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
> mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
>
> sj=: 0 10#:10*}.".;._2(0 :0)
> ' X S A N B 9 . : Q { LF } U V']0
> 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
> 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
> 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
> 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
> 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
> 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> )
>
> As I noted in the beta forum -- increasing the complexity of the state
> table adds both rows and columns (size grows proportional to or faster
> than the square of the number of distinct character types when each
> character type requires a distinct handling). So it's good to keep
> this thing as simple as possible.
>
> Also, I've not tested this extensively, and it's possible I'll need to
> make further changes (let me know if you spot any problems).
>
> That said... note also that I have *not* implemented the unicode
> guideline which might suggest that the tokenizer should throw an error
> on malformed utf-8 sequences. That would require several more rows and
> columns to achieve the recommended inconvenience. This would also
> introduce email line wrap, because the state table would become that
> fat. (I'll attach a copy here as a .txt file, to see if an earlier
> suggestion -- that the forum would preserve .txt attachments -- might
> be a way of working around that issue. I suspect not, but it's easy
> enough to test...)
>
> This has *not* been implemented in the current jbeta as the ;: monad.
> I am not sure if it should be, since J the language is based on ASCII,
> not Unicode -- it's just convenient that Unicode supports an ASCII
> subset.
>
> Still... we often do have reason to work with utf-8.
>
> Thanks,
>
> --
> Raul
>
> On Sun, Nov 8, 2020 at 11:01 AM Raul Miller wrote:
> >
> > Oh, oops, I should have spotted that. Thanks.
> >
> > Updated state table:
> >
> > mj=: 256$0 NB. X other
> > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> > mj=: 3 (a.i.'N')}mj NB. N the letter N
> > mj=: 4 (a.i.'B')}mj NB. B the letter B
> > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> > mj=: 6 (a.i.'.')}mj NB. . the decimal point
> > mj=: 7 (a.i.':')}mj NB. : the colon
> > mj=: 8 (a.i.'''')}mj NB. Q quote
> > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> > mj=:10 (10)} mj NB. LF
> > mj=:11 (a.i.'}')}mj NB. } the right curly brace
> >
> > sj=: 0 10#:10*}.".;._2(0 :0)
> > ' X S A N B 9 . : Q { LF }']0
> > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> > )
> >
> > (Note that I haven't coerced this state table to integer form --
> > floats and integers occupy the same space on 64 bit systems, and the
> > model doesn't really care about representation.)
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich wrote:
> > >
> > > I was talking about
> > >
> > > ;: LF,'.'
> > > +-+-+
> > > | |.|
> > > +-+-+
> > >
> > > Henry Rich
> > >
> > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > I tested for that case:
> > > >
> > > > #;:'NB.',LF,LF
> > > > 3
> > > > #(0;sj;mj) sq 'NB.',LF,LF
> > > > 3
> > > > #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > 4
> > > >
> > > > Thanks,
> > > >
> > >
> > >
> > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
>
mj=: 256$0 NB. X other
mj=: 1 (9,a.i.' ')}mj NB. S space and tab
mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
mj=: 3 (a.i.'N')}mj NB. N the letter N
mj=: 4 (a.i.'B')}mj NB. B the letter B
mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
mj=: 6 (a.i.'.')}mj NB. . the decimal point
mj=: 7 (a.i.':')}mj NB. : the colon
mj=: 8 (a.i.'''')}mj NB. Q quote
mj=: 9 (a.i.'{')}mj NB. { the left curly brace
mj=:10 (10)} mj NB. LF
mj=:11 (a.i.'}')}mj NB. } the right curly brace
mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
sj=: 0 10#:10*}.".;._2(0 :0)
' X S A N B 9 . : Q { LF } U V']0
1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 16 utf-8
)
NB. A noun to show the handling of UTF-8 in ;:
test=:{{)n
The symbol for the Euro is ₠
Other symbols like π show up also
How about ⌹ in APL NB. ⌹
Common expressions like 'H₂O' for water
Common expressions like H₂O for water
}}
NB. How ;: in beta handles it
,.<;.2(0;sj;mj);:test
NB. Assign all UTF-8 bytes (128-255) to the letter class
mj=: 2 (128+i.128)}mj
NB. How UTF-8 is now handled
,.<;.2(0;sj;mj);:test