Yes. I forgot about j=._1. And I just looked at the UTF-8 description and
saw that it was restricted to 3 continuation bytes, for a maximum of
21 bits, in 2003. Previously it allowed up to 5 continuation bytes.
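
The 21-bit figure follows from the byte layout. Here is a minimal sketch of
the arithmetic, in Python rather than J only so that it can be run anywhere.
The function names are my own and do not come from any library; the byte
ranges are the ones in the 2003 definition (RFC 3629).

```python
def continuation_count(lead: int) -> int:
    """Continuation bytes announced by a UTF-8 lead byte.

    Returns 0 for ASCII and -1 for a byte that cannot start a sequence.
    """
    if lead < 0x80:
        return 0              # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 1              # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 2              # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 3              # 11110xxx, capped at U+10FFFF
    return -1                 # continuation byte, C0/C1, or F5..FF


def payload_bits(n_cont: int) -> int:
    """Payload bits in a sequence with n_cont (>= 1) continuation bytes."""
    # The lead byte spends n_cont+1 one-bits and a zero-bit on the length
    # marker, leaving 6-n_cont payload bits; each continuation byte adds 6.
    return (6 - n_cont) + 6 * n_cont
```

With three continuation bytes, payload_bits(3) gives 3 + 18 = 21.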

Hey! This has been a fun exercise whether or not it is put into production.
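
For what it is worth, the counting idea in Raul's three-step description
quoted below can be sketched outside of ;: as well. This is a hedged Python
illustration, not his state table: the function name check_utf8 is
hypothetical, the state is simply the number of suffix bytes still expected,
and it only counts bytes (it does not reject overlong forms or lead bytes
above 244). The appended linefeed plays the role of the sentinel he suggests
for noticing a truncated final character.

```python
def check_utf8(data: bytes) -> int:
    """Return the index of the first malformed byte, or -1 if none is found.

    A truncated final character is reported at index len(data).
    """
    data = data + b"\n"          # sentinel: lets the last character fail
    expected = 0                 # suffix bytes still owed by the last prefix
    for i, b in enumerate(data):
        is_suffix = 0x80 <= b <= 0xBF
        if expected:             # inside a partially complete character
            if not is_suffix:
                return i         # the "error row": anything but a suffix
            expected -= 1
        elif is_suffix:
            return i             # suffix byte with no prefix before it
        elif b >= 0xC0:          # prefix byte: 1, 2 or 3 suffixes follow
            expected = 1 if b < 0xE0 else 2 if b < 0xF0 else 3
    return -1
```

A well-formed string such as "H₂O".encode() comes back as -1, while the
two bytes b"\xe2\x82" alone are reported at index 2, the position of the
missing suffix.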

On Sat, Nov 21, 2020 at 11:48 AM Raul Miller <[email protected]> wrote:

> I think we need to be a little clearer about what you're trying to
> achieve when you say "catch the utf-8 characters".
>
> Your change to mj made all utf-8 *octets* be treated as alphabetic.
> That's... an approach, certainly.
>
> Meanwhile, according to https://en.wikipedia.org/wiki/UTF-8#Encoding
> there are three different utf-8 starts. I don't know what you are
> referring to when you say that there are seven possible U starts.
>
> But there is a way for ;: to report errors in the stream. See attached
> for a demonstration (I made quoted strings be an error.)
>
> Thanks,
>
> --
> Raul
>
> On Sat, Nov 21, 2020 at 1:05 PM Don Guinn <[email protected]> wrote:
> >
> > I was not precise in my earlier response. I should have said that detecting
> > the wrong number of UTF-8 continuation bytes would be difficult in the
> > sequential machine as you would probably need to detect seven possible U
> > starts and seven U continues to properly check. That would make a very
> > large JS, although only checking for up to three continuation bytes would
> > probably be sufficient.
> >
> > And I should have said that your SJ and MJ did not catch the UTF-8
> > characters in my test. Each byte was still treated as "other". Your
> > approach is better as it allows the possibility of treating UTF-8 as
> > "other", with each token containing more than one byte: the entire UTF-8
> > sequence. I haven't looked at your SJ yet to try to find out why it doesn't
> > catch the UTF-8.
> >
> > But what does one do if there is an error in the data? ;: returns errors if
> > SJ and MJ are not constructed properly, but there is no way to report an
> > error for bad data. And if there were, what would a programmer do about it?
> > So is it necessary to detect bad UTF-8 sequences? Probably not. And for now
> > at least treating all UTF-8 like alp/num would probably be what one would
> > want. Let the display of the data show the errors.
> >
> > On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
> >
> > > (That said, it's also worth noting that the state table I presented
> > > here doesn't give an error for unbalanced quotes, either. So if you
> > > want errors to be thrown, you should probably be updating the state
> > > table to force an error there, also.)
> > >
> > > (And, I should note that I haven't taken a look at what the resulting
> > > errors look like. So I don't know how informative the resulting error
> > > messages would be...)
> > >
> > > Thanks again,
> > >
> > > --
> > > Raul
> > >
> > > On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]>
> > > wrote:
> > > >
> > > > Yes, I was surprised (and pleased) that the text file came through ok.
> > > >
> > > > As for throwing errors on malformed utf-8 sequences, here's how that
> > > > could be implemented:
> > > >
> > > > (1) We introduce an "error row" which uses operation 2 for all
> > > > character classes (and, for consistency, identifies the next row as
> > > > itself -- this is mostly so that improper use of the error row would
> > > > be relatively obvious). (Operation 2 emits a token, but the important
> > > > thing here is that it throws an error if j=-1.)
> > > >
> > > > (2) For each row which is part of a partially complete utf-8
> > > > character, any appearance of any character class which is not a utf-8
> > > > suffix character would use operation 3 and would identify the next row
> > > > as the error row. (Operation 3 emits a token, but the important thing
> > > > here is that it sets j=-1.)
> > > >
> > > > (3) Each of the different utf-8 prefixes would lead to a different row
> > > > in the state table. For example, the character class containing
> > > > character 224 would get a beginning of token row (which gets the error
> > > > row treatment and) which leads to a row that expects a utf-8 suffix
> > > > (which gets the error row treatment) followed by a second utf-8 suffix
> > > > (which follows the pattern set by row 1 of the state table).
> > > >
> > > > Hopefully that makes sense.
> > > >
> > > > Note that the final token in a string could not trigger an error.
> > > > That's a limitation of the engine and corresponds approximately to how
> > > > utf-8 must be treated when a low level buffer boundary splits a utf-8
> > > > character.
> > > >
> > > > Anyways, the point is that the sequential machine can support the sort
> > > > of "count a small number of steps" which is needed here. The
> > > > difficulty is more that the machine stops when it reaches the end of
> > > > the string.  If that's a sensitivity, this could be handled by
> > > > appending a linefeed character to the end of the string before
> > > > processing and then removing a final linefeed character from the last
> > > > token after tokenization.
> > > >
> > > > Again, I hope this makes sense...
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > >
> > > >
> > > >
> > > > > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> > > > >
> > > > > The text file worked great!
> > > > >
> > > > > > As to the UTF-8 codes. What is important is to avoid splitting the
> > > > > > start bytes from the continuation bytes. Validating the UTF-8 codes is
> > > > > > a difficult task. The start byte includes the number of continuation
> > > > > > bytes to follow. It would be an error if the number of continuation
> > > > > > bytes didn't agree.
> > > > > >
> > > > > > I tried my test on your definitions and it failed. Attached is a text
> > > > > > file with your definitions and my test following.
> > > > > >
> > > > > > Well, I can't view the attachment. I don't know if it's there or not.
> > > > > > Just in case, here is my test.
> > > > >
> > > > >
> > > > > NB. A noun to show the handling of UTF-8 in ;:
> > > > > test=:{{)n
> > > > > The symbol for the Euro is ₠
> > > > > > Other symbols like π show up also
> > > > > How about ⌹ in APL NB. ⌹
> > > > > Common expressions like 'H₂O' for water
> > > > > Common expressions like H₂O for water
> > > > > }}
> > > > >
> > > > > > NB. How ;: with this sj and mj handles it
> > > > > ,.<;.2(0;sj;mj);:test
> > > > >
> > > > > NB. Assigning UTF8 as character
> > > > > mj=: 2 (128+i.128)}mj
> > > > >
> > > > > NB. How UTF-8 is now handled
> > > > > ,.<;.2(0;sj;mj);:test
> > > > >
> > > > > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> > > > >
> > > > > > > Here's an updated version, which also retains utf-8 character
> > > > > > > sequences within token boundaries (instead of splitting them up into
> > > > > > > multiple tokens). I had originally posted this to the jbeta forum, but
> > > > > > > it's really a programming topic, and probably belongs here.
> > > > > >
> > > > > > mj=: 256$0                     NB. X other
> > > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > > mj=:10 (10)} mj                NB. LF
> > > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > > > > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> > > > > >
> > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
> > > > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1
> NB. 0
> > > space
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2
> NB. 1
> > > other
> > > > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2
> NB. 2
> > > alp/num
> > > > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2
> NB. 3 N
> > > > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2
> NB. 4
> > > NB
> > > > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0
> NB. 5
> > > NB.
> > > > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2
> NB. 6
> > > num
> > > > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0
> NB. 7 '
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2
> NB. 8
> > > ''
> > > > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0
> NB. 9
> > > comment
> > > > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2
> NB. 10
> > > LF
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2
> NB. 11
> > > {
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2
> NB. 12
> > > }
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2
> NB. 13
> > > {{
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2
> NB. 14
> > > }}
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0
> NB. 15
> > > > > > partial
> > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0
> NB. 16
> > > utf-8
> > > > > > )
> > > > > >
> > > > > > > As I noted in the beta forum -- increasing the complexity of the state
> > > > > > > table adds both rows and columns (size grows proportional to or faster
> > > > > > > than the square of the number of distinct character types when each
> > > > > > > character type requires a distinct handling). So it's good to keep
> > > > > > > this thing as simple as possible.
> > > > > > >
> > > > > > > Also, I've not tested this extensively, and it's possible I'll need to
> > > > > > > make further changes (let me know if you spot any problems).
> > > > > > >
> > > > > > > That said... note also that I have *not* implemented the unicode
> > > > > > > guideline which might suggest that the tokenizer should throw an error
> > > > > > > on malformed utf-8 sequences. That would require several more rows and
> > > > > > > columns to achieve the recommended inconvenience. This would also
> > > > > > > introduce email line wrap, because the state table would become that
> > > > > > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> > > > > > > suggestion -- that the forum would preserve .txt attachments -- might
> > > > > > > be a way of working around that issue. I suspect not, but it's easy
> > > > > > > enough to test...)
> > > > > > >
> > > > > > > This has *not* been implemented in the current jbeta as the ;: monad.
> > > > > > > I am not sure if it should, since J the language is based on ascii,
> > > > > > > not unicode -- it's just convenient that unicode supports an ascii
> > > > > > > subset.
> > > > > >
> > > > > > Still... we often do have reason to work with utf-8.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --
> > > > > > Raul
> > > > > >
> > > > > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > > > > > >
> > > > > > > Oh, oops, I should have spotted that. Thanks.
> > > > > > >
> > > > > > > Updated state table:
> > > > > > >
> > > > > > > mj=: 256$0                     NB. X other
> > > > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > > > mj=:10 (10)} mj                NB. LF
> > > > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > > > >
> > > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }']0
> > > > > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > > > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2
> alp/num
> > > > > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > > > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > > > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
> > > > > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > > > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > > > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9
> comment
> > > > > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
> > > > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> > > > > > > )
> > > > > > >
> > > > > > > > (Note that I haven't coerced this state table to integer form --
> > > > > > > > floats and integers occupy the same space on 64 bit systems, and the
> > > > > > > > model doesn't really care about representation.)
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > --
> > > > > > > Raul
> > > > > > >
> > > > > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > > > > > >
> > > > > > > > I was talking about
> > > > > > > >
> > > > > > > >     ;: LF,'.'
> > > > > > > > +-+-+
> > > > > > > > | |.|
> > > > > > > > +-+-+
> > > > > > > >
> > > > > > > > Henry Rich
> > > > > > > >
> > > > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > > > I tested for that case:
> > > > > > > > >
> > > > > > > > >     #;:'NB.',LF,LF
> > > > > > > > > 3
> > > > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > > > 3
> > > > > > > > >     #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > > > 4
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > This email has been checked for viruses by AVG.
> > > > > > > > https://www.avg.com
> > > > > > > >
> > > > > > > >
> > > ----------------------------------------------------------------------
> > > > > > > > For information about J forums see
> > > http://www.jsoftware.com/forums.htm
> > > > > >
> > > ----------------------------------------------------------------------
> > > > > > For information about J forums see
> > > http://www.jsoftware.com/forums.htm
> > > > > >
> > > > >
> ----------------------------------------------------------------------
> > > > > For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to