Hmm...

First off, it looks like you're working from the buggy version of
the utf-8 support, in which I was not properly handling utf-8
sequences longer than two octets.

Please use the implementation which has lines ending like this:

15.0 16.0 NB. 15 partial
15.2 16.2 NB. 16 utf-8

Line 15 is the line which we land on after recognizing the beginning
of a utf-8 character. This should always be followed by one or more
utf-8 suffix octets. So operation 0 is the right operation here.
(Operation 0 means: just keep on treating further characters as
part of the same sequence of characters that we've been working on
up to this point. That sequence is either part of a token, or part of
a non-token which will not be emitted; which is which depends on how
the previous characters were handled.)

Line 16 is the line which we land on after recognizing a utf-8 suffix
octet. A character can have one, two or three of these, and the
character ends when we encounter something which is not one of these
suffix octets. So operation 2 is the right operation here.
(Operation 2 means: finish this token and start a new token.)
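
To make the two operations concrete, here is a minimal sketch in
Python rather than J. It is a simplified model of a sequential
machine, not the ;: engine itself: the five character classes, the
five-row table and the names classify, TABLE and tokenize are all
made up for illustration, and at end of input it simply emits
whatever token is still open. The last two rows play the part of
lines 15 and 16, with their U and V entries following the quoted
table (15.0 16.0 and 15.2 16.0).

```python
# Simplified model of a table-driven tokenizer (illustrative, not J's ;:).
X, S, A, U, V = range(5)  # classes: other, space, alphanumeric, utf-8 prefix, utf-8 suffix


def classify(byte):
    """Map one octet to a character class (hypothetical, reduced set)."""
    if byte >= 0xC0:
        return U  # 192..255: utf-8 prefix octet
    if byte >= 0x80:
        return V  # 128..191: utf-8 suffix octet
    if byte in (0x20, 0x09):
        return S
    if chr(byte).isalnum() or byte == ord("_"):
        return A
    return X


# TABLE[state][class] = (next state, operation)
TABLE = [
    # X       S       A       U       V
    [(1, 1), (0, 0), (2, 1), (3, 1), (4, 1)],  # 0 space
    [(1, 2), (0, 3), (2, 2), (3, 2), (4, 2)],  # 1 other
    [(1, 2), (0, 3), (2, 0), (3, 2), (4, 2)],  # 2 alphanumeric
    [(1, 2), (0, 3), (2, 2), (3, 0), (4, 0)],  # 3 partial (like line 15)
    [(1, 2), (0, 3), (2, 2), (3, 2), (4, 0)],  # 4 utf-8   (like line 16)
]


def tokenize(text, table=TABLE):
    data = text.encode("utf-8")
    words, state, j = [], 0, -1  # j marks the start of the open token
    for i, byte in enumerate(data):
        next_state, op = table[state][classify(byte)]
        if op == 1:  # start a token here
            j = i
        elif op in (2, 3):  # emit the open token
            if j < 0:
                raise ValueError("emit requested with no open token")
            words.append(data[j:i].decode("utf-8", errors="replace"))
            j = i if op == 2 else -1  # op 2 starts a new token, op 3 does not
        # op 0: keep going, the octet stays in the current sequence
        state = next_state
    if j >= 0:  # simplification: flush whatever is still open
        words.append(data[j:].decode("utf-8", errors="replace"))
    return words


print(tokenize("H₂O is π"))
```

With this table the three octets of the subscript 2 stay together in
one token, but that token is separate from the H and the O on either
side of it.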

Now... if you want unicode characters in this mechanism to be treated
as alphanumeric (which probably wouldn't be the right thing for a
variety of unicode characters, but that's getting into the need to
have a database classifying the purpose of each unicode character),
what you could do is make the entries which transition from
alphanumeric characters accept unicode sequences as a part of the
current token. Similarly, you would not end the token when
recognizing the end of the unicode character. Basically, you'd be
switching a bunch of operations from 2 (or 3, in the case of numeric
characters) to 0.
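
As a rough illustration of that switch, here is a self-contained
Python sketch (again a simplified model, not the ;: engine). The
reduced table, the state names and the make_table helper are
assumptions made for the example; the only point is that changing a
handful of operations from 2 to 0 is enough to keep unicode
sequences inside an alphanumeric token.

```python
# Illustrative model: patching table entries so utf-8 joins alphanumeric tokens.
X, S, A, U, V = range(5)                       # character classes
SPACE, OTHER, ALNUM, PARTIAL, UTF8 = range(5)  # states (rows), hypothetical names


def classify(byte):
    if byte >= 0xC0:
        return U  # utf-8 prefix octet
    if byte >= 0x80:
        return V  # utf-8 suffix octet
    if byte in (0x20, 0x09):
        return S
    if chr(byte).isalnum() or byte == ord("_"):
        return A
    return X


def make_table(unicode_is_alnum):
    table = [
        # X       S       A       U       V
        [(1, 1), (0, 0), (2, 1), (3, 1), (4, 1)],  # space
        [(1, 2), (0, 3), (2, 2), (3, 2), (4, 2)],  # other
        [(1, 2), (0, 3), (2, 0), (3, 2), (4, 2)],  # alphanumeric
        [(1, 2), (0, 3), (2, 2), (3, 0), (4, 0)],  # partial
        [(1, 2), (0, 3), (2, 2), (3, 2), (4, 0)],  # utf-8
    ]
    if unicode_is_alnum:
        # Same next states as before; only the operation changes from 2 to 0.
        table[ALNUM][U] = (PARTIAL, 0)  # alphanumeric token absorbs a prefix octet
        table[ALNUM][V] = (UTF8, 0)
        table[PARTIAL][A] = (ALNUM, 0)  # end of the unicode character does not end the token
        table[UTF8][A] = (ALNUM, 0)
        table[UTF8][U] = (PARTIAL, 0)   # consecutive unicode characters also join
    return table


def tokenize(text, table):
    data = text.encode("utf-8")
    words, state, j = [], SPACE, -1
    for i, byte in enumerate(data):
        next_state, op = table[state][classify(byte)]
        if op == 1:
            j = i
        elif op in (2, 3):
            if j < 0:
                raise ValueError("emit requested with no open token")
            words.append(data[j:i].decode("utf-8", errors="replace"))
            j = i if op == 2 else -1
        state = next_state
    if j >= 0:  # simplification: flush the open token at end of input
        words.append(data[j:].decode("utf-8", errors="replace"))
    return words


print(tokenize("H₂O for ππ", make_table(False)))
print(tokenize("H₂O for ππ", make_table(True)))
```

Note that a unicode character which follows an "other" character
still starts its own token in this sketch; only the entries touching
the alphanumeric row were changed.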

Thanks,

-- 
Raul



On Sat, Nov 21, 2020 at 1:43 PM Don Guinn <[email protected]> wrote:
>
> Okay. I changed your lines for UTF-8 to:
>
>  1.1 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
>  1.0 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
>
> Notice that this treats each UTF-8 sequence as "other" instead of alph/num
> but the box contains the entire UTF-8 sequence. Notice that in your
> definition the subscript 2 in water shows up as a separate box as an
> "other" in my test, whereas in my case the subscript 2 is considered an
> alph/num.
>
> NB. A noun to show the handling of UTF-8 in ;:
> test=:{{)n
> The symbol for the Euro is ₠
> Other symols like π show up also
> How about ⌹ in APL NB. ⌹
> Common expressions like 'H₂O' for water
> Common expressions like H₂O for water
> }}
>
> NB. How ;: in beta handles it
> ,.<;.2(0;sj;mj);:test
> +-------------------------------------------+
> |+---+------+---+---+----+--+-+-+           |
> ||The|symbol|for|the|Euro|is|₠| |           |
> |+---+------+---+---+----+--+-+-+           |
> +-------------------------------------------+
> |+-----+------+----+-+----+--+----+-+       |
> ||Other|symols|like|π|show|up|also| |       |
> |+-----+------+----+-+----+--+----+-+       |
> +-------------------------------------------+
> |+---+-----+-+--+---+-----+-+               |
> ||How|about|⌹|in|APL|NB. ⌹| |               |
> |+---+-----+-+--+---+-----+-+               |
> +-------------------------------------------+
> |+------+-----------+----+-----+---+-----+-+|
> ||Common|expressions|like|'H₂O'|for|water| ||
> |+------+-----------+----+-----+---+-----+-+|
> +-------------------------------------------+
> |+------+-----------+----+-+-+-+---+-----+-+|
> ||Common|expressions|like|H|₂|O|for|water| ||
> |+------+-----------+----+-+-+-+---+-----+-+|
> +-------------------------------------------+
>
> NB. Assigning UTF8 as character
> mj=: 2 (128+i.128)}mj
>
> NB. How UTF-8 is now handled
> ,.<;.2(0;sj;mj);:test
> +-------------------------------------------+
> |+---+------+---+---+----+--+-+-+           |
> ||The|symbol|for|the|Euro|is|₠| |           |
> |+---+------+---+---+----+--+-+-+           |
> +-------------------------------------------+
> |+-----+------+----+-+----+--+----+-+       |
> ||Other|symols|like|π|show|up|also| |       |
> |+-----+------+----+-+----+--+----+-+       |
> +-------------------------------------------+
> |+---+-----+-+--+---+-----+-+               |
> ||How|about|⌹|in|APL|NB. ⌹| |               |
> |+---+-----+-+--+---+-----+-+               |
> +-------------------------------------------+
> |+------+-----------+----+-----+---+-----+-+|
> ||Common|expressions|like|'H₂O'|for|water| ||
> |+------+-----------+----+-----+---+-----+-+|
> +-------------------------------------------+
> |+------+-----------+----+---+---+-----+-+  |
> ||Common|expressions|like|H₂O|for|water| |  |
> |+------+-----------+----+---+---+-----+-+  |
> +-------------------------------------------+
>
>
> On Sat, Nov 21, 2020 at 11:05 AM Don Guinn <[email protected]> wrote:
>
> > I was not precise in my earlier response. I should have said that
> > detecting the wrong number of UTF-8 continuation bytes would be difficult
> > in the sequential machine as you would probably need to detect seven
> > possible U starts and seven U continues to properly check. That would make
> > a very large JS, although only checking for up to three continuation bytes
> > would probably be sufficient.
> >
> > And I should have said that your SJ and MJ did not catch the UTF-8
> > characters in my test. Each byte was still treated as "other". Your
> > approach is better as it allows the possibility of treating UTF-8 as
> > "other", but would contain more than one byte - the entire UTF-8 sequence.
> > I haven't looked at your SJ yet to try to find out why it doesn't catch the
> > UTF-8.
> >
> > But what does one do if there is an error in the data? ;: returns errors
> > if SJ and MJ are not constructed properly, but there is no way to report an
> > error for bad data. And if there were, what would a programmer do about it?
> > So is it necessary to detect bad UTF-8 sequences? Probably not. And for now
> > at least treating all UTF-8 like alp/num would probably be what one would
> > want. Let the display of the data show the errors.
> >
> > On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
> >
> >> (That said, it's also worth noting that the state table I presented
> >> here doesn't give an error for unbalanced quotes, either. So if you
> >> want errors to be thrown, you should probably be updating the state
> >> table to force an error there, also.)
> >>
> >> (And, I should note that I haven't taken a look at what the resulting
> >> errors look like. So I don't know how informative the resulting error
> >> messages would be...)
> >>
> >> Thanks again,
> >>
> >> --
> >> Raul
> >>
> >> On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
> >> >
> >> > Yes, I was surprised (and pleased) that the text file came through ok.
> >> >
> >> > As for throwing errors on malformed utf-8 sequences, here's how that
> >> > could be implemented:
> >> >
> >> > (1) We introduce an "error row" which uses operation 2 for all
> >> > character classes (and, for consistency, identifies the next row as
> >> > itself -- this is mostly so that improper use of the error row would
> >> > be relatively obvious). (Operation 2 emits a token, but the important
> >> > thing here is that it throws an error if j=-1.)
> >> >
> >> > (2) For each row which is part of a partially complete utf-8
> >> > character, any appearance of any character class which is not a utf-8
> >> > suffix character would use operation 3 and would identify the next row
> >> > as the error row. (Operation 3 emits a token, but the important thing
> >> > here is that it sets j=-1.)
> >> >
> >> > (3) Each of the different utf-8 prefixes would lead to a different row
> >> > in the state table. For example, the character class containing
> >> > character 224 would get a beginning of token row (which gets the error
> >> > row treatment and) which leads to a row that expects a utf-8 suffix
> >> > (which gets the error row treatment) followed by a second utf-8 suffix
> >> > (which follows the pattern set by row 1 of the state table).
> >> >
> >> > Hopefully that makes sense.
> >> >
> >> > Note that the final token in a string could not trigger an error.
> >> > That's a limitation of the engine and corresponds approximately to how
> >> > utf-8 must be treated when a low level buffer boundary splits a utf-8
> >> > character.
> >> >
> >> > Anyways, the point is that the sequential machine can support the sort
> >> > of "count a small number of steps" which is needed here. The
> >> > difficulty is more that the machine stops when it reaches the end of
> >> > the string.  If that's a sensitivity, this could be handled by
> >> > appending a linefeed character to the end of the string before
> >> > processing and then removing a final linefeed character from the last
> >> > token after tokenization.
> >> >
> >> > Again, I hope this makes sense...
> >> >
> >> > Thanks,
> >> >
> >> > --
> >> > Raul
> >> >
> >> >
> >> >
> >> >
> >> > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> >> > >
> >> > > The text file worked great!
> >> > >
> >> > > As to the UTF-8 codes. What is important is to avoid splitting the
> >> > > start bytes from the continuation bytes. Validating the UTF-8 codes
> >> > > is a difficult task. The start byte includes the number of
> >> > > continuation bytes to follow. It would be an error if the number of
> >> > > continuation bytes didn't agree.
> >> > >
> >> > > I tried my test on your definitions and it failed. Attached is a text
> >> > > file with your definitions and my test following.
> >> > >
> >> > > Well, I can't view the attachment. I don't know if it's there or not.
> >> > > Just in case, here is my test.
> >> > >
> >> > >
> >> > > NB. A noun to show the handling of UTF-8 in ;:
> >> > > test=:{{)n
> >> > > The symbol for the Euro is ₠
> >> > > Other symools like π show up also
> >> > > How about ⌹ in APL NB. ⌹
> >> > > Common expressions like 'H₂O' for water
> >> > > Common expressions like H₂O for water
> >> > > }}
> >> > >
> >> > > NB. How ;: this sj and mj handles it
> >> > > ,.<;.2(0;sj;mj);:test
> >> > >
> >> > > NB. Assigning UTF8 as character
> >> > > mj=: 2 (128+i.128)}mj
> >> > >
> >> > > NB. How UTF-8 is now handled
> >> > > ,.<;.2(0;sj;mj);:test
> >> > >
> >> > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> >> > >
> >> > > > Here's an updated version, which also retains utf-8 character
> >> > > > sequences within token boundaries (instead of splitting them up into
> >> > > > multiple tokens). I had originally posted this to the jbeta forum,
> >> > > > but it's really a programming topic, and probably belongs here.
> >> > > >
> >> > > > mj=: 256$0                     NB. X other
> >> > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> >> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> >> > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> >> > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> >> > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> >> > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> >> > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> >> > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> >> > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> >> > > > mj=:10 (10)} mj                NB. LF
> >> > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> >> > > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> >> > > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> >> > > >
> >> > > > sj=: 0 10#:10*}.".;._2(0 :0)
> >> > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
> >> > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> >> > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> >> > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> >> > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> >> > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 5 NB.
> >> > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> >> > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0 NB. 7 '
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> >> > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 9 comment
> >> > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2 NB. 11 {
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2 NB. 12 }
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 13 {{
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 14 }}
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> >> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> >> > > > )
> >> > > >
> >> > > > As I noted in the beta forum -- increasing the complexity of the
> >> > > > state table adds both rows and columns (size grows proportional to
> >> > > > or faster than the square of the number of distinct character types
> >> > > > when each character type requires a distinct handling). So it's good
> >> > > > to keep this thing as simple as possible.
> >> > > >
> >> > > > Also, I've not tested this extensively, and it's possible I'll need
> >> > > > to make further changes (let me know if you spot any problems).
> >> > > >
> >> > > > That said... note also that I have *not* implemented the unicode
> >> > > > guideline which might suggest that the tokenizer should throw an
> >> > > > error on malformed utf-8 sequences. That would require several more
> >> > > > rows and columns to achieve the recommended inconvenience. This would
> >> > > > also introduce email line wrap, because the state table would become
> >> > > > that fat. (I'll attach a copy here as a .txt file, to see if an
> >> > > > earlier suggestion -- that the forum would preserve .txt attachments
> >> > > > -- might be a way of working around that issue. I suspect not, but
> >> > > > it's easy enough to test...)
> >> > > >
> >> > > > This has *not* been implemented in the current jbeta as the ;: monad.
> >> > > > I am not sure if it should, since J the language is based on ascii,
> >> > > > not unicode -- it's just convenient that unicode supports an ascii
> >> > > > subset.
> >> > > >
> >> > > > Still... we often do have reason to work with utf-8.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > --
> >> > > > Raul
> >> > > >
> >> > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> >> > > > >
> >> > > > > Oh, oops, I should have spotted that. Thanks.
> >> > > > >
> >> > > > > Updated state table:
> >> > > > >
> >> > > > > mj=: 256$0                     NB. X other
> >> > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> >> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> >> > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> >> > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> >> > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> >> > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> >> > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> >> > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> >> > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> >> > > > > mj=:10 (10)} mj                NB. LF
> >> > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> >> > > > >
> >> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> >> > > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }']0
> >> > > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> >> > > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> >> > > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> >> > > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> >> > > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
> >> > > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> >> > > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> >> > > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9 comment
> >> > > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
> >> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> >> > > > > )
> >> > > > >
> >> > > > > (Note that I haven't coerced this state table to integer form --
> >> > > > > floats and integers occupy the same space on 64 bit systems, and
> >> > > > > the model doesn't really care about representation.)
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > --
> >> > > > > Raul
> >> > > > >
> >> > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> >> > > > > >
> >> > > > > > I was talking about
> >> > > > > >
> >> > > > > >     ;: LF,'.'
> >> > > > > > +-+-+
> >> > > > > > | |.|
> >> > > > > > +-+-+
> >> > > > > >
> >> > > > > > Henry Rich
> >> > > > > >
> >> > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> >> > > > > > > I tested for that case:
> >> > > > > > >
> >> > > > > > >     #;:'NB.',LF,LF
> >> > > > > > > 3
> >> > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> >> > > > > > > 3
> >> > > > > > >     #(0;sj;mj) sq 'NB.',LF,LF,LF
> >> > > > > > > 4
> >> > > > > > >
> >> > > > > > > Thanks,
> >> > > > > > >
> >> > > > > > ----------------------------------------------------------------------
> >> > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
