Re: [Unicon-group] SNOBOL operators - a few questions

Steve Wampler Sun, 08 Aug 2010 15:35:14 -0700

David Gamey wrote:
> Beyond the acknowledgment that part of the code is a kludge and the 
> desire to better integrate with the rest of the language.  Has the 
> problem/challenge been defined in a bit more detail?


David,

Here's something I wrote to Clint back in 2006 that outlines my thinking
on integrating some of the PM ideas into Unicon.  There are some references
to Sudarshan's thesis that may require some review to make sense, and a
few typos [when have I *ever* not had a few typos?]. It's excerpted from a
long message but I don't think anything's lost by doing so:

====================================================================
One of the complaints against PM in S4 (though some people like this)
is that PM is really a separate language grafted onto S4.
This has the same feel to me.  (In S4, this is mitigated some
by the fact that both languages are simple; that certainly
isn't the case in Unicon!).  PM as described does not integrate
well into Unicon and I find it confusing to have to constantly
shift gears between the Unicon language and the PM language.
I found, in thinking about patterns, that I'd want to write
something like:

     PArbno(x) && write(PAny(y))

but I imagine that's not possible!  This is a step backward.

Also, too many of the operators are too close in
meaning to existing Unicon operators.  The conditional
assignment x -> y is nearly equivalent to reversible
assignment y <- x.  I think I understand why one can't just
use y <- x, but I believe that actually illustrates part
of my concern - something isn't right if you can't.  In fact,
if you could, then there would be no need for *both* immediate
assignment and conditional assignment, := and <- would suffice.
Virtually all of the operations in PM align similarly with
existing operations.

The examples shown in section 4 are also slightly misleading -
the 'pure' Unicon versions do not take advantage of existing
features enough and so present Unicon in a worse light than
necessary.  I'd rather first see a better use of existing
facilities than introducing an entirely new mechanism.

The phone number parser can be written, for example, as:

----------------------------------------------
procedure main(args)
     in := open("phoneIn.txt")
     out := open("phoneOut.html","w")
     write(out, "<html>")
     write(out, "<body>")
     while line := read(in) do {
         write(out, "<div>",line,"</div>")
         line ? {
             if (areacode := digits(3)) &
                (trunk    := digits(3)) &
                (rest     := digits(4)) &
                (ext      := arbDigits()) &
                pos(0) then {
                 write(out,"<div style=\"color:red\">","AreaCode = ",
                           areacode,"</div>")
                 write(out,"<div style=\"color:red\">","Trunk = ",
                           trunk,"</div>")
                 write(out,"<div style=\"color:red\">","Rest = ",
                           rest,"</div>")
                 }
             }
         }
     write(out, "</body>")
     write(out, "</html>")
end

procedure digits(N)
     return 2(p1 := &pos := upto(&digits)\1, tab(many(&digits))\1, pos(p1+N))
end

procedure arbDigits()
     return (tab(upto(&digits))\1, tab(many(&digits))) | ""
end
--------------------------------------------------------------

which also happens to be 'more correct' than any of the examples in the
paper, all of which think that:

111111-22222222222223333333333333333333333333333

represents a valid phone number: (111) 111-2222

The program to detect words with double letters can be written *much* more
succinctly [if that's important] as:
--------------------------------------------------------------
procedure main()
     in := open("mtent12.txt", "r") | stop("open failed")
     out := open("mtentpatternOut.txt", "w")
     every line := !in do {
         line ? while word := (tab(upto(&letters)), tab(many(&letters))) do
                    word ? if |(move(1)\1) == move(1) then write(out, word)
         }
end
--------------------------------------------------------------

and the (A^n)(B^n)(C^n) can be done as (the ABC(s) function is overkill for the
task and quite inefficient):

---------------------------------------------------------------
procedure main(args)
     every line := !&input do {
         if line ? (ABC() & pos(0))   then write("accepted")
                                      else write("rejected")
         }
end

procedure ABC()
     return (*tab(many('a')) = *tab(many('b')) = *tab(many('c')))
end
---------------------------------------------------------------

Note that this is shorter and clearer than the S4 equivalent!  It would
be even better if Unicon had span(c) as a synonym for tab(many(c)):

procedure ABC()
     return *span('a') = *span('b') = *span('c')
end
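
A minimal sketch of how span(c) could be tried out today as an ordinary
procedure. It assumes span is a user-defined helper (it is not an existing
built-in), and it models only the behavior, not any speed gain a built-in
would bring:

```unicon
# Hypothetical helper: span(c) is defined here only to illustrate the
# proposed synonym for tab(many(c)).
procedure span(c)
    # suspend (rather than return) keeps tab()'s backtracking behavior:
    # if a later part of the match fails, &pos is restored.
    suspend tab(many(c))
end

procedure ABC()
    return *span('a') = *span('b') = *span('c')
end

procedure main()
    local s
    # Try a few sample strings (made up for this illustration).
    every s := "aabbcc" | "aabbc" | "abcx" do
        if s ? (ABC() & pos(0)) then write(s, ": accepted")
                                else write(s, ": rejected")
end
```

Using suspend instead of return is a deliberate choice here: it lets the
helper take part in backtracking the same way tab() itself does.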

[As an aside, I'm also opposed to making &input, &output, and &errout
variables.  I think one would be better served by having initialization
of global parameters and allowing (e.g.):

     global out:&output

Redefining &input etc dynamically is a problem for large programs
consisting of many packages - for one thing, there's no way to go back.
This idea is (to me) the equivalent of the way FORTRAN2 let you
redefine integer constants!  &input, &output, and &errout should
remain constants also.]

Now, having said that I'm not happy with PM implemented as an add on,
I do see some *very* useful ideas here.  One thing that is apparent is
that PM is more efficient than string scanning (even the 'improved'
examples shown above aren't likely to be as fast as PM).  So, in my
mind, the question becomes more 'what can be done to improve string
scanning?'.  I see two approaches:

(1) It's clear that PM has a richer set of operations than SS.
     The fact that only tab() and move() advance the scan position
     is elegant, but inefficient because it (a) increases the number
     of function calls and (b) always constructs a string, even
     if that string isn't needed.  The procedure digits(N), for
     example, can be written more 'cleanly' as:

     procedure digits(N)
         return 2(tab(p1 := upto(&digits)\1), tab(many(&digits))\1, pos(p1+N))
     end

     I'd like to see improved SS operations, such as:

        pos()   [as a synonym for &pos, instead of a runtime error]
        span(c) [tab(many(c))]
        skipto(n)   [like tab, but returns "" instead of the matched string]
        skip(c)     [skipto(many(c))]
        skip(s)     [skipto(match(s))]
        substring(p1,p2)  [&subject[p1:p2]]
        etc...

     then the above could be:

         return 3(skip(~&digits),p1 := pos(), span(&digits), pos(p1+N))

     (the calculation of ~&digits could be moved out of the expression if speed
      were important, of course) or

         return 3(skipto(&digits),p1 := pos(), span(&digits), pos(p1+N))

     In fact, virtually all of the proposed pattern matching functions would
     be *very* good candidates for new string scanning operations.  (The only
     ones that wouldn't are those that simply duplicate existing operations.)
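
     A rough sketch of how some of the operations listed above could be
     prototyped as plain procedures. The names skipto, skip, and substring
     come from the list above and are hypothetical, not existing built-ins;
     the sample subject string is made up. These versions still call tab()
     internally, so they only model the semantics and do not deliver the
     efficiency gain that built-in versions would:

```unicon
# Hypothetical helpers sketched from the proposal; none is a built-in.

procedure skipto(n)
    # Move to position n like tab(n), but produce "" rather than the
    # matched substring.  suspend keeps the move reversible on backtracking.
    suspend (tab(n), "")
end

procedure skip(x)
    # skip(c) for a cset is skipto(many(c));
    # skip(s) for a string is skipto(match(s)).
    if type(x) == "cset" then suspend skipto(many(x))
    else suspend skipto(match(x))
end

procedure substring(p1, p2)
    # Slice of the current scanning subject.
    return &subject[p1:p2]
end

procedure main()
    local name
    "   key=value" ? {
        skip(' ')                  # step over leading blanks
        name := tab(upto('='))     # text up to the '='
        skip("=")                  # step over the literal "="
        write(name, " -> ", substring(&pos, 0))
    }
end
```

     Dispatching skip() on type(x) is just one way to cover both the cset
     and the string form with a single name; two separate names would work
     equally well.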

(2) It's also clear that precompiling the pattern is a significant win (I'd
     like to see timing tests that included building the pattern *inside* the
     match as well as the prebuilt pattern.  [At least, I'm assuming that the
     pattern is prebuilt, and that doing such nasties as:

        double := PArbno(&letters) && PAny(&letters) $$ x && `x` &&
                      (PSpan(&letters) .| "")
        PAny := write
        line ?? double

     wouldn't cause problems, but that:

        PAny := write
        double := PArbno(&letters) && PAny(&letters) $$ x && `x` &&
                      (PSpan(&letters) .| "")
        line ?? double

     would.])

     Would it be possible to 'prebuild' scanning expressions in a similar way?
     Note that this is *much harder* than building patterns because of the
     (nice) integration of scanning with the rest of Unicon.  But I believe
     it's the right way to go.  For example, consider the following (valid!)
     Unicon program [I had been thinking about this as a possible Generator
     article, it's not just off the top of my head...]:

     ------------------------
     procedure main(args)
         aC := create (*tab(many('a')) = *tab(many('b')) = *tab(many('c')))
         every line := !&input do {
             if line ? (@^aC & pos(0)) then write("accepted")
                                       else write("rejected")
             }
     end
     ------------------------

     This is beautifully succinct, integrates perfectly with existing Unicon
     (since it *is* existing Unicon), and cleanly separates the scanning
     expression from its use, just as prebuilding patterns does.  Its
     drawbacks are:

     a. creating a coexpression is overkill for this task (it's slightly
        slower than the original, recursive solution on short input strings!)
     b. the need to constantly 'refresh' the coexpression is an added expense,
        since refreshing a coexpression is also an expensive operation
     c. combining patterns into bigger ones isn't as clean as it could be

     So, what about considering something like:

         aC := pattern (*span('a') = *span('b') = *span('c'))

     (see, those new scanning operations are already helping!).
     Here pattern 'captures' an expression a la create, but in a much more
     lightweight fashion, so that this pattern could be applied within string
     scanning with (say) @aC.  This would decouple the efficiency issue from
     the syntax.  An initial version could be implemented quickly by layering
     on top of the coexpression mechanism while work proceeds on how to
     improve the internal representation to make it more efficient.  (Since
     you know you're working on a 'pattern' (a string scanning pattern, *not*
     an S4 one), the compiler could perform some sort of transformation
     internally to improve performance.)
===========================================================
-- 
Steve Wampler -- [email protected]
The gods that smiled on your birth are now laughing out loud.
