RE: Applying regexen/grammars to objects (was Re: String API)

Gordon Henriksen Mon, 25 Aug 2003 08:28:37 +0000

Benjamin Goldberg wrote:

> Gordon Henriksen wrote:
> 
> > Having a lazily slurped file string simply delays disaster, and
> > opens the door for Very Big Mistakes. Such strings would have to be
> > treated very delicately, or the program would behave very
> > inefficiently or crash.
> 
> Although Dan's convinced me that STRING*s don't need to be anything
> other than concrete, wholly-in-memory, non-active buffers of data,
> (for various and sundry reasons), I'm not sure why a lazily slurped
> file string would need to be treated "delicately".
> 
> In particular, what would make the program crash?


s/crash/use HUGE GOBS OF MEMORY and exhaust the system's swapfile/g.

> Why would you have the potential to load the entire file into 
> memory if you're careless?

Mutations would remain in memory, right? uc() such a string and watch
your swapfile fill right up. Or s///g. Or just in general change it.

And character indexing a file too big to fit in memory, when char
indexing is an O(n) problem for significant cases (UTF-8)...? Very
Bad things...
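
To make the O(n) point concrete, here is a minimal sketch in Python (used
purely for illustration; nothing here is Parrot's actual STRING* code). The
function name byte_offset_of_char is hypothetical, and the input is assumed
to be valid UTF-8. Finding where character number N starts means looking at
every byte before it:

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the byte offset at which character number char_index starts.

    Assumes data is valid UTF-8. Every byte before the target has to be
    inspected, so the cost grows with char_index: O(n), not O(1).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


# Two of these ten characters take two bytes each in UTF-8.
text = "na\u00efve caf\u00e9".encode("utf-8")
```

On a string backed by a file too big for memory, that scan turns into disk
reads from the start of the file for every index operation, unless some kind
of offset cache is kept, which is yet more state to manage.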

Or were you thinking that changes would be written back? In which
case... each string mod would have to rewrite the (huge, remember) file
from that point forward. Way to render an API useless.
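
A rough Python sketch of why write-back hurts, assuming a seekable binary
file object. The helper name replace_span is made up for illustration. When
the replacement is not the same length as the span it replaces, everything
after it shifts, so the whole tail gets read and written again:

```python
import io


def replace_span(f, start: int, end: int, replacement: bytes) -> int:
    """Replace bytes [start, end) of a seekable binary file in place.

    Returns the number of bytes written. A same-length replacement only
    touches the span; any other length forces the tail to be rewritten.
    """
    if len(replacement) == end - start:
        f.seek(start)
        f.write(replacement)
        return len(replacement)

    f.seek(end)
    tail = f.read()          # the (possibly huge) remainder of the file
    f.seek(start)
    f.write(replacement)
    f.write(tail)
    f.truncate()             # drop leftover bytes if the file shrank
    return len(replacement) + len(tail)


# An in-memory stand-in for a real file.
f = io.BytesIO(b"hello world, goodbye")
written = replace_span(f, 0, 5, b"HELLO!")   # one byte longer than the span
```

Here a six-byte edit at the front costs a rewrite of the entire buffer. Note
that the sketch also pulls the whole tail into memory, which is exactly what
the lazy string was supposed to avoid.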

I have no doubt that p6 will have file-tied strings which will address
many of these problems--they're just very complex and don't belong
inside STRING*.


> > And what if your admittedly huge file is larger than 2**32 bytes? (A
> > very real possibility! You said it was too big to fit in memory!)
> > Are you going to suggest that all STRING* consumers on 32-bit
> > platforms emulate 64-bit arithmetic whenever manipulating STRING*
> > lengths?
> 
> Blech.  Yeah, that *would* be annoying.  OTOH, they're already
> emulating 64-bit arithmetic whenever they deal with file offsets.  Or
> perhaps I should be saying, "bad enough that they're already ... With
> file offsets, we don't want to have to do it with string lengths,
> too."

I've got my money on option #2.
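
For a sense of what goes wrong without 64-bit lengths, a small Python
illustration. The helper as_uint32 is hypothetical and simply models storing
a value in an unsigned 32-bit field:

```python
def as_uint32(n: int) -> int:
    """Model storing n in an unsigned 32-bit field (high bits are lost)."""
    return n & 0xFFFFFFFF


file_size = 5 * 2**30             # a 5 GiB file
stored_length = as_uint32(file_size)
```

The 5 GiB file would report a length of exactly 1 GiB, and every offset past
2**32 would alias an earlier one.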


> >         grammar HTTPServer {
> >                 rule http {
> >                         (<request> <commit>)*
> >                 }
> >                 rule request {
> >                         <get_request> | <post_request> | ...
> >                 }
> >                 rule get_request {
> >                         GET <path> <version> <crlf>
> >                         <header>
> [snip]
> 
> You should have a <commit> after that CRLF there :)

Yeah, well, one could go after GET, too, and after <path>, and after
<version>, and every other non-optional protocol element. It gets noisy
after a while.


> > How cool is that? Just imagine trying to apply the same pattern to a
> > more long-lived protocol than HTTP, though--a database connection,
> > maybe, or IRC.
> 
> Through a database connection?  I can envision that for the purpose of
> implementing the protocol, [...]

I did indeed mean implementing the database protocol. Though, not
thRough.


> > [2] No doubt, unshift hacks[3] could be found to make the lazy
> > slurpy file string not crash. But these are just changes to make 
> > strings behave like streams, and would impose upon STRING*
> > consumers everywhere Very Strange things like those strings which
> > don't know their own length. A string wants to be a string, and a
> > stream wants to be a stream.
> 
> I wasn't considering allowing lazily slurped file strings on anything
> other than plain files (ones for which perl's "-f" operator returns
> true).
> 
> Thus, I can't see how the string wouldn't know its own length.

Fine, in theory--but UTF-8 and other variable-length encodings would
need to open and scan THE ENTIRE FILE at the time it was tied in order
to know their length in characters. Ouch.
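
Sketched in Python, assuming valid UTF-8 and a binary stream (the function
name utf8_char_count is made up for illustration): the byte length of a plain
file comes cheaply from the filesystem, but the character length needs a pass
over every byte.

```python
import io


def utf8_char_count(stream, chunk_size: int = 65536) -> int:
    """Count characters in a UTF-8 byte stream by reading all of it.

    Assumes valid UTF-8. Only bytes that are not continuation bytes
    (0b10xxxxxx) start a character, so those are the ones counted.
    """
    count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return count
        count += sum(1 for b in chunk if (b & 0xC0) != 0x80)


data = "na\u00efve caf\u00e9".encode("utf-8")
chars = utf8_char_count(io.BytesIO(data), chunk_size=3)
```

The chunked read keeps memory use flat, but the I/O cost is still the full
size of the file, paid before the string can answer a length query.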


> > [3] Unshift hack #1: Where commit appears in the above, exit the
> > grammar, trim the beginning of the string, and re-enter. (But that
> > forces the grammar author to discard the regex state, whereas commit
> > would offer no such restriction.) Unshift hack #2: Tell =~ that
> > <commit> can trim the beginning of the string. (DWIM departs;
> > /cgxism returns.)
> 
> Trimming off the beginning of the string is the job of the <cut>
> operator, not the <commit> operator.

Indeed, my bad--been a while since I read the Apocalypse.

> Hmm... I wonder how <cut> would be done with an iterator.  Bleh.

Equivalent to <commit>, I say.... Then your grammar rule can work on an
iterator, or on a string that's being used as a buffer.

Here's a question: How does $iter =~ /a+b/ work on an iterator which
returns "aaaaaaack!"? Requires a putback op.
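
A minimal Python sketch of what I mean by a putback op. The PutbackIter class
and match_a_plus_b function are hypothetical, and the match is anchored at
the current position for simplicity, unlike a real =~. The matcher consumes
the run of a's, hits "c" instead of "b", and has to hand everything back:

```python
class PutbackIter:
    """Wrap an iterator and allow items to be pushed back onto its front."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pending = []        # stack of items that were put back

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            return self._pending.pop()
        return next(self._it)

    def putback(self, item):
        self._pending.append(item)


def match_a_plus_b(chars: PutbackIter) -> bool:
    """Try to match /a+b/ anchored at the current position.

    On failure, every consumed character is put back so the iterator
    looks untouched to the next consumer.
    """
    consumed = []
    for ch in chars:
        consumed.append(ch)
        if ch == "a":
            continue
        if ch == "b" and len(consumed) > 1:
            return True
        break
    # Put back in reverse so the first character consumed comes out first.
    for ch in reversed(consumed):
        chars.putback(ch)
    return False


it = PutbackIter("aaaaaaack!")
matched = match_a_plus_b(it)
remaining = "".join(it)
```

Without putback, the failed match would have silently eaten the a's and the
"c". With it, the iterator needs an unbounded buffer, which is a string used
as a buffer by another name.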

I'm not sure about <cut> vs. <commit>. They seem so orthogonal, and they
pervasively tie a grammar to an implementation choice. It seems more
like an m:option.

--
 
Gordon Henriksen
IT Manager
ICLUBcentral Inc.
[EMAIL PROTECTED]
