On Fri, Nov 24, 2000 at 01:01:29AM -0500, Sam Tregar wrote:
> On Wed, 22 Nov 2000, Dan Sugalski wrote:
> 
> > Probably the easiest thing is to implement some sort of file-tied scalar or
> > something that can provide bytes to the regex engine until it stops asking
> > for them. Some magic or other, though, will get us what we need.
> 
> That might be the easiest thing for us - as internals programmers - but
> does it answer the general need?  Everyone writing regex-based parsers
> faces this problem.  Maybe this is something to toss to perl6-language and
> get some RFC'd Larry-fried syntax?

I think Dan was suggesting that the (user side) regex doesn't change at all,
so there's no new syntax there.
It's just that the innards of perl gain a tied scalar that doesn't actually
read in and buffer the file immediately, but defers the read for as long as it
can get away with, and that the regex engine knows about these lazy scalars
and provokes the read-more when needed.

But maybe explicitly being able to go

$file_handle =~ /(ba*)/;

and it working DWIM could be somewhat useful.
(except that if the match fails you don't have the data buffered
anywhere obvious, unless there's collusion between PerlIO and the regexp
engine)
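
To make the buffering concern concrete, here is a rough user-level sketch in
today's Perl. match_lazily is a made-up helper, not anything perl provides,
and the chunk size is arbitrary. The caller owns the buffer, so a failed match
still leaves everything that was read somewhere obvious. It has to guess
whether more input could change the result (it only distrusts matches that
touch the end of the buffer), and that guess is wrong for patterns such as
/abc|a/ or ones with lookahead, which is exactly why the real thing would need
the regex engine to say when it ran off the end.

```perl
use strict;
use warnings;

# Hypothetical helper, not a perl built-in: match $re against the data
# behind $fh, reading only as much as is needed. $$buf_ref accumulates
# everything read so far, so nothing is lost if the match fails.
sub match_lazily {
    my ($fh, $re, $buf_ref, $chunk_size) = @_;
    $chunk_size ||= 4096;
    $$buf_ref = '' unless defined $$buf_ref;
    my $eof = 0;
    while (1) {
        if ($$buf_ref =~ $re) {
            # A match that touches the end of the buffer might grow once
            # more data arrives, so only accept it there at EOF.
            return ($-[0], $+[0]) if $eof || $+[0] < length $$buf_ref;
        }
        return if $eof;    # whole input is buffered and nothing matched
        # Append the next chunk to the end of the buffer.
        my $got = read($fh, $$buf_ref, $chunk_size, length $$buf_ref);
        die "read failed: $!" unless defined $got;
        $eof = 1 if $got == 0;
    }
}
```

On success it returns the start and end offsets of the match within the
buffer; on failure it returns nothing and the buffer holds the entire input.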

> Also, a nagging question - how does a regex-based parser work without
> ending up reading the entire file into memory most of the time?  Even with
> an intelligent tied-scalar reading bytes there's going to be failing cases
> where the regex has to walk to the end of the "string" to find out it
> failed.  Presumably it would also need to seek back to the start which
> means we'd have to buffer as we go.

I don't think that this differs from the current parser. If it encounters
open " but never a close ", it will read and buffer to the end of file
before realising that there's a problem. (because strictly there isn't
a problem until EOF is encountered before the closing ")

I'm not certain there's anything that can actually be done to avert the need
to buffer a lot of script in these situations. You mustn't attempt to seek
the script file handle as it might be from something unseekable such as a
pipe (or socket. BEGIN {socket STDIN...})
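
A small sketch of the unterminated-quote case, assuming a simplified
double-quoted string with backslash escapes (read_quoted is a hypothetical
helper and has nothing to do with how toke.c really works). It only ever
reads forwards and keeps what it has read, so it behaves the same on a pipe
as on a file, and with a missing close " it ends up holding the whole input.

```perl
use strict;
use warnings;

# Hypothetical helper: the text in $$buf_ref (plus whatever is still to be
# read from $fh) must start at an opening double quote. Returns the string
# body if the closing quote is found, or undef if EOF arrives first.
# Only sequential reads are used, never seek().
sub read_quoted {
    my ($fh, $buf_ref) = @_;
    while (1) {
        # Body: any run of ordinary characters or backslash escapes,
        # followed by the closing quote. Rescans from the start each
        # time, which is fine for a sketch but quadratic on long input.
        return $1 if $$buf_ref =~ /\A"((?:[^"\\]|\\.)*)"/s;
        my $line = <$fh>;
        return undef unless defined $line;    # EOF: only now is it an error
        $$buf_ref .= $line;
    }
}
```

When the close " does turn up, reading stops at the line containing it and the
rest of the input stays unread on the handle.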

Nicholas Clark
