Re: To get things started...
On Fri, Nov 24, 2000 at 01:01:29AM -0500, Sam Tregar wrote:
> On Wed, 22 Nov 2000, Dan Sugalski wrote:
> > Probably the easiest thing is to implement some sort of file-tied scalar or something that can provide bytes to the regex engine until it stops asking for them. Some magic or other, though, will get us what we need.
>
> That might be the easiest thing for us - as internals programmers - but does it answer the general need? Everyone writing regex-based parsers faces this problem. Maybe this is something to toss to perl6-language and get some RFC'd Larry-fried syntax?

I think Dan was suggesting that the (user side) regex doesn't change at all (so there's no new syntax there). It's just that the innards of perl gain a tied scalar that doesn't actually read in and buffer the file immediately, but defers it as long as it can get away with. And that the regex engine knows about these lazy scalars and provokes the read-more when needed.

But maybe explicitly being able to go

$file_handle =~ /(ba*)/;

and it working DWIM could be somewhat useful. (Except that if the match fails you don't have the data buffered anywhere obvious, unless there's collusion between PerlIO and the regexp engine.)

> Also, a nagging question - how does a regex-based parser work without ending up reading the entire file into memory most of the time? Even with an intelligent tied-scalar reading bytes there's going to be failing cases where the regex has to walk to the end of the "string" to find out it failed. Presumably it would also need to seek back to the start which means we'd have to buffer as we go.

I don't think that this differs from the current parser. If it encounters an open " but never a close ", it will read and buffer to the end of file before realising that there's a problem (because strictly there isn't a problem until EOF is encountered before the closing "). I'm not certain there's anything that can actually be done to avert the need to buffer a lot of script in these situations.

You mustn't attempt to seek the script file handle, as it might be from something unseekable such as a pipe (or socket. BEGIN {socket STDIN...})

Nicholas Clark
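
The lazy-scalar idea discussed above can be sketched roughly as follows. This is an illustration in Python, not Perl internals and not anything the thread's authors proposed in this form: `LazyText`, `char_at`, and `match_ba_star` are hypothetical names, and a hand-written matcher for /(ba*)/ stands in for a regex engine that "provokes the read-more when needed". The stream is only ever read forward, never seeked, and everything read is kept in a buffer so the matcher can revisit it.

```python
import io


class LazyText:
    """Text that is read from a stream only when a matcher asks for it.

    Hypothetical stand-in for the "lazy scalar" idea: the stream is never
    seeked, so every chunk that has been read is retained in a buffer.
    """

    def __init__(self, stream, chunk_size=4):
        self._stream = stream          # only read() is called, never seek()
        self._chunk_size = chunk_size
        self._buf = ""                 # everything read so far
        self._eof = False

    def char_at(self, i):
        """Return the character at index i, or None past end of input."""
        while i >= len(self._buf) and not self._eof:
            chunk = self._stream.read(self._chunk_size)
            if chunk:
                self._buf += chunk
            else:
                self._eof = True
        return self._buf[i] if i < len(self._buf) else None

    @property
    def buffered(self):
        """Number of characters pulled from the stream so far."""
        return len(self._buf)


def match_ba_star(text, start=0):
    """Hand-written matcher for /(ba*)/ over a LazyText.

    Returns (begin, end) indices of the match, or None if no 'b' is found.
    """
    i = start
    while True:
        c = text.char_at(i)
        if c is None:
            return None                # walked to EOF without finding 'b'
        if c == "b":
            break
        i += 1
    j = i + 1
    while text.char_at(j) == "a":      # keep asking for bytes while they match
        j += 1
    return (i, j)
```

With a successful match the matcher stops asking early and most of the stream stays unread; with a failing match it has to walk to EOF, and the whole input ends up in the buffer, which is the cost raised in the question above.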
Re: To get things started...
On Fri, 24 Nov 2000, Nicholas Clark wrote:
> I think Dan was suggesting that the (user side) regex doesn't change at all (so that's no new syntax there) It's just that the innards of perl gains a tied scalar that doesn't actually read in and buffer the file immediately, but defers it as long as it can get away with. And that the regex engine knows about these lazy scalars and provokes the read-more when needed.

Right. And I was suggesting that while this might solve our problem it wouldn't do much for all the other people that have to solve the same problem. I'd like to see a general solution accessible from Perl. If that solution is some tied-scalar magic, fine. If it's more involved than that (and I think it will be) then we'll need to think about the syntax a bit.

> I don't think that this differs from the current parser. If it encounters open " but never a close ", it will read and buffer to the end of file before realising that there's a problem. (because strictly there isn't a problem until EOF is encountered before the closing ") I'm not certain there's anything that can actually be done to avert the need to buffer a lot of script in these situations.
>
> You mustn't attempt to seek the script file handle as it might be from something unseekable such as a pipe (or socket. BEGIN {socket STDIN...})

I suppose that's true. I was imagining something less extreme than the absolute failure case of a missing closing ". I'm imagining a failure that is recoverable but still requires running the regex to the end of the "string" to find that out. Are there any like this? Perhaps not. Perhaps this just isn't a reasonable criticism of regex parsers since normal parsers do it all the time anyway!

-sam
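
To make the unterminated-quote case concrete, here is a rough sketch in Python (not Perl, and not the actual perl tokenizer). The function `scan_quoted` is a hypothetical helper that scans chunks from a forward-only source, ignores backslash escapes, and reports how much it had to buffer. It is only meant to show that any scanner, regex-based or not, cannot report the error before EOF.

```python
def scan_quoted(chunks):
    """Scan the first double-quoted string from an iterable of text chunks.

    Returns (body, chars_buffered). body is None if EOF arrived before the
    closing quote. Simplifying assumption: no backslash escapes.
    """
    buf = ""       # the source is forward-only, so keep everything read
    start = None   # index of the opening quote, once seen
    pos = 0        # next index in buf to examine
    for chunk in chunks:
        buf += chunk
        while pos < len(buf):
            if buf[pos] == '"':
                if start is None:
                    start = pos
                else:
                    # Closing quote found: stop without reading further chunks.
                    return buf[start + 1:pos], len(buf)
            pos += 1
    # Only at EOF do we learn that the string was never closed.
    return None, len(buf)
```

For example, `scan_quoted(iter(['print "he', 'llo"; ', 'more ' * 100]))` stops after the second chunk, whereas an input with no closing quote is buffered in full before the failure can be reported.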
Re: SvPV*
On Fri, 24 Nov 2000 08:54:43 +0100, Roland Giersig wrote:
> Maybe the title should be: "Perl should use XML as its basic data type instead of linear strings"

Horrible. I kinda liked your original proposal. But you should NOT focus on XML. That leaves out too many other possible data sources: RTF, for example, or TeX. What is typical is that it is marked-up text, in the form of a tree, i.e. properly nested. The internal structure might as well be easily representable as XML.

I do think that the term "non-linear text" is absolutely unclear.

-- Bart.