Re: Exposing regexp engine & compiled regexp's

Branden Wed, 10 Jan 2001 12:40:04 -0800

Quoted from http://www.perl.com/pub/2000/09/ilya.html,
an interview with Dr. Ilya Zakharevich:
>
> Q: Could you describe in more detail what additional text-
> handling primitives you would like to see included with Perl?
> What string munging operations are absent that really ought to
> be included in Perl's core?
>
> A: The problem: Perl's text-handling abilities do not scale
> well. This has two faces, both invisible as far as you confine
> yourselves to simple tasks only. The first face is not that
> Perl lacks some "operations;" it is not that some "words" are
> missing, whole "word classes" are not present. Imagine
> expressive power of a language without adjectives.
>
> In Perl text-handling equals string-handling. But there is
> more in a text than the sequence of characters. You see a text
> of a program - you can see boundaries of blocks, etc.; you see
> an English text, you can see word boundaries and sentence
> boundaries, etc. With the exception of the word boundaries,
> all these "distinctive features" become very hard to recognize
> by a "local inspection of a sequence of characters near an
> offset" - unless you agree to use a heuristic which works only
> time to time. But a lot of problems require recognition of the
> relative position of a substring w.r.t. these "distinctive
> features".
>
> Remember those "abstract algorithms" books and lessons? You
> can solve the problems "straightforwardly," or you can do it "
> smartly." Typically, "straightforward" algorithms are easy to
> code, but they do not scale well. Smart algorithms start by an
> appropriate preprocessing step. You organize your data first.
> The particular ways to do this may be quite different: you
> sort the data, or keep an "index" of some kind "into your data,"
> you hash things appropriately, your balance some trees, and
> so on. The algorithms use the initial data together with such
> an "index."
>
> Perl provides a few primitives to work with strings, which are
> quite enough to code any "straightforward" algorithm. What
> about "smart" ones? You need preprocessing. Typically, digging
> out the info is easy with Perl, but how would you store what
> you dug? The information should be kept "off band," for example,
> in an array or hash of offsets into the string.
>
> Now modify the string a little bit, say, perform some s()()
> substitutions, or cut-and-paste with substr(). What happens
> with your "off band" information? It went out of sync. You
> need to update your annotating structures. Do not even think
> about doing s()()g, since you do not have enough info about
> the changes after the fact. You need to do your s()() one-by-
> one - but while s()()g is quite optimized, a series of s()()
> is not - and you get stuck again into the land of badly
> scaling algorithms.
>
> (Strictly speaking, for this particular example s()()eg could
> save you - as well as code-embedded-into-a-regular-expression,
> but this was only a simple illustration of why off-band data
> is not appropriate for many algorithms. Please be lenient with
> this example!)
>
> Even if no modification is done, using off-band data is very
> awkward: how to check what are the attributes of the character
> at offset 2001 when there are many different attributes, each
> marking a large subset of the string?
>
> That was the problem, and the solution supported by many text-
> processing systems is to have "in-band annotations", which is
> recognized by the editing primitives, and easily queryable.
> Perl allows exactly one item of in-band data for strings: pos
> (), which is respected by regular expressions. But it is not
> preserved by string-editing operations, or even by $s1 = $s2!
>
> "In-band" data comes in several "kinds". A particular "kind"
> describes:
>
> - how it behaves with respect to insertion or deletion of
> characters nearby;
> - can the "same" markup appear "several times";
> - can the markup "nest" (like nested comments in some languages
> ); and
> - is there an internal structure of the markup (as in a loop,
> which may be
>
>     [[LABEL DELIM0] KEYWORD [DELIM1 VAR1 SEP VAR2 ... DELIM2]
>      [DELIM4 EXPR DELIM4] [DELIM5 BODY DELIM6]]
>
> - with some parts possibly missing, so the internal structure
> is a tree).
>
> Different answers lead to a zoo of intuitively different kinds
> of markup, each kind useful for some categories of problems.
> You can mark "gaps between" characters, or you can mark
> characters themselves. The markup may "name" a position ("the
> first __END__ in a Perl program"), or cover a subset of the
> string ("show in red", "is a link to this URL", or "inside
> comment"). Since the kind of the markup defines what happens
> when the string is modified, the system can support self-
> consistency of the markup "automatically" (in exceptionally
> complicated cases one may need to register a callback or two).
>
> The second face of problem is not with the expressive power of
> Perl, but with the implementation. Perl has a very rigid rule:
> a string must be stored in a consecutive sequence of bytes.
> Remove a character in the middle of the string, and all the
> chars after it (or before it) should be moved. As I said, s()()g
> has some optimizations which allow doing such movements "in
> one pass", but what if your problem cannot be reduced to one
> pass of s()()g? Then each of the tiny modification you do one-
> at-a-time may require a huge relocation - or maybe even
> copying of the whole string. This is why a lot of algorithms
> for text manipulation require a "split buffer" implementation,
> when several chunks of the string may be stored (transparently!)
> at unrelated addresses.
>
> Such "split-buffer" strings may look incredibly hard to
> implement, as in "all the innards of Perl should be changed",
> but it is not. Just store "split strings" similarly to tie()d
> data. The FETCH (actually, the low-level MAGIC-read method)
> would "glue" all the chunks into one - and would remove the
> MAGIC - before the actual read is performed; and now no part
> of Perl requires any change. Now four or five primitives for
> text-handling may be changed to recognize the associated tie()d
> structures - and act without gluing chunks together. We may
> even do it in arbitrarily small steps, one opcode at a time.
>
> Another important performance improvement needed for many
> algorithms would be the copy-on-write, when several variables
> may refer to the same buffer in memory, or different parts of
> the same buffer - with suitable semantic what to do when one
> of these variables is modified. (In fact the core of this is
> already implemented in one of my patches!) Together with other
> benefits, this would solve the performance problems of $& and
> friends, as well as would make m/foo/; $& = 'bar'; equivalent
> to s/foo/bar/. Having copy-on-write substrings may be slightly
> more patch-intensive than copy-on-write strings, though. The
> complication: currently the buffers are required to be 0-
> terminated (so that they may be used with the system APIs). It
> is hard to make 'b' as in substr('abc',1,1) refer to the same
> buffer (containing "abc\0") as 'abc'. The solution may be to
> remove this requirement, and have two low-level string access
> API, SvPV() and SvPVz(), so that SvPVz() may perform the
> actual copying (as in copy-on-write) and the appending of \0 -
> but only when needed!
>
> Without these - or similar - changes Perl would not scale well
> as a language for efficient text-processing. What is more, I
> believe that the changes above can remove most of the
> significant bottlenecks for the problems we have in text-
> processing of today. At least I know a lot of problems which
> would have feasible solutions given these changes.
>
> And I need not repeat that a handful of small extensions to
> the expressive power of the regular expression engine could
> radically extend the domain of its applicability. ;-)
>



That's exactly the kind of thing we can do by exposing regexp engine's guts.
Tags in the string we can implement by using lists, just as lisp would do
it. If we have SVs that actually independ from the implementation, we can
create a ``tagged-string'', that's seen as a string by the script but
internally implemented as a list. And, if we have access to the regexp
engine's guts, we can implement matches and substitutions against those
magic tagged strings.

The other thing about copy-on-write would be piece of cake, using the same
magic SVs for implementing strings that are substrings of other strings.

Branden.
Re: Exposing regexp engine & compiled regexp's

Reply via email to