Re: (blead patch) New TRIE regex patch

demerphq Sun, 27 Feb 2005 02:18:16 -0800

On Sat, 26 Feb 2005 17:20:20 +0000, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> demerphq <[EMAIL PROTECTED]> wrote:
> [...]
> :And pointed out that my fix was still wrong and then helped test the
> :attached patch to ensure it really was right this time. Sorry about
> :that.
> 
> Some meta considerations:
> - my patch program doesn't think like Windows; it helped to s/\r//
>  and to 's:\\:/:g if /^---/'


Ok, ill dos2unix the patches and convert the headers to a unix friendly format.

> - it would be helpful to use the same standard for indentation as perl:
>  indent is 4 spaces, tabs is 8, use tabs as much as possible

Ill do my best. My editor doesnt like mixing tabs and 4-space indents
tho. It has a mode to convert spaces to tabs so ill try that next
time.

> - the formatting is also suspect, which makes the code harder to read:
>  you are inconsistent, for example, about "a=b" v. "a = b". In general,
>  try to space operators consistently, when in doubt add a space.

Willdo. The occasional odd indentation is something i only noticed
recently and appears to be due to me using diff -w which i wont do
anymore.

> 
> Compiling gave some warnings:

Dang thats annoying, different C compilers seem to generate different
warnings. I dont see any of these. I think ill start doing cygwin
builds too just to see what gcc has to say.

> regcomp.c: In function `S_make_trie':
> regcomp.c:735: warning: unused variable `next_off'

Fixed.

> regcomp.c:785: warning: `scan' might be used uninitialized in this function
> regcomp.c:910: warning: `charid' might be used uninitialized in this function
> regcomp.c:915: warning: `scan' might be used uninitialized in this function

All non problems as you mention below, and my compiler doesnt moan
about them. But ill initialize them as you ask.

> regexec.c: In function `S_regmatch':
> regexec.c:2490: warning: comparison of distinct pointer types lacks a cast
> regexec.c:2571: warning: comparison of distinct pointer types lacks a cast

Ok, Yitzchak mentioned this too, but i dont get a warning so i wasnt
sure what to do. Ill add an explicit cast to U8 *.

> regexec.c:2297: warning: `accept' might be used uninitialized in this function
> regexec.c:2562: warning: `uscan' might be used uninitialized in this function
> 
> The 'unused variable' warning is correct; so are the two 'distinct pointer
> types' warnings (U8* v char*); the 'might be used' warnings
> are wrong, but it took quite an effort to determine that for all but the
> 'charid' case: I'd be tempted to initialise each of the pointers 'scan',
> 'accept' and 'uscan' to '(U8*)NULL' to clarify, and ensure at least
> reasonably deterministic behaviour in the case of a bug.

Ok, im sorry about these, ive mentioned above what ill do about them.
I left some of them uninitialized as this is supposed to be fast code
and i checked their usage, but if you think that initializing them is
better then no problem.

> 
> (I also notice other warnings not caused by this patch:
> pp_sys.c: In function `Perl_pp_sysread':
> pp_sys.c:1701: warning: value computed is not used
> pp_pack.c: In function `S_next_symbol':
> pp_pack.c:800: warning: `allowed' might be used uninitialized in this function
> pp_pack.c: In function `S_unpack_rec':
> pp_pack.c:1152: warning: `last' might be used uninitialized in this function
> pp_pack.c:1313: warning: `bits' might be used uninitialized in this function
> pp_pack.c:1324: warning: `bits' might be used uninitialized in this function
> pp_pack.c:1350: warning: `bits' might be used uninitialized in this function
> pp_pack.c:1361: warning: `bits' might be used uninitialized in this function
> .. which it would be nice to see cleaned up.)

I think the pack ones are works in progress by someone else right now,
But ill look at the pp_sys ones.

> I'm somewhat worried about the tests: there don't seem to be very many
> (13 new tests, if the 100 numbers are seen as a single test),

Well, i did actually start to add cases but as i reviewed the checks
already present I realized there already exists a whole host of tests
which verify this behaviour already. I added the ones that i felt
necessary to cover the gaps. But the behaviour of alternations
appeared to me to be reasonably well covered. I just did a count and
it would appear that about 125 tests in re_tests would end up
exercising the TRIE code. Thats about %13 so i think the general test
suite pretty well covers the TRIE stuff.

> and some
> of them are relying on non-guaranteed aspects of the regexp engine (eg:
>    "words"=~/(word|word|word)(?{push @got,$1})s$/;
>    ok(@got==1,"TRIE optimation is working") or warn "# @got";
> ). In general an optimisation should have no discernible effect on
> anything except resource use, so I'd expect it to be somewhere between
> difficult and impossible to test that the optimisation is occurring
> unless there are low-level debugging features that allow you to check.

Well, actually this test is very specifically oriented at determining
whether the optimisation is actually occuring. Its about the only way
I know of without parsing the output of re=debug to tell. And I
suppose this is a good point:

The only behaviour that the patch changes is exactly this. If multiple
identical words exist in an alternation the later ones are silently
ignored. For all practical purposes this has no change on matching
behaviour, but for cases where we have a code block being executed
from the regex engine we can see the difference.  Personally I dont
see that this is a problem as other optimisations can kick in and have
the same effect, for instance in some situations the perl code will
not be called at all as the charclass optimisation will prevent the
regex from ever getting executed.

For instance change that test to 

 "wordz"=~/(word|word|word)(?{push @got,$1})s$/;

and the code block will not be executed at all. 

Based on this type of evidence of existing inconsistancies in how
codeblocks are executed inside of the regex engine, and especially the
fact that such constructs are still marked as "experimental" I dont
really feel that a minor change in behaviour that this test looks for
is a big deal. And its a useful sanity check that the tests being run
are actually testing the trie patch. I added it in when after I built
with DO_TRIE 0 and ran the test suite and then realized I had just
tested the old behaviour.  So this test cant pass unless the TRIE
optimisation kicks in.

> In particular, all the new tests are for successful matches: there should
> at least be some for failing matches, and preferably some for more complex
> cases (eg where two sets of EXACT branches are separated by something else:
> /foo|bar|z.|baz|quux/, or where order is important: /foo|fool/, or perhaps
> a recursive regexp).

Ok, I can do this. Although I already am testing /foo|fool/ even it
may not look like it:

(WORDS|WORD)S   WORDS   y       $1      WORD
(X.|WORDS|X.|WORD)S     WORDS   y       $1      WORD
(WORDS|WORLD|WORD)S     WORDS   y       $1      WORD
(X.|WORDS|WORD|Y.)S     WORDS   y       $1      WORD

All check such scenarios. All require 'WORD' to match but have to
check other matches first. Also the location of the dot i the pattern
is important as sucha construct breaks the optimisation. Thus case 2
above wont get optimised, but case 4 will. But i agree I can add some
failure patterns and some additional tests to beef this up.

> 
> It would be useful to add a comment explaining this:
> -#define REPORT_CODE_OFF 24
> +#define REPORT_CODE_OFF 29

This makes fail message line up with the opcode they are from.
Currently the fail messages are outdented which makes reading the
debug output really hard. I made a few very slight changes to how the
debug output looked to make it properly indent, and to make it look
consistant. For instance the following is the output of a re=debug
with my patch:

Compiling REx "(?:ab|abc|abcd)$"
size 12 Got 100 bytes for offset annotations.

   1: TRIE(11)
      [Words:3 Chars Stored:9 Unique Chars:4 States:5]
        <ab>
        <abc>
        <abcd>
  10: TAIL(11)
  11: EOL(12)
  12: END(0)
minlen 2
Offsets: [12]
        3[1] 4[2] 0[0] 6[1] 7[3] 0[0] 10[1] 11[4] 0[0] 14[0] 16[1] 17[0]
Matching REx `(?:ab|abc|abcd)$' against `aaaabcd'
  Setting an EVAL scope, savestack=3
   0 <> <aaaabcd>         |  1:  TRIE
                                 failed...
  Setting an EVAL scope, savestack=3
   1 <a> <aaabcd>         |  1:  TRIE
                                 failed...
  Setting an EVAL scope, savestack=3
   2 <aa> <aabcd>         |  1:  TRIE
                                 failed...
  Setting an EVAL scope, savestack=3
   3 <aaa> <abcd>         |  1:  TRIE
   5 <aaaab> <cd>         | 11:    EOL
                                   failed...
   6 <aaaabc> <d>         | 11:    EOL
                                   failed...
   7 <aaaabcd> <>         | 11:    EOL
   7 <aaaabcd> <>         | 12:    END
Match successful!
Freeing REx: "(?:ab|abc|abcd)$"

and this is the 5.8.x behaviour:

Compiling REx `(?:ab|abc|abcd)$'
size 12 Got 100 bytes for offset annotations.

   1: BRANCH(4)
   2:   EXACT <ab>(11)
   4: BRANCH(7)
   5:   EXACT <abc>(11)
   7: BRANCH(10)
   8:   EXACT <abcd>(11)
  10: TAIL(11)
  11: EOL(12)
  12: END(0)
minlen 2
Offsets: [12]
        3[1] 4[2] 0[0] 6[1] 7[3] 0[0] 10[1] 11[4] 0[0] 14[0] 16[1] 17[0]
Matching REx `(?:ab|abc|abcd)$' against `aaaabcd'
  Setting an EVAL scope, savestack=3
   0 <> <aaaabcd>         |  1:  BRANCH
  Setting an EVAL scope, savestack=13
   0 <> <aaaabcd>         |  2:    EXACT <ab>
                              failed...
   0 <> <aaaabcd>         |  5:    EXACT <abc>
                              failed...
   0 <> <aaaabcd>         |  8:    EXACT <abcd>
                              failed...
  Clearing an EVAL scope, savestack=3..13
  Setting an EVAL scope, savestack=3
   1 <a> <aaabcd>         |  1:  BRANCH
  Setting an EVAL scope, savestack=13
   1 <a> <aaabcd>         |  2:    EXACT <ab>
                              failed...
   1 <a> <aaabcd>         |  5:    EXACT <abc>
                              failed...
   1 <a> <aaabcd>         |  8:    EXACT <abcd>
                              failed...
  Clearing an EVAL scope, savestack=3..13
  Setting an EVAL scope, savestack=3
   2 <aa> <aabcd>         |  1:  BRANCH
  Setting an EVAL scope, savestack=13
   2 <aa> <aabcd>         |  2:    EXACT <ab>
                              failed...
   2 <aa> <aabcd>         |  5:    EXACT <abc>
                              failed...
   2 <aa> <aabcd>         |  8:    EXACT <abcd>
                              failed...
  Clearing an EVAL scope, savestack=3..13
  Setting an EVAL scope, savestack=3
   3 <aaa> <abcd>         |  1:  BRANCH
  Setting an EVAL scope, savestack=13
   3 <aaa> <abcd>         |  2:    EXACT <ab>
   5 <aaaab> <cd>         | 11:    EOL
                              failed...
   3 <aaa> <abcd>         |  5:    EXACT <abc>
   6 <aaaabc> <d>         | 11:    EOL
                              failed...
   3 <aaa> <abcd>         |  8:    EXACT <abcd>
   7 <aaaabcd> <>         | 11:    EOL
   7 <aaaabcd> <>         | 12:    END
Match successful!
Freeing REx: `"(?:ab|abc|abcd)$"'

If you use a fixed width font youll see that the "failed" lines line
up with the opcode that failedin the first example but not the second.
Also in the patched version the Freeing REx message is formatted the
same as the Compiling REx but in the older code they are different.

> 
> In regmatch(), I don't understand why there is an extra { ... } enclosing
> the 3 TRIE(|F|FL) cases: it doesn't seem to add anything except an extra
> level of indentation. (Maybe it is a hangover from a failed attempt to
> introduce variables at that scope.) 

Yes, it was. Originally I wanted to put all of the vars needed by
those two cases inside of the block you mention, but this raised
problems with my debugger and I though somebody might shoot me for
doing it. Ill look at removing the remaining block.

>That also makes it more obvious
> that the unreachable break can be removed, since it is immediately
> preceded by a mandatory sayNO.

Ok, Ill remove the break.

> Minor optimisations: I'd swap the cases for TRIE and TRIEFL? on the
> assumption that TRIE is the more common case, so shouldn't need the
> goto; I'd also duplicate the 'sayNO if !accepted' check to avoid an
> extra goto in that case, which would also make the 'TrieAccept' label
> a more accurate name.

Ah yeah, ill change that around. Good idea. 
 
> I'd be tempted to s/acceptlen/acceptsize/, or comment that it is the
> buffer size - my automatic assumption was that it would be the length
> of either the currently accepted or the minimal acceptable text, which
> left me very confused that it was being initialised to 8.

Yeah, ok, fair call. Ill rename all of that stuff so that its much
more intuitive what its for.

> In general, it is not safe in perl to malloc memory, go call some
> arbitrary code (like regmatch()), then free it: that code may bypass
> the return (eg because some perl code called die()), in which case you
> will leak the memory:

What is the strategy to deal with these types of things then? Id
really appreciate some pointers on what to do with this issue as I
havent the foggiest.

>  perl -e 'eval { "foo" =~ /(foo|bar)(?{ die "never mind" })/ } while 1'
> 
> In sv.c:
> -       switch (d->what[i]) {
> +       switch (d->what[i]) { /* sfpont */
> .. the cryptic comment might usefully be replaced with a pointer to the
> definitions in regcomp.h, with the benefit that the comment wouldn't get
> out of date quite so readily.

Yeah, good point. Ill update the comment.

> 
>  * [Note, when REGALIGN is defined there are two places in regmatch()
> - * that bypass this code for speed.]
> + * that bypass this code for speed.
> + * demq: Is this still true?? 2005-01-31.]
> 
> Grepping shows that REGALIGN was removed after perl-5.004.

Ok, ill remove the entire comment about REGALIGN.

> That's as much as I have time for now, I'll try to look at study_chunk()
> later.

Many many thanks Hugo. I really appreciate it.

Cheers,
Yves


-- 
First they ignore you, then they laugh at you, then they fight you,
then you win.
  +Gandhi

Re: (blead patch) New TRIE regex patch

Reply via email to