demerphq <[EMAIL PROTECTED]> wrote:
[...]
:And pointed out that my fix was still wrong and then helped test the
:attached patch to ensure it really was right this time. Sorry about
:that.

Some meta considerations:
- my patch program doesn't think like Windows; it helped to s/\r//
  and to 's:\\:/:g if /^---/'
- it would be helpful to use the same standard for indentation as perl:
  indent is 4 spaces, tabs is 8, use tabs as much as possible
- the formatting is also suspect, which makes the code harder to read:
  you are inconsistent, for example, about "a=b" v. "a = b". In general,
  try to space operators consistently, when in doubt add a space.

Compiling gave some warnings:
regcomp.c: In function `S_make_trie':
regcomp.c:735: warning: unused variable `next_off'
regcomp.c:785: warning: `scan' might be used uninitialized in this function
regcomp.c:910: warning: `charid' might be used uninitialized in this function
regcomp.c:915: warning: `scan' might be used uninitialized in this function
regexec.c: In function `S_regmatch':
regexec.c:2490: warning: comparison of distinct pointer types lacks a cast
regexec.c:2571: warning: comparison of distinct pointer types lacks a cast
regexec.c:2297: warning: `accept' might be used uninitialized in this function
regexec.c:2562: warning: `uscan' might be used uninitialized in this function

The 'unused variable' warning is correct; so are the two 'distinct pointer
types' warnings (U8* v char*); the 'might be used' warnings
are wrong, but it took quite an effort to determine that for all but the
'charid' case: I'd be tempted to initialise each of the pointers 'scan',
'accept' and 'uscan' to '(U8*)NULL' to clarify, and ensure at least
reasonably deterministic behaviour in the case of a bug.

(I also notice other warnings not caused by this patch:
pp_sys.c: In function `Perl_pp_sysread':
pp_sys.c:1701: warning: value computed is not used
pp_pack.c: In function `S_next_symbol':
pp_pack.c:800: warning: `allowed' might be used uninitialized in this function
pp_pack.c: In function `S_unpack_rec':
pp_pack.c:1152: warning: `last' might be used uninitialized in this function
pp_pack.c:1313: warning: `bits' might be used uninitialized in this function
pp_pack.c:1324: warning: `bits' might be used uninitialized in this function
pp_pack.c:1350: warning: `bits' might be used uninitialized in this function
pp_pack.c:1361: warning: `bits' might be used uninitialized in this function
.. which it would be nice to see cleaned up.)

I'm somewhat worried about the tests: there don't seem to be very many
(13 new tests, if the 100 numbers are seen as a single test), and some
of them are relying on non-guaranteed aspects of the regexp engine (eg:
    "words"=~/(word|word|word)(?{push @got,$1})s$/;
    ok(@got==1,"TRIE optimation is working") or warn "# @got";
). In general an optimisation should have no discernible effect on
anything except resource use, so I'd expect it to be somewhere between
difficult and impossible to test that the optimisation is occurring
unless there are low-level debugging features that allow you to check.
That's not really a problem though - the test suite should be concentrating
on behaviour rather than how that behaviour is achieved.

In particular, all the new tests are for successful matches: there should
at least be some for failing matches, and preferably some for more complex
cases (eg where two sets of EXACT branches are separated by something else:
/foo|bar|z.|baz|quux/, or where order is important: /foo|fool/, or perhaps
a recursive regexp).

It would be useful to add a comment explaining this:
-#define REPORT_CODE_OFF 24
+#define REPORT_CODE_OFF 29

In regmatch(), I don't understand why there is an extra { ... } enclosing
the 3 TRIE(|F|FL) cases: it doesn't seem to add anything except an extra
level of indentation. (Maybe it is a hangover from a failed attempt to
introduce variables at that scope.) That also makes it more obvious
that the unreachable break can be removed, since it is immediately
preceded by a mandatory sayNO.

Minor optimisations: I'd swap the cases for TRIE and TRIEFL? on the
assumption that TRIE is the more common case, so shouldn't need the
goto; I'd also duplicate the 'sayNO if !accepted' check to avoid an
extra goto in that case, which would also make the 'TrieAccept' label
a more accurate name.

I'd be tempted to s/acceptlen/acceptsize/, or comment that it is the
buffer size - my automatic assumption was that it would be the length
of either the currently accepted or the minimal acceptable text, which
left me very confused that it was being initialised to 8.

In general, it is not safe in perl to malloc memory, go call some
arbitrary code (like regmatch()), then free it: that code may bypass
the return (eg because some perl code called die()), in which case you
will leak the memory:

  perl -e 'eval { "foo" =~ /(foo|bar)(?{ die "never mind" })/ } while 1'

In sv.c:
-       switch (d->what[i]) {
+       switch (d->what[i]) { /* sfpont */
.. the cryptic comment might usefully be replaced with a pointer to the
definitions in regcomp.h, with the benefit that the comment wouldn't get
out of date quite so readily.

  * [Note, when REGALIGN is defined there are two places in regmatch()
- * that bypass this code for speed.]
+ * that bypass this code for speed.
+ * demq: Is this still true?? 2005-01-31.]

Grepping shows that REGALIGN was removed after perl-5.004.

That's as much as I have time for now, I'll try to look at study_chunk()
later.

Hugo

Reply via email to