On Sat, Dec 18, 2004 at 12:16:31PM +0200, Markus Laire wrote:
: Patrick R. Michaud wrote:
: >>Larry mentioned 're_tests' file from perl5-source. Is anyone working on 
: >>it currently? I could make a simple script to convert at least some of 
: >>it to this pge-testing format which uses p6rule_*
: 
: 'simple script' .. it isn't so simple anymore ;)

Sorry.  Well, okay, I'm not really sorry.  :-)

In fact, I might like to look at your 'simple script' when I get further
along on the p5-to-p6 translator...

: >I'm not aware of anyone working on it currently, so please go ahead
: >and do this!
: 
: This test seems to cause an infinite loop
: (with parrot_2004-12-16_160001)
: 
: p6rule_isnt('a--', '^[a?b?]*$', 're_tests 387 (#438)');  # infinite loop

Detecting failure to progress can be quite tricky, actually.  It's easy
enough to detect that it *might* be an infinite loop.  But that pattern
would succeed if the string were all a's and b's.  It's not enough to figure
out that you're at the same position or the same state.  You have to figure
out that you're at the same position and the same state, and you may well
have visited different positions in this state, or different states in
this position.  So a naive solution requires N**2 in time or space.
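
To make the (position, state) bookkeeping concrete, here is a minimal
Python sketch of a backtracking matcher for ^[a?b?]*$ that refuses to
revisit a pair it has already seen.  The NFA table, the state numbers,
and the name matches() are all invented for the illustration; this is
not how PGE or Perl 5 actually does it.

```python
# Hypothetical hand-built NFA for ^[a?b?]*$ (an assumption for this sketch,
# not PGE's real representation).  Each state maps to a list of
# (label, next_state); a label of None is an empty move that consumes no input.
NFA = {
    0: [(None, 1), (None, 3)],   # loop head: enter the body, or leave the loop
    1: [("a", 2), (None, 2)],    # a?
    2: [("b", 0), (None, 0)],    # b?, then back to the loop head
    3: [],                       # after the loop: must be at end of string ($)
}
ACCEPT = 3


def matches(text):
    """Return True if text matches ^[a?b?]*$, without looping forever."""
    seen = set()  # (state, position) pairs already explored

    def visit(state, pos):
        if (state, pos) in seen:
            # Same state AND same position: either we are going in circles
            # or this pair already failed, so give up on this branch.
            return False
        seen.add((state, pos))
        if state == ACCEPT:
            return pos == len(text)
        for label, nxt in NFA[state]:
            if label is None:
                if visit(nxt, pos):
                    return True
            elif pos < len(text) and text[pos] == label:
                if visit(nxt, pos + 1):
                    return True
        return False

    return visit(0, 0)
```

Without the seen set, matches('a--') would recurse forever around
0 -> 1 -> 2 -> 0 at position 1.  With it, the set can grow to (number of
states) times (number of positions) entries, which is the sort of
quadratic cost mentioned above.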

Henry Spencer's original regex routines simply disallowed expressions
that might be infinite.  We tried relaxing that in Perl 5, and got
it wrong more than one way.  I'm not actually sure what approach p5
takes right now, if any.
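
For comparison, the "simply disallow it" approach can be sketched as a
compile-time check that rejects a star whose operand could match the
empty string.  The tuple-based pattern tree and the names nullable()
and check() below are assumptions made up for the example, not
Spencer's actual code.

```python
# Hypothetical pattern tree, as nested tuples:
#   ("lit", "a")        a literal character
#   ("opt", node)       node?
#   ("star", node)      node*
#   ("seq", [nodes])    concatenation
#   ("alt", [nodes])    alternation

def nullable(node):
    """Return True if node can match the empty string."""
    kind = node[0]
    if kind == "lit":
        return False
    if kind in ("opt", "star"):
        return True
    if kind == "seq":
        return all(nullable(n) for n in node[1])
    if kind == "alt":
        return any(nullable(n) for n in node[1])
    raise ValueError("unknown node kind: %r" % (kind,))


def check(node):
    """Raise ValueError if any star in the tree has a nullable operand."""
    kind = node[0]
    if kind == "lit":
        return
    if kind in ("opt", "star"):
        if kind == "star" and nullable(node[1]):
            raise ValueError("star operand could be empty")
        check(node[1])
    else:
        for child in node[1]:
            check(child)
```

Under this rule [a?b?]* is rejected outright, since both a? and b? can
match nothing, while [a|b]* is accepted.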

: (Currently I skip all tests for $+ as pge-testing format doesn't support 
: this. I'm not sure if these are needed for anything, as it's trivial to 
: get endpoint from startpoint and string length.)

The whole notion of string positions as integers is now somewhat
problematic in the Unicode era.  Is a position of 5 to be interpreted
as 5 bytes, 5 codepoints, 5 graphemes, or 5 letters?  String positions
are probably opaque objects that return different integer values
in different contexts.  And there is no such thing as the "length"
of a string anymore, unless it's another opaque object representing
the position at the end of the string.  And we've outlawed "length"
as a too-general concept. You have to tell it what units you mean
(.bytes, .codes, .graphs), or maybe use .chars for the default meaning
in the current context, if we decide to allow that.
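
To see how far apart those units can be, here is a small Python sketch
that measures one string three ways.  The function names are made up
to echo .bytes, .codes and .graphs, UTF-8 is an assumed encoding, and
the grapheme count is only a crude approximation (it counts codepoints
that are not combining marks), not real grapheme segmentation.

```python
import unicodedata


def byte_count(s, encoding="utf-8"):
    """Length in bytes, which depends on the (assumed) encoding."""
    return len(s.encode(encoding))


def code_count(s):
    """Length in codepoints."""
    return len(s)


def approx_graph_count(s):
    """Rough grapheme count: codepoints that are not combining marks.

    This is an approximation for illustration only; it ignores most of
    the real segmentation rules.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "resume" with an acute accent on each e, written with the combining
# mark U+0301 after the base letter rather than as a precomposed character.
s = "re\u0301sume\u0301"
print(byte_count(s), code_count(s), approx_graph_count(s))  # prints: 10 8 6
```

So "position 5" lands in three different places in that string,
depending on which unit you meant.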

As long as we're banishing .length from strings, we're also banishing
it from arrays.  You have to use .elems for that.  (At least all
this specificity now allows us to ask for the length of an array in
codepoints or graphemes...)

Anyway, sorry about the diatribe, but this is an area where we'll be
battling our own imprecision for years to come, not to mention everyone
else's.

Larry
