ID:               45372
 User updated by:  [EMAIL PROTECTED]
 Reported By:      [EMAIL PROTECTED]
 Status:           Assigned
 Bug Type:         Scripting Engine problem
 Operating System: linux
 PHP Version:      5.3CVS-2008-06-27 (CVS)
 Assigned To:      helly
 New Comment:

Not sure why re2c needs to deal with the #bang situation at all.
Looking at the code, it would be better to eat that line outside of the
lexer.


Something like:

int ini_lex(zval *ini_lval TSRMLS_DC)
{
     if ((YYCTYPE*)yytext == SCNG(yy_start) && *yych == '#') {
         /* skip the rest of the first line */
         while(yych < yyend && *yych != '\r' && *yych != '\n') {
            yych++;
         }
         /* skip the line terminator(s) */
         while(yych < yyend && (*yych == '\r' || *yych == '\n')) {
            yych++;
         }
         YYCURSOR = yych;
     }
.....
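
For illustration, here is the same idea as a standalone helper. This is
only a sketch under my own assumptions: the name skip_hash_line and the
start/end pointer interface are hypothetical and are not part of the
scanner source. Unlike the snippet above, it consumes a single line
terminator ("\r\n", "\r" or "\n"), so blank lines that follow the first
line are left alone.

```c
#include <stddef.h>

/* Hypothetical helper (not from the PHP source): returns a pointer to the
 * first byte after a leading "#..." line, or `start` unchanged when the
 * buffer does not begin with '#'. */
static const char *skip_hash_line(const char *start, const char *end)
{
    const char *p = start;

    if (p >= end || *p != '#') {
        return start;
    }
    /* Skip to the end of the line; the bounds check comes before the
     * dereference so we never read past `end`. */
    while (p < end && *p != '\r' && *p != '\n') {
        p++;
    }
    /* Consume one line terminator: "\r\n", "\r" or "\n". */
    if (p < end && *p == '\r') {
        p++;
    }
    if (p < end && *p == '\n') {
        p++;
    }
    return p;
}
```

The caller would then point the scanner cursor at the returned position
before the first token is scanned.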


Previous Comments:
------------------------------------------------------------------------

[2008-06-27 11:26:18] [EMAIL PROTECTED]

Duplicated... Bug #45147

------------------------------------------------------------------------

[2008-06-27 09:31:21] [EMAIL PROTECTED]

This should work like in older releases. Marcus, please check it!

------------------------------------------------------------------------

[2008-06-27 09:05:48] [EMAIL PROTECTED]

(yyless(1) could just be used before the goto...)

Anyway, did you actually try that? AFAIK it still won't work, at least
with your single line example (which there's already been at least one
report about). While the local code fix is correct, the re2c code/logic
seems flawed to me. (Maybe this bug report can be about that instead, in
general, since I didn't get around to sending a follow-up message to the
internals@ list yet, explaining things. :-))

In this example, it will still be broken because of the YYFILL() check
-- each time it checks if the next character can match, even when it's
at the end of the input. YYFILL() then makes it return, completely
ignoring anything that has matched up to that point!

I'm not sure if this explanation is 100% correct, but I believe this
wrong behavior happens when EOF is encountered while trying to match the
variable length part of ANY rule; or something close to that. :-) It's
been over a month since I tried to track and figure out what was
happening. Granted, in most cases where the match is aborted because of
YYFILL() (unlike yours), the input is invalid code, but it still
shouldn't happen. BTW, I think the part with the inline_char_handler
label where
it looks for opening PHP tags in the HTML, while a good optimization
(using memchr() to find < etc.), was actually added as a workaround for
this re2c/YYFILL() behavior. I didn't try it, but from what I've
observed, I think whatever plain HTML was at the end of a file would
have been lost if a regular rule (like in Flex) was used to match it...

Oh, there are also some more bugs in the code that looks for the opening
PHP tag, but they wouldn't be found as easily as this (and haven't been
reported so far). I think I know how it can be fixed nicely, along with
some other scanner optimizations (for inline HTML and comments,
basically). But I haven't done anything yet since some of it won't even
work with these re2c/YYFILL() issues. :-/

Finally, to simplify what I think is the basic, underlying flaw with
the code of re2c and YYFILL() now, here's a super easy example. Say you
have one rule:

[a-z]+

It will NEVER match input that a person would expect it to match, such
as the string "foo" -- seems pretty messed up to me!?
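
To make that concrete, here is a small hand-written model of the
behavior described above. It is not re2c output, and the function names
are made up for this sketch. The first function assumes a YYFILL-style
check that returns from the scanner as soon as no more input is
available; the second treats end of input as an ordinary end of match.

```c
#include <stddef.h>

/* Hand-written model of a scanner for the single rule [a-z]+.
 * Returns the match length, or 0 if the scan was aborted or nothing
 * matched. */
static size_t scan_lower_abort_on_fill(const char *cursor, const char *limit)
{
    const char *start = cursor;

    for (;;) {
        /* Stand-in for a YYFILL(1) that bails out of the scanner when the
         * input is exhausted (an assumption about the setup described). */
        if (cursor >= limit) {
            return 0;   /* everything matched so far is thrown away */
        }
        if (*cursor < 'a' || *cursor > 'z') {
            break;      /* a non-matching byte ends the token normally */
        }
        cursor++;
    }
    return (size_t)(cursor - start);
}

/* Variant that accepts what has been matched when the input ends. */
static size_t scan_lower_accept_at_eof(const char *cursor, const char *limit)
{
    const char *start = cursor;

    while (cursor < limit && *cursor >= 'a' && *cursor <= 'z') {
        cursor++;
    }
    return (size_t)(cursor - start);
}
```

In this model the first variant only reports a match when some other
byte follows the letters; with "foo" as the whole input it gives up at
the end and reports nothing, while the second variant reports all three
characters.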

------------------------------------------------------------------------

[2008-06-27 06:03:16] [EMAIL PROTECTED]

marking as critical....

------------------------------------------------------------------------

[2008-06-27 06:00:54] [EMAIL PROTECTED]

Description:
------------
single line file:

<?php if (1) { ?>#<?php }  ?>

produces a parse error.

It gets caught by this rule from the re2c scanner:
<INITIAL>"#".+ {NEWLINE} {
        if ((YYCTYPE*)yytext == SCNG(yy_start)) {
                /* ignore first line when it's started with a # */
                goto restart;
        } else {
                goto inline_char_handler;
        }
}


Basically, the scanner runs off the end and eats everything after the
#.

I've fixed it by changing the above to something like:
} else {
        /* shunt back to just return the # on its own */
        YYCURSOR = YYMARKER;
        yyleng = 1;
        goto inline_char_handler;
}
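
The intent of that change, keeping only the '#' and handing the rest of
the match back to the input, can be sketched in isolation like this. The
struct and the function name are hypothetical and much simpler than the
real scanner globals; the operation is in the spirit of flex's yyless(n).

```c
#include <stddef.h>

/* Hypothetical, simplified scanner state; these names are not from PHP. */
struct scan_state {
    const char *text;    /* start of the current token */
    const char *cursor;  /* next byte to be scanned */
    size_t      leng;    /* length of the current token */
};

/* Keep only the first n bytes of the current token and give the rest
 * back to the input. Does nothing if the token is already that short. */
static void keep_first(struct scan_state *s, size_t n)
{
    if (n < s->leng) {
        s->cursor = s->text + n;
        s->leng = n;
    }
}
```

With a token that covers "#<?php }  ?>" plus the newline, keep_first(&s, 1)
leaves a one-byte token and a cursor pointing at the '<', so the opening
tag is scanned again as PHP.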









------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=45372&edit=1
