Re: [mp2] OutputFilter with UTF-8 characters

Stas Bekman Tue, 11 Nov 2003 18:58:57 -0800

Matthew Darwin wrote:

After extensive playing around with this (inside mod_perl and out), I have come up with two observations:

1) doing regexes on UTF-8 characters split across buckets in an output filter does not seem to be a problem. All my regexes are against ASCII characters.

Good. But could this happen in the future? Did you try writing a test that injects UTF-8 data and arranges for the filter to split it in the middle of a character?
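A minimal sketch of such a test in plain Perl, outside mod_perl. It assumes the filter works on raw bytes and uses only ASCII patterns; the rewrite_chunk() helper and the sample string are made up for illustration. The chunk boundary is placed between the two bytes of a UTF-8 encoded character:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical per-bucket rewrite that uses ASCII-only patterns.
sub rewrite_chunk {
    my ($chunk) = @_;
    $chunk =~ s{<b>}{<strong>}g;
    $chunk =~ s{</b>}{</strong>}g;
    return $chunk;
}

# U+00E9 encodes to the two bytes 0xC3 0xA9 in UTF-8.
my $bytes = encode('UTF-8', "caf\x{e9} <b>x</b>");

# Split after the fourth byte, i.e. between 0xC3 and 0xA9.
my @chunks = (substr($bytes, 0, 4), substr($bytes, 4));

# Process each chunk on its own, as a streaming filter would.
my $out = join '', map { rewrite_chunk($_) } @chunks;

# Decoding croaks if the reassembled byte stream is not valid UTF-8.
my $text = decode('UTF-8', $out, Encode::FB_CROAK() | Encode::LEAVE_SRC());
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so an ASCII-only pattern cannot match inside a split character. A pattern that itself contains non-ASCII characters would not have that guarantee and would need the incomplete tail carried over to the next bucket.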

2) mod_perl seems to get confused when I use $_ in a content handler and a filter for the same request.

First of all, thanks a lot for tracking this down. Who would think that a response handler doing:

    print while <FOO>;

would affect the special variables in the filter called by print? (The same problem happens in input filters.)

If you think of the filter as just a function called from inside print, this behavior has nothing to do with mod_perl, since both run inside the same Perl interpreter.

So you were actually bitten, without even knowing it, by a known bad programming practice:

    for (@list) { sub_that_may_use_unlocalized_dollar_underscore(); }

It's just so tricky to spot.
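Here is a small standalone sketch of that trap, outside mod_perl. The two helper subs are hypothetical stand-ins for a filter; the only difference between them is whether $_ is localized:

```perl
use strict;
use warnings;

# Hypothetical callee: the implicit-$_ form of while(<FH>) assigns to
# the global $_, which the caller's for loop has aliased to a list element.
sub careless_callee {
    open my $fh, '<', \"x\ny\n" or die "open: $!";
    while (<$fh>) { }    # leaves undef in $_ once EOF is reached
    close $fh;
}

# Same callee, but with $_ localized for the duration of the call.
sub careful_callee {
    local $_;
    open my $fh, '<', \"x\ny\n" or die "open: $!";
    while (<$fh>) { }
    close $fh;
}

my @clobbered = ('a', 'b', 'c');
for (@clobbered) { careless_callee() }    # elements are overwritten

my @intact = ('a', 'b', 'c');
for (@intact) { careful_callee() }        # elements are left alone

printf "clobbered: %d defined, intact: %d defined\n",
    scalar(grep { defined } @clobbered),
    scalar(grep { defined } @intact);
```

A response handler that prints from inside a loop over $_ is in the position of the for loops here, and the filter is in the position of the callee.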

In the worker MPM I think this could be solved without touching the code by using:

PerlInterpScope handler

(http://perl.apache.org/docs/2.0/user/config/config.html#C_PerlInterpScope_)

So the filter will use a different Perl interpreter. It'd be interesting to verify that it actually does the trick.

Notice that we are talking about $_, but this case affects all special Perl variables. For example, if your response handler goes into slurp mode:

    local $/;
    while (<DOC>) {
        print;
    }

an output filter handler will see $/ set to undef, while its author assumes it still holds the default value of "\n". The filter will then misbehave randomly, depending on what the caller did.
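To make the effect concrete, here is a standalone sketch using the input record separator $/ (default "\n", undef in slurp mode). Both subs are hypothetical stand-ins for a filter that reads its data record by record; only the second one sets $/ itself:

```perl
use strict;
use warnings;

# Hypothetical callee that silently relies on the default value of $/.
sub count_records_naive {
    my ($data) = @_;
    open my $fh, '<', \$data or die "open: $!";
    my $n = 0;
    while (defined(my $line = <$fh>)) { $n++ }
    close $fh;
    return $n;
}

# Same logic, but it localizes $/ and sets the value it needs.
sub count_records_safe {
    my ($data) = @_;
    local $/ = "\n";
    open my $fh, '<', \$data or die "open: $!";
    my $n = 0;
    while (defined(my $line = <$fh>)) { $n++ }
    close $fh;
    return $n;
}

my $text = "one\ntwo\nthree\n";
my ($naive, $safe);
{
    local $/;    # the caller goes into slurp mode
    $naive = count_records_naive($text);
    $safe  = count_records_safe($text);
}
print "naive: $naive, safe: $safe\n";
```

Called from the slurp-mode block, the naive version reads the whole buffer as a single record, while the safe version still sees three lines.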

So let's decide how we act upon this:

We definitely need to document that filter handlers which use special Perl variables must localize them and explicitly set them to the desired values, without ever relying on the defaults.

Moreover, they should watch for hidden uses of $_, as in the example you've shown:

        while ($f->read(my $buffer, BUFF_LEN)) {
                print STDERR "DEBUG BUFFER: [$buffer]\n\n";
                $f->print (do_it ($r, $leftover . $buffer));
        }

I'm not sure whether we should make a special case for $_ and localize it behind the scenes. I wonder whether someone may find this to be a problem, if they try to use the non-localized $_ for its side effects. Do you think the documentation suggestion above is sufficient? Or should we risk DWIM and fix it up? In the above example it is so hard to see that $_ is affected that I tend to think we should make $_ a special case and localize it behind the scenes.

__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:[EMAIL PROTECTED] http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com


--
Reporting bugs: http://perl.apache.org/bugs/
Mail list info: http://perl.apache.org/maillist/modperl.html
