Re: Output filters, data encoding
-=| p...@cpan.org, 14.11.2019 09:51:20 +0100 |=-
> On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> >     my($buffer, $string) = ("", "");
> >     while (read($fh, $buffer, 256, length($buffer))) {
> >         $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> >         # $buffer now contains the unprocessed partial character
> >     }
>
> This code is dangerous. It can enter into endless loop. Once you read
> invalid UTF-8 sequence, above loop never finish. So if buffer input is
> under user/attacker control you introduce DoS issues.

Sure. A check to prevent that would be in order. I must admit that I was
very happy to find a solution to the problem that was even in the official
documentation.

> Instead of FB_QUIET, you should use Encode::STOP_AT_PARTIAL flag. This
> is the flag which you want to use. Encode::decode stops decoding when
> valid UTF-8 sequence is not complete and needs more bytes to read. And
> by default invalid UTF-8 sequences are mapped to Unicode replacement
> character.
>
> Btw, PerlIO::encoding uses also Encode::STOP_AT_PARTIAL flag to handle
> this situation.
>
> PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but it is only
> because nobody found time to write documentation for it. It is part of
> Encode API and ready to use...

That would be https://rt.cpan.org/Public/Bug/Display.html?id=67065
(filed 8 years ago, still open).
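For what it's worth, such a check could look roughly like the following. This is only a sketch, assuming the input is supposed to be strict UTF-8, where an incomplete trailing character can be at most 3 bytes long. The name decode_with_guard and the error messages are made up for illustration; they are not part of Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): the FB_QUIET loop from the
# Encode documentation, plus a guard against undecodable input.
sub decode_with_guard {
    my ($fh) = @_;
    my ($buffer, $string) = ('', '');
    while (read($fh, $buffer, 256, length($buffer))) {
        $string .= Encode::decode('UTF-8', $buffer, Encode::FB_QUIET);
        # An incomplete UTF-8 character is at most 3 bytes long, so a longer
        # leftover means decode() stopped at a malformed sequence.
        die "invalid UTF-8 in input\n" if length($buffer) > 3;
    }
    # Anything still pending at end of input can never be completed.
    die "undecodable bytes at end of input\n" if length $buffer;
    return $string;
}

# Usage with an in-memory filehandle; the 2-byte character straddles the
# 256-byte read boundary.
my $octets = ('a' x 255) . "\xC3\xA9";
open(my $fh, '<', \$octets) or die "cannot open in-memory handle: $!";
my $text = decode_with_guard($fh);
close($fh);
```

The guard turns the "buffer never drains" situation into an exception, so the caller decides what to do with bad input instead of accumulating it.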
Re: Output filters, data encoding
On 14.11.2019 01:09, Hua, Yong wrote:
> Hi
>
> on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> > I'm writing a new PerlOutputFilter, stream version.
>
> Can you give a more general introduction for what is "stream version"?
> Thank you.

You should read the pages which I referred to previously; they explain this
better than I could:
1) http://perl.apache.org/docs/2.0/user/handlers/filters.html
2) http://perl.apache.org/docs/2.0/api/Apache2/Filter.html

See in particular here:
http://perl.apache.org/docs/2.0/user/handlers/filters.html#Two_Methods_for_Manipulating_Data
Re: Output filters, data encoding
On Wed, 13 Nov 2019 19:12:10 +0100 André Warnier (tomcat/perl) wrote:
>

I also found that calls to binmode in output filters generate a double
encoding.

Here is a paste of the code of an output filter that adds a menu, some
headers and closing tags to the html pages generated by previous modules;
it reads from STDOUT, not from a file:

https://pastebin.com/trhjfDxX

It uses this:

    # we have reached the end of the content
    if ($f->seen_eos) {

        $content = Encode::decode_utf8( $content ) . '' ;

Never had a problem with it.

I have handlers, not output filters, that read from files and use this:

    open DOCUMENT_CONTENT, "<:encoding(UTF-8)", "$document_content" or die "can't open $document_content : $!\n" ;

--
Best regards, Vincent Veyron
https://compta.libremen.com
Free software for double-entry general accounting
Re: Output filters, data encoding
On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> >
> > and then I need to pass this data to another module for processing
> > (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> >
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> >
> > line 155: binmode($f,'encoding:(UTF-8)');
> >
> > and that triggers an error :
> > Not a GLOB reference at (my filter) line 155.\n
> > )
> >
> > Or do I need to read the data 'as is', and separately do an
> >
> > $decoded_buffer = decode('UTF-8', $buffer);
>
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
>
>     If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>     immediately return the portion of the data that has been processed so
>     far when an error occurs. The data argument is overwritten with
>     everything after that point; that is, the unprocessed portion of the
>     data. This is handy when you have to call "decode" repeatedly in the
>     case where your source data may contain partial multi-byte character
>     sequences, (that is, you are reading with a fixed-width buffer). Here's
>     some sample code to do exactly that:
>
>         my($buffer, $string) = ("", "");
>         while (read($fh, $buffer, 256, length($buffer))) {
>             $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>             # $buffer now contains the unprocessed partial character
>         }

This code is dangerous. It can enter an endless loop. Once you read an
invalid UTF-8 sequence, the above loop never finishes. So if the buffer
input is under user/attacker control, you introduce DoS issues.

Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag. This
is the flag you want here. Encode::decode stops decoding when a valid UTF-8
sequence is not complete and needs more bytes to be read. And by default,
invalid UTF-8 sequences are mapped to the Unicode replacement character.

Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to handle
this situation.

PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but that is only
because nobody has found the time to write documentation for it. It is part
of the Encode API and ready to use...

> Looks exactly like your case.
>
>
> -- Damyan
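To make the suggestion concrete, here is a small sketch of chunk-wise decoding with Encode::STOP_AT_PARTIAL. It assumes an Encode version in which the (undocumented) flag behaves as described in this message: complete characters are decoded, and the bytes of an incomplete trailing character are left in the source buffer. The helper name decode_chunks is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a list of octet chunks whose boundaries may
# fall in the middle of a multi-byte UTF-8 character.
sub decode_chunks {
    my @chunks = @_;
    my ($pending, $text) = ('', '');
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # Decodes all complete characters; because a CHECK value without
        # LEAVE_SRC is passed, $pending is overwritten with what is left,
        # i.e. the bytes of an incomplete trailing character (if any).
        $text .= Encode::decode('UTF-8', $pending, Encode::STOP_AT_PARTIAL);
    }
    return ($text, $pending);
}

# "é" is the two bytes C3 A9; here they arrive in different chunks.
my ($text, $leftover) = decode_chunks("caf\xC3", "\xA9!");
```

A non-empty $leftover after the last chunk would mean the input ended in the middle of a character; what to do then (error, replacement character) is up to the caller.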
Re: Output filters, data encoding
On Wednesday 13 November 2019 20:10:07 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:53, p...@cpan.org wrote:
> > On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > > On 13.11.2019 19:17, p...@cpan.org wrote:
> > > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > > > My question is : can I - and how -, set the filehandle that
> > > > > corresponds to the $f->read(), to a UTF-8 layer ?
> > > > > I have tried
> > > > >
> > > > > line 155: binmode($f,'encoding:(UTF-8)');
> > > >
> > > > Hi André! When specifying PerlIO layer for file handle, you need to
> > > > write colon character before layer name. So correct binmode call is:
> > > >
> > > > binmode($f, ':encoding(UTF-8)');
> > > >
> > > > > and that triggers an error :
> > > > > Not a GLOB reference at (my filter) line 155.\n
> > > > > )
> > >
> > > Thanks. Ooops, that was a typo (also in my filter, not only in the list
> > > message).
> > > But correcting it, does not change the GLOB error message.
> >
> > Ok. What is the $f? It is object or what kind of scalar?
> >
> It is the Apache2::Filter object.
> See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
> Configured in httpd as : PerlOutputFilterHandler MyFilter
> See also : http://perl.apache.org/docs/2.0/user/handlers/filters.html
>
> My (hopeful) thinking was that considering the
> $f->read()
> the Apache2::Filter object may also be a FileHandle, hence the attempt at
> binmode($f,..)
> But that seems to be incorrect.
> (And I don't see any (documented) method of Apache2::Filter that would
> return the underlying FileHandle either)

Sorry, then I do not know :-(
Re: Output filters, data encoding
Hi

on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> I'm writing a new PerlOutputFilter, stream version.

Can you give a more general introduction for what is "stream version"?
Thank you.
Re: Output filters, data encoding
On 13.11.2019 19:37, Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> >
> > and then I need to pass this data to another module for processing
> > (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> >
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> >
> > line 155: binmode($f,'encoding:(UTF-8)');
> >
> > and that triggers an error :
> > Not a GLOB reference at (my filter) line 155.\n
> > )
> >
> > Or do I need to read the data 'as is', and separately do an
> >
> > $decoded_buffer = decode('UTF-8', $buffer);
>
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
>
>     If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>     immediately return the portion of the data that has been processed so
>     far when an error occurs. The data argument is overwritten with
>     everything after that point; that is, the unprocessed portion of the
>     data. This is handy when you have to call "decode" repeatedly in the
>     case where your source data may contain partial multi-byte character
>     sequences, (that is, you are reading with a fixed-width buffer). Here's
>     some sample code to do exactly that:
>
>         my($buffer, $string) = ("", "");
>         while (read($fh, $buffer, 256, length($buffer))) {
>             $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>             # $buffer now contains the unprocessed partial character
>         }
>
> Looks exactly like your case.

Thanks for the response and the tip.

My idea of adding a UTF-8 layer to the filehandle through which
Apache2::Filter reads the incoming data was probably wrong anyway: it cannot
do that, because it gets this data originally in chunks, as "bucket
brigades" from Apache httpd.
And there is no guarantee that such a bucket brigade would always end in
"complete" UTF-8 character sequences. At the very least, this would probably
complicate the code underlying $f->read() quite a bit.
It is clearer to handle that in the filter itself.

The Encode::FB_QUIET flag above, with the incremental buffer read, is really
smart. Unfortunately, the Apache2::Filter read() method does not allow as
many arguments, and all one has is something like this:

    my $accumulated_content = "";
    while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
        $accumulated_content .= $buffer;
    }

Luckily, in this case, I have to accumulate the complete response content
anyway, before I can decide to call Template::Toolkit on it or not. So I can
do a single decode() on $accumulated_content.
Not the most efficient memory-wise, but good enough in this case.
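The "accumulate first, decode once" approach described above can be sketched as follows. So that it runs outside mod_perl, the sketch uses a made-up FakeFilter class that only imitates the $f->read($buffer, $len) calling convention; in a real filter $f would be the Apache2::Filter object, and since a stream filter may be invoked several times per response, the accumulated octets would presumably have to be kept between invocations (for example via $f->ctx) until $f->seen_eos is true. The name slurp_and_decode is also made up.

```perl
use strict;
use warnings;
use Encode ();

use constant BUFF_LEN => 1024;

# Stand-in for Apache2::Filter (an assumption for this sketch, not the real
# class). It hands out one prepared chunk per read() and ignores the
# requested length.
{
    package FakeFilter;
    sub new {
        my ($class, @chunks) = @_;
        return bless { chunks => [@chunks] }, $class;
    }
    sub read {
        my $self  = shift;
        my $chunk = shift @{ $self->{chunks} };
        $_[0] = defined $chunk ? $chunk : '';   # fill the caller's buffer
        return length $_[0];                    # 0 ends the caller's loop
    }
}

# Collect all octets first and decode them in one go, so that a chunk
# boundary can never cut a multi-byte character in half.
sub slurp_and_decode {
    my ($f) = @_;
    my $accumulated_content = '';
    while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
        $accumulated_content .= $buffer;
    }
    return Encode::decode('UTF-8', $accumulated_content);
}

# The two bytes of "é" (C3 A9) arrive in different chunks.
my $f    = FakeFilter->new("caf\xC3", "\xA9 au lait");
my $text = slurp_and_decode($f);
```

The resulting $text is a Perl character string, which is what a module such as Template::Toolkit needs in order not to encode the data a second time.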
Re: Output filters, data encoding
On 13.11.2019 19:53, p...@cpan.org wrote:
> On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > On 13.11.2019 19:17, p...@cpan.org wrote:
> > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > > My question is : can I - and how -, set the filehandle that corresponds to
> > > > the $f->read(), to a UTF-8 layer ?
> > > > I have tried
> > > >
> > > > line 155: binmode($f,'encoding:(UTF-8)');
> > >
> > > Hi André! When specifying PerlIO layer for file handle, you need to
> > > write colon character before layer name. So correct binmode call is:
> > >
> > > binmode($f, ':encoding(UTF-8)');
> > >
> > > > and that triggers an error :
> > > > Not a GLOB reference at (my filter) line 155.\n
> > > > )
> >
> > Thanks. Ooops, that was a typo (also in my filter, not only in the list
> > message).
> > But correcting it, does not change the GLOB error message.
>
> Ok. What is the $f? It is object or what kind of scalar?
>
It is the Apache2::Filter object.
See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
Configured in httpd as : PerlOutputFilterHandler MyFilter
See also : http://perl.apache.org/docs/2.0/user/handlers/filters.html

My (hopeful) thinking was that considering the
$f->read()
the Apache2::Filter object may also be a FileHandle, hence the attempt at
binmode($f,..)
But that seems to be incorrect.
(And I don't see any (documented) method of Apache2::Filter that would
return the underlying FileHandle either)
Re: Output filters, data encoding
On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:17, p...@cpan.org wrote:
> > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > My question is : can I - and how -, set the filehandle that corresponds to
> > > the $f->read(), to a UTF-8 layer ?
> > > I have tried
> > >
> > > line 155: binmode($f,'encoding:(UTF-8)');
> >
> > Hi André! When specifying PerlIO layer for file handle, you need to
> > write colon character before layer name. So correct binmode call is:
> >
> > binmode($f, ':encoding(UTF-8)');
> >
> > > and that triggers an error :
> > > Not a GLOB reference at (my filter) line 155.\n
> > > )
>
> Thanks. Ooops, that was a typo (also in my filter, not only in the list
> message).
> But correcting it, does not change the GLOB error message.

Ok. What is $f? Is it an object, or what kind of scalar?
Re: Output filters, data encoding
On 13.11.2019 19:17, p...@cpan.org wrote:
> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> >
> > line 155: binmode($f,'encoding:(UTF-8)');
>
> Hi André! When specifying PerlIO layer for file handle, you need to
> write colon character before layer name. So correct binmode call is:
>
> binmode($f, ':encoding(UTF-8)');
>
> > and that triggers an error :
> > Not a GLOB reference at (my filter) line 155.\n
> > )

Thanks. Ooops, that was a typo (also in my filter, not only in the list
message).
But correcting it does not change the GLOB error message.
Re: Output filters, data encoding
-=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> ..
>
> and then I need to pass this data to another module for processing
> (Template::Toolkit).
> To make a long story short, Template::Toolkit misinterprets the data I'm
> sending to it, because this data /is/ actually UTF-8, but apparently not
> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> double UTF-8 encoding.
>
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
>
> line 155: binmode($f,'encoding:(UTF-8)');
>
> and that triggers an error :
> Not a GLOB reference at (my filter) line 155.\n
> )
>
> Or do I need to read the data 'as is', and separately do an
>
> $decoded_buffer = decode('UTF-8', $buffer);

There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:

    If CHECK is set to "Encode::FB_QUIET", encoding and decoding
    immediately return the portion of the data that has been processed so
    far when an error occurs. The data argument is overwritten with
    everything after that point; that is, the unprocessed portion of the
    data. This is handy when you have to call "decode" repeatedly in the
    case where your source data may contain partial multi-byte character
    sequences, (that is, you are reading with a fixed-width buffer). Here's
    some sample code to do exactly that:

        my($buffer, $string) = ("", "");
        while (read($fh, $buffer, 256, length($buffer))) {
            $string .= decode($encoding, $buffer, Encode::FB_QUIET);
            # $buffer now contains the unprocessed partial character
        }

Looks exactly like your case.

-- Damyan
Re: Output filters, data encoding
On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
>
> line 155: binmode($f,'encoding:(UTF-8)');

Hi André! When specifying a PerlIO layer for a file handle, you need to
write a colon character before the layer name. So the correct binmode call
is:

    binmode($f, ':encoding(UTF-8)');

> and that triggers an error :
> Not a GLOB reference at (my filter) line 155.\n
> )
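As a small illustration of the layer syntax, the sketch below applies ':encoding(UTF-8)' to a real filehandle (an in-memory one, so that it runs anywhere) and then tries the same call on a plain blessed object. NotAHandle is a made-up class standing in for any object that is not a filehandle; judging from the error quoted above, the Apache2::Filter object appears to fall into that second case, but that is an inference from this thread, not something the sketch tests.

```perl
use strict;
use warnings;

# A real filehandle accepts the layer, provided its name starts with a colon.
my $octets = "caf\xC3\xA9\n";
open(my $fh, '<', \$octets) or die "cannot open in-memory handle: $!";
binmode($fh, ':encoding(UTF-8)') or die "binmode failed: $!";
my $line = <$fh>;    # read through the layer: characters, not octets
close($fh);

# A plain blessed object is not a filehandle, so binmode() dies at run time,
# however the layer name is spelled.
my $obj = bless {}, 'NotAHandle';
my $ok  = eval { binmode($obj, ':encoding(UTF-8)'); 1 };
my $err = $ok ? '' : $@;
```

In other words, fixing the colon is necessary for a real handle, but it cannot help when the first argument is not a handle at all.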