Re: Output filters, data encoding

2019-11-14 Thread Damyan Ivanov
-=| p...@cpan.org, 14.11.2019 09:51:20 +0100 |=-
> On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> >    my($buffer, $string) = ("", "");
> >    while (read($fh, $buffer, 256, length($buffer))) {
> >        $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> >        # $buffer now contains the unprocessed partial character
> >    }
> 
> This code is dangerous: it can enter an endless loop. Once you read an
> invalid UTF-8 sequence, the loop above never finishes. So if the input is
> under user/attacker control, you introduce a DoS issue.

Sure. A check to prevent that would be in order. I must admit that 
I was very happy to find a solution to the problem that was even in 
the official documentation.
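
A minimal sketch of what such a check could look like (an assumption on my part, not something taken from the Encode documentation): an incomplete UTF-8 character is at most 3 bytes long, so if more than 3 bytes are left in the buffer after decode() returns, the leftover cannot be a mere partial character and must be malformed. The helper name decode_chunks_guarded is hypothetical, and the list of chunks stands in for successive read() results.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a sequence of byte chunks with FB_QUIET,
# but refuse to keep accumulating bytes that will never decode.
sub decode_chunks_guarded {
    my (@chunks) = @_;
    my ($buffer, $string) = ('', '');
    for my $chunk (@chunks) {
        $buffer .= $chunk;
        # decode() returns what it could process and leaves the rest in $buffer
        $string .= decode('UTF-8', $buffer, Encode::FB_QUIET());
        # A genuinely incomplete character leaves at most 3 bytes behind
        die "malformed UTF-8 in input\n" if length($buffer) > 3;
    }
    # Anything still pending at end of input never completed
    die "incomplete or malformed UTF-8 at end of input\n" if length $buffer;
    return $string;
}
```

Whether to die, skip a byte, or substitute a replacement character at that point is a design choice; dying is simply the shortest way to show the guard.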

> Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
> With it, Encode::decode stops decoding when an otherwise valid UTF-8
> sequence is incomplete and needs more bytes. And by default, invalid
> UTF-8 sequences are mapped to the Unicode replacement character.
> 
> Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to handle
> this situation.
> 
> PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but that is only
> because nobody has found time to write documentation for it. It is part of
> the Encode API and ready to use...

That would be https://rt.cpan.org/Public/Bug/Display.html?id=67065 
(filed 8 years ago, still open). 


Re: Output filters, data encoding

2019-11-14 Thread tomcat/perl

On 14.11.2019 01:09, Hua, Yong wrote:
> Hi
> 
> on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> > I'm writing a new PerlOutputFilter, stream version.
> 
> Can you give a more general introduction to what the "stream version" is?
> 
> Thank you.

You should read the pages I referred to previously; they explain this better than I 
could:

1) http://perl.apache.org/docs/2.0/user/handlers/filters.html
2) http://perl.apache.org/docs/2.0/api/Apache2/Filter.html

See in particular here :
http://perl.apache.org/docs/2.0/user/handlers/filters.html#Two_Methods_for_Manipulating_Data




Re: Output filters, data encoding

2019-11-14 Thread Vincent Veyron
On Wed, 13 Nov 2019 19:12:10 +0100
André Warnier (tomcat/perl)  wrote:
> 

I also found that calls to binmode in output filters generate a double encoding.

Here is a paste of the code of an output filter that adds a menu, some headers 
and closing tags to the html pages generated by previous modules; it reads from 
STDOUT, not from a file:

https://pastebin.com/trhjfDxX

It uses this :

>    # we have reached the end of the content
>    if ($f->seen_eos) {
>
>        $content = Encode::decode_utf8( $content ) . '' ;


Never had a problem with it.

I have handlers, not output filters, that read from files and use this :

open DOCUMENT_CONTENT, "<:encoding(UTF-8)", "$document_content"
    or die "can't open $document_content : $!\n" ;


-- 

Best regards, Vincent Veyron 

https://compta.libremen.com
Free software for double-entry general accounting


Re: Output filters, data encoding

2019-11-14 Thread pali
On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> > 
> > and then I need to pass this data to another module for processing 
> > (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> > 
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> > 
> > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > and that triggers an error :
> >  Not a GLOB reference at (my filter) line 155.\n
> > )
> > 
> > Or do I need to read the data 'as is', and separately do an
> > 
> >  $decoded_buffer = decode('UTF-8', $buffer);
> 
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
> 
>    If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>    immediately return the portion of the data that has been processed so
>    far when an error occurs. The data argument is overwritten with
>    everything after that point; that is, the unprocessed portion of the
>    data.  This is handy when you have to call "decode" repeatedly in the
>    case where your source data may contain partial multi-byte character
>    sequences, (that is, you are reading with a fixed-width buffer). Here's
>    some sample code to do exactly that:
> 
>    my($buffer, $string) = ("", "");
>    while (read($fh, $buffer, 256, length($buffer))) {
>        $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>        # $buffer now contains the unprocessed partial character
>    }

This code is dangerous: it can enter an endless loop. Once you read an
invalid UTF-8 sequence, the loop above never finishes. So if the input is
under user/attacker control, you introduce a DoS issue.

Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
With it, Encode::decode stops decoding when an otherwise valid UTF-8
sequence is incomplete and needs more bytes. And by default, invalid
UTF-8 sequences are mapped to the Unicode replacement character.

Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to handle
this situation.

PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but that is only
because nobody has found time to write documentation for it. It is part of
the Encode API and ready to use...
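
As a rough sketch of the above (the helper name decode_stream is hypothetical, and the list of chunks stands in for whatever successive reads return), the flag can be used like this. It relies on decode() consuming the decoded bytes from its argument in place when a CHECK value is given, so only an incomplete trailing sequence stays in $pending:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 that arrives in arbitrary byte chunks.
sub decode_stream {
    my (@chunks) = @_;
    my ($pending, $text) = ('', '');
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # Complete characters are decoded and removed from $pending;
        # an incomplete trailing sequence is kept for the next round.
        $text .= Encode::decode('UTF-8', $pending, Encode::STOP_AT_PARTIAL());
    }
    # At end of input nothing more will arrive, so decode the leftover
    # without the flag; a dangling partial sequence is then treated as invalid.
    $text .= Encode::decode('UTF-8', $pending) if length $pending;
    return $text;
}
```

Because invalid sequences are replaced rather than left in the buffer, the pending data should not grow without bound the way it can with FB_QUIET.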

> 
> Looks exactly like your case.
> 
> 
> -- Damyan


Re: Output filters, data encoding

2019-11-14 Thread pali
On Wednesday 13 November 2019 20:10:07 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:53, p...@cpan.org wrote:
> > On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > > On 13.11.2019 19:17, p...@cpan.org wrote:
> > > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) 
> > > > wrote:
> > > > > My question is : can I - and how -, set the filehandle that 
> > > > > corresponds to
> > > > > the $f->read(), to a UTF-8 layer ?
> > > > > I have tried
> > > > > 
> > > > > line 155: binmode($f,'encoding:(UTF-8)');
> > > > 
> > > > Hi André! When specifying a PerlIO layer for a file handle, you need to
> > > > write a colon before the layer name. So the correct binmode call is:
> > > > 
> > > > binmode($f, ':encoding(UTF-8)');
> > > > 
> > > > > and that triggers an error :
> > > > >   Not a GLOB reference at (my filter) line 155.\n
> > > > > )
> > > 
> > > Thanks. Oops, that was a typo (also in my filter, not only in the list 
> > > message).
> > > But correcting it does not change the GLOB error message.
> > 
> > OK. What is $f? Is it an object, or what kind of scalar?
> > 
> It is the Apache2::Filter object.
> See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
> Configured in httpd as :   PerlOutputFilterHandler MyFilter
> See also :  http://perl.apache.org/docs/2.0/user/handlers/filters.html
> 
> My (hopeful) thinking was that considering the
> $f->read()
> the Apache2::Filter object may also be a FileHandle, hence the attempt at
> binmode($f,..)
> But that seems to be incorrect.
> (And I don't see any (documented) method of Apache2::Filter that would
> return the underlying FileHandle either)

Sorry, then I do not know :-(


Re: Output filters, data encoding

2019-11-13 Thread Hua, Yong

Hi

on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> I'm writing a new PerlOutputFilter, stream version.

Can you give a more general introduction to what the "stream version" is?

Thank you.


Re: Output filters, data encoding

2019-11-13 Thread tomcat/perl

On 13.11.2019 19:37, Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> > 
> > and then I need to pass this data to another module for processing 
> > (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> > 
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> > 
> > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > and that triggers an error :
> >   Not a GLOB reference at (my filter) line 155.\n
> > )
> > 
> > Or do I need to read the data 'as is', and separately do an
> > 
> >   $decoded_buffer = decode('UTF-8', $buffer);
> 
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
> 
>    If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>    immediately return the portion of the data that has been processed so
>    far when an error occurs. The data argument is overwritten with
>    everything after that point; that is, the unprocessed portion of the
>    data.  This is handy when you have to call "decode" repeatedly in the
>    case where your source data may contain partial multi-byte character
>    sequences, (that is, you are reading with a fixed-width buffer). Here's
>    some sample code to do exactly that:
> 
>    my($buffer, $string) = ("", "");
>    while (read($fh, $buffer, 256, length($buffer))) {
>        $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>        # $buffer now contains the unprocessed partial character
>    }
> 
> Looks exactly like your case.


Thanks for the response and the tip.

My idea of adding a UTF-8 layer to the filehandle through which Apache2::Filter reads the 
incoming data was probably wrong anyway : it cannot do that, because it gets this data 
originally in chunks, as "bucket brigades" from Apache httpd.  And there is no guarantee 
that such a bucket brigade would always end in "complete" UTF-8 character sequences.

At the very least, this would probably complicate the code underlying 
$f->read() quite a bit.
It is clearer to handle that in the filter itself.

The Encode::FB_QUIET flag above, with the incremental buffer read, is really 
smart.
Unfortunately, the Apache2::Filter read() method does not allow as many arguments, and all 
one has is something like this :


my $accumulated_content = "";
while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
    $accumulated_content .= $buffer;
}

Luckily, in this case, I have to accumulate the complete response content anyway, before I 
can decide to call Template::Toolkit on it or not. So I can do a single decode() on 
$accumulated_content. Not the most efficient memory-wise, but good enough in this case.
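
To illustrate the idea outside of Apache, here is a small sketch. The helper accumulate_and_decode is hypothetical, and its list of chunks merely stands in for the buffers that successive $f->read() calls would deliver. It also shows how encoding the still-undecoded bytes a second time inflates them, which is the double encoding described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: @chunks stands in for the buffers that successive
# $f->read() calls would return inside the filter.
sub accumulate_and_decode {
    my (@chunks) = @_;
    my $accumulated_content = '';
    $accumulated_content .= $_ for @chunks;
    # Decode once, after everything is collected, so a character split
    # across two chunks is never seen half-finished.
    return decode('UTF-8', $accumulated_content);
}

my $bytes = encode('UTF-8', "caf\x{e9}");    # "caf" plus the bytes C3 A9
my $text  = accumulate_and_decode(substr($bytes, 0, 4), substr($bytes, 4));
printf "%d bytes in, %d characters out\n", length($bytes), length($text);

# Encoding the undecoded bytes again treats each byte as a character.
printf "double-encoded length: %d bytes\n", length(encode('UTF-8', $bytes));
```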





Re: Output filters, data encoding

2019-11-13 Thread tomcat/perl

On 13.11.2019 19:53, p...@cpan.org wrote:
> On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > On 13.11.2019 19:17, p...@cpan.org wrote:
> > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > > My question is : can I - and how -, set the filehandle that corresponds to
> > > > the $f->read(), to a UTF-8 layer ?
> > > > I have tried
> > > > 
> > > > line 155: binmode($f,'encoding:(UTF-8)');
> > > 
> > > Hi André! When specifying a PerlIO layer for a file handle, you need to
> > > write a colon before the layer name. So the correct binmode call is:
> > > 
> > > binmode($f, ':encoding(UTF-8)');
> > > 
> > > > and that triggers an error :
> > > >    Not a GLOB reference at (my filter) line 155.\n
> > > > )
> > 
> > Thanks. Oops, that was a typo (also in my filter, not only in the list 
> > message).
> > But correcting it does not change the GLOB error message.
> 
> OK. What is $f? Is it an object, or what kind of scalar?


It is the Apache2::Filter object.
See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
Configured in httpd as :   PerlOutputFilterHandler MyFilter
See also :  http://perl.apache.org/docs/2.0/user/handlers/filters.html

My (hopeful) thinking was that considering the
$f->read()
the Apache2::Filter object may also be a FileHandle, hence the attempt at
binmode($f,..)
But that seems to be incorrect.
(And I don't see any (documented) method of Apache2::Filter that would return the 
underlying FileHandle either)
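
One possible workaround, sketched under the assumption that the complete response has already been collected as a byte string: binmode() needs a real filehandle, but Perl can open an in-memory handle on a scalar and apply the :encoding(UTF-8) layer to that handle. The helper name slurp_utf8 is hypothetical and nothing here uses the Apache2::Filter API.

```perl
use strict;
use warnings;

# Hypothetical helper: read a byte string back through an in-memory
# filehandle that carries a UTF-8 decoding layer.
sub slurp_utf8 {
    my ($bytes) = @_;
    open(my $fh, '<:encoding(UTF-8)', \$bytes)
        or die "cannot open in-memory handle: $!";
    local $/;              # slurp mode: read everything in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

This gives the same result as a single decode() call on the accumulated bytes; it is mainly of interest when the consumer insists on a filehandle.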




Re: Output filters, data encoding

2019-11-13 Thread pali
On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:17, p...@cpan.org wrote:
> > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > My question is : can I - and how -, set the filehandle that corresponds to
> > > the $f->read(), to a UTF-8 layer ?
> > > I have tried
> > > 
> > > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > Hi André! When specifying a PerlIO layer for a file handle, you need to
> > write a colon before the layer name. So the correct binmode call is:
> > 
> >    binmode($f, ':encoding(UTF-8)');
> > 
> > > and that triggers an error :
> > >   Not a GLOB reference at (my filter) line 155.\n
> > > )
> 
> Thanks. Oops, that was a typo (also in my filter, not only in the list 
> message).
> But correcting it does not change the GLOB error message.

OK. What is $f? Is it an object, or what kind of scalar?


Re: Output filters, data encoding

2019-11-13 Thread tomcat/perl

On 13.11.2019 19:17, p...@cpan.org wrote:
> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> > 
> > line 155: binmode($f,'encoding:(UTF-8)');
> 
> Hi André! When specifying a PerlIO layer for a file handle, you need to
> write a colon before the layer name. So the correct binmode call is:
> 
>    binmode($f, ':encoding(UTF-8)');
> 
> > and that triggers an error :
> >   Not a GLOB reference at (my filter) line 155.\n
> > )

Thanks. Oops, that was a typo (also in my filter, not only in the list 
message).
But correcting it does not change the GLOB error message.



Re: Output filters, data encoding

2019-11-13 Thread Damyan Ivanov
-=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
>   while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> ..
> 
> and then I need to pass this data to another module for processing 
> (Template::Toolkit).
> To make a long story short, Template::Toolkit misinterprets the data I'm
> sending to it, because this data /is/ actually UTF-8, but apparently not
> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> double UTF-8 encoding.
> 
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
> 
> line 155: binmode($f,'encoding:(UTF-8)');
> 
> and that triggers an error :
>  Not a GLOB reference at (my filter) line 155.\n
> )
> 
> Or do I need to read the data 'as is', and separately do an
> 
>  $decoded_buffer = decode('UTF-8', $buffer);

There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:

   If CHECK is set to "Encode::FB_QUIET", encoding and decoding
   immediately return the portion of the data that has been processed so
   far when an error occurs. The data argument is overwritten with
   everything after that point; that is, the unprocessed portion of the
   data.  This is handy when you have to call "decode" repeatedly in the
   case where your source data may contain partial multi-byte character
   sequences, (that is, you are reading with a fixed-width buffer). Here's
   some sample code to do exactly that:

   my($buffer, $string) = ("", "");
   while (read($fh, $buffer, 256, length($buffer))) {
       $string .= decode($encoding, $buffer, Encode::FB_QUIET);
       # $buffer now contains the unprocessed partial character
   }

Looks exactly like your case.


-- Damyan


Re: Output filters, data encoding

2019-11-13 Thread pali
On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
> 
> line 155: binmode($f,'encoding:(UTF-8)');

Hi André! When specifying a PerlIO layer for a file handle, you need to
write a colon before the layer name. So the correct binmode call is:

  binmode($f, ':encoding(UTF-8)');

> and that triggers an error :
>  Not a GLOB reference at (my filter) line 155.\n
> )