On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> >     while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> > 
> > and then I need to pass this data to another module for processing 
> > (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> > 
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> > 
> > line 155: binmode($f,'encoding:(UTF-8)');
> > 
> > and that triggers an error :
> >  Not a GLOB reference at (my filter) line 155.\n
> > )
> > 
> > Or do I need to read the data 'as is', and separately do an
> > 
> >  $decoded_buffer = decode('UTF-8', $buffer);
> 
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
> 
>        If CHECK is set to "Encode::FB_QUIET", encoding and decoding
>        immediately return the portion of the data that has been processed so
>        far when an error occurs. The data argument is overwritten with
>        everything after that point; that is, the unprocessed portion of the
>        data.  This is handy when you have to call "decode" repeatedly in the
>        case where your source data may contain partial multi-byte character
>        sequences, (that is, you are reading with a fixed-width buffer). Here's
>        some sample code to do exactly that:
> 
>            my($buffer, $string) = ("", "");
>            while (read($fh, $buffer, 256, length($buffer))) {
>                $string .= decode($encoding, $buffer, Encode::FB_QUIET);
>                # $buffer now contains the unprocessed partial character
>            }

This code is dangerous. It can enter an endless loop: once you read an
invalid UTF-8 sequence, decode() stops at it on every call, so the loop
above never gets past it. So if the input is under user/attacker
control, you introduce a DoS issue.
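
To illustrate the FB_QUIET problem, here is a minimal sketch. The input
bytes are made up for the example: "ab", then a lone continuation byte
(which is invalid UTF-8), then "cd".

```perl
use strict;
use warnings;
use Encode ();

# Made-up input: valid "ab", an invalid lone continuation byte, then "cd".
my $buffer = "ab\x80cd";

# First call: decodes up to the invalid byte and leaves the rest in $buffer.
my $first = Encode::decode('UTF-8', $buffer, Encode::FB_QUIET);

# Second call: the invalid byte is still at the front of $buffer,
# so nothing is decoded and nothing is consumed.
my $second = Encode::decode('UTF-8', $buffer, Encode::FB_QUIET);

printf "decoded '%s' then '%s', %d bytes still stuck in the buffer\n",
    $first, $second, length $buffer;
```

Every further iteration of the read loop would behave like the second
call, only with a longer buffer.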

Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
With it, Encode::decode stops only when a valid UTF-8 sequence at the
end of the buffer is incomplete and needs more bytes to be read. Invalid
UTF-8 sequences are, by default, mapped to the Unicode replacement
character (U+FFFD).
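
A minimal sketch of the same read loop with Encode::STOP_AT_PARTIAL.
The name decode_stream and its chunk-size parameter are made up for
illustration, and decoding the leftover bytes at EOF without any flag
(so that a truncated sequence becomes U+FFFD) is my assumption of what
you would want there:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: read $fh in fixed-size chunks and return the
# decoded character string.
sub decode_stream {
    my ($fh, $chunk_size) = @_;
    my ($buffer, $string) = ('', '');

    # Append each chunk after whatever partial character is left over.
    while (read($fh, $buffer, $chunk_size, length($buffer))) {
        # Invalid bytes are replaced by U+FFFD and consumed; only an
        # incomplete trailing sequence is left in $buffer.
        $string .= Encode::decode('UTF-8', $buffer, Encode::STOP_AT_PARTIAL);
    }

    # Assumption: a sequence still incomplete at EOF is malformed input,
    # so decode it with the default behaviour (replacement character).
    $string .= Encode::decode('UTF-8', $buffer) if length $buffer;

    return $string;
}
```

Usage would be something like decode_stream($fh, 256) on a handle opened
in raw mode.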

Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to
handle this situation.

PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but that is
only because nobody has found the time to write documentation for it.
It is part of the Encode API and ready to use...

> 
> Looks exactly like your case.
> 
> 
> -- Damyan
