
Re: mod_perl and utf8 and CGI->param

2017-06-02 Thread Randal L. Schwartz

> "Peng" == Peng Yonghua  writes:

Peng> And, can I override any method from a class via this way? is this a general
Peng> trick? thanks.

Yes, and your downstream will hate you for it.  The Ruby people do this
all the time, and it makes their code brittle.  I did this in my app,
and would never think of putting that into the core CGI::Prototype where
this gets used, even though it would solve the problem for everyone.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
 
Perl/Unix consulting, Technical writing, Comedy, etc. etc.
Still trying to think of something clever for the fourth line of this .sig

Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Peng Yonghua

And can I override any method of a class this way? Is this a
general trick? Thanks.



Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Peng Yonghua


good patch. thanks for sharing.


Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Randal L. Schwartz


> "Randal" == Randal L Schwartz  writes:
Randal> Getting really frustrated with mod_perl2's apparent inability to
Randal> properly read UTF8 input.

Randal> Here's my mod_perl2 setup:

Randal>   Apache 2.2.[something]
Randal>   mod_perl 2.0.7 (or nearly that)
Randal>   ModPerl::Registry
Randal>   Perl "script" with CGI.pm

Randal> Very early in my app:

Randal>   ## ensure utf8 CGI params:
Randal>   $CGI::PARAM_UTF8 = 1;

Randal>   binmode STDIN, ":utf8";
Randal>   binmode STDOUT, ":utf8";
Randal>   binmode STDERR, ":utf8";

Randal> This works fine in CGI mode: when I ask for $foo = $cgi->param('foo'),
Randal> DBI::data_string_desc($foo) shows a UTF8 string with the proper
Randal> discrepancy between bytes and chars.

Randal> But when I try to run it under mod_perl, the returned string appears
Randal> to be the raw ascii bytes, and definitely not utf8.  Of course, when I
Randal> store that in the database (using DBD::Pg), the "latin-1" is encoded
Randal> to "utf-8", and I get a bunch of weird chars on the output.

Randal> Has anyone managed to round-trip UTF8 from form to database and back
Randal> using a setup similar to this?

Randal> I suspect part of the problem is this in CGI.pm:

Randal> 'read_from_client' => <<'END_OF_FUNC',
Randal> # Read data from a file handle
Randal> sub read_from_client {
Randal> my($self, $buff, $len, $offset) = @_;
Randal> local $^W=0;# prevent a warning
Randal> return $MOD_PERL
Randal> ? $self->r->read($$buff, $len, $offset)
Randal> : read(\*STDIN, $$buff, $len, $offset);
Randal> }
Randal> END_OF_FUNC

Randal> Since I binmode STDIN, the non-$MOD_PERL works ok here.  What's the
Randal> equivalent of $r->read() that marks the incoming stream as UTF8, so I
Randal> get chars instead of bytes?  Or can I just read(\*STDIN) in mod_perl2
Randal> as well? (I know that was supported at one point...)

I realized that I never posted my ultimate solution.  I monkey patch
CGI.pm:

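require CGI;
{
  my $orig = \&CGI::param;   # keep a reference to the original CGI::param
  no warnings 'redefine';
  *CGI::param = sub {
    $CGI::LIST_CONTEXT_WARN = 0; # workaround for backward compatibility
    $CGI::PARAM_UTF8 = 1;        # re-assert on every call
    goto &$orig;                 # hand over to the original with the same @_
  };
}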

And this has been working just fine for both CGI and mod_perl.  Just for the
record.


Re: mod_perl and utf8 and CGI->param

2014-09-14 Thread Joe Schaefer

apreq validates anything it presents as utf8; if that fails, it marks it as
ISO-8859-1 or some Windows encoding I don't remember the name of.



On Monday, September 8, 2014 3:17 PM, André Warnier a...@ice-sa.com wrote:
 


Hi.
Just an addendum to the discussion :

There are really two distinct approaches to this issue, and they work at different levels :

1) is to fix CGI.pm so that it delivers the parameters in the way which you expect.
As shown by the previous valuable and technical contributions, this generally works, but
it requires a certain level of expertise; and it does not necessarily work backwards with
all versions of mod_perl and CGI.pm.

2) is to take whatever CGI.pm does deliver to the calling script or module, and use a
couple of tricks and some additional code in ditto script or module, to ensure that
whatever CGI.pm delivers under whatever mod_perl version, the receiving script or module
always knows in the end what it is dealing with.
That is the method which I presented early in the discussion.
As stated in that contribution, it is not necessarily the most elegant or efficient way to
deal with the issue, but it has the advantage of working always, no matter which version
of CGI.pm and/or mod_perl are in use.

The real crux of the matter is this, in my view : as things stand today in terms of
protocol and RFCs, there is no real way for CGI.pm (or any comparable framework) to be
*sure* of the encoding of the data sent by a browser or another HTTP client agent.  Even
the RFCs do not really provide a way by which this can be enforced. (*)

So if you are sure of what the client is sending, and the matter consists of *forcing*
CGI.pm to always communicate POST (or GET) data as UTF-8 encoded and utf8-marked (or the
opposite) to the calling script/module, then method 1 will work, and it is more elegant
and probably more efficient than method 2.

But if the matter consists of ensuring that the receiving code in the script/module which
handles the data submitted by the HTTP client, is resilient and does the right thing
whatever the submitted data really was, then in my opinion method 2 is better.
(But that's only my opinion of the moment, and I stand ready to be corrected).

(*) and if you believe this not to be true, please send me some references about it,
because I am really interested. It might save me some code in all my web-facing
applications.

Re: mod_perl and utf8 and CGI->param

2014-09-08 Thread Michael Schout

On 9/2/14, 4:19 PM, Randal L. Schwartz wrote:

   ## ensure utf8 CGI params:
   $CGI::PARAM_UTF8 = 1;

Sorry to chime in late on this, but part of the problem with CGI.pm and
UTF-8 is that PARAM_UTF8 gets clobbered by a cleanup handler that CGI.pm
itself registers if it's running under mod_perl.

This caused major headaches for me at one time until I figured this out.

You have to make sure to set $CGI::PARAM_UTF8 early, and FOR EVERY
REQUEST, because if you just set it globally (e.g.: in a startup perl
script), then it only works for the first request.
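
A toy model of the behaviour described above may help show why a startup-only setting stops working after the first request. Everything in it is hypothetical: `FakeCGI`, `reset_globals` and `handle_request` are stand-ins invented for illustration, not CGI.pm or mod_perl APIs, and the real cleanup handler may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module with a package global that a
# per-request cleanup step resets (this is not the real CGI.pm).
package FakeCGI;
our $PARAM_UTF8 = 0;
sub reset_globals { $PARAM_UTF8 = 0 }    # plays the role of the cleanup handler

package main;

# Simulate one request; $set_flag says whether the request code sets the
# flag itself. Returns the flag value the request actually saw.
sub handle_request {
    my ($set_flag) = @_;
    $FakeCGI::PARAM_UTF8 = 1 if $set_flag;
    my $seen = $FakeCGI::PARAM_UTF8;
    FakeCGI::reset_globals();             # cleanup runs after every request
    return $seen;
}

# Set once at "startup": only the first request sees the flag.
$FakeCGI::PARAM_UTF8 = 1;
my @startup_only = map { handle_request(0) } 1 .. 3;

# Set at the top of every request: all requests see the flag.
my @per_request = map { handle_request(1) } 1 .. 3;

print "startup only: @startup_only\n";
print "per request:  @per_request\n";
```

In this model the startup-only run prints `1 0 0` and the per-request run prints `1 1 1`, which is the pattern reported above.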

Regards,
Michael Schout

Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Torsten Förtsch

On 03/09/14 21:38, Randal L. Schwartz wrote:
 What I need to know is what is mod_perl doing differently?  Does it not
 respect binmode STDIN, ":utf8"?  Apparently not.  So if you know of a
 way to get mod_perl to fix reading from the browser properly, I'm
 interested in that.

Something along these lines:

use Apache2::RequestIO ();
use Encode ();
BEGIN {
    my $orig=\&Apache2::RequestRec::read;
    *Apache2::RequestRec::read=sub {
        my ($r, $buf, $len, $offset)=@_;
        my $_buf;
        my $rc=$r->$orig($_buf, $len);
        substr($buf, $offset, undef, Encode::decode_utf8 $_buf);
        return $rc;
    };
}

It's a bit more complicated than that because $_buf may end in the
middle of a character. But you can catch that and read a few more bytes.
Also, not sure if you expect the return value to be in octets or characters.
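
The "buffer may end in the middle of a character" case can be sketched with Encode alone. This is a minimal illustration, assuming the input arrives as a list of byte chunks; `decode_chunks` is a hypothetical helper name, and it relies on the documented behaviour of `Encode::FB_QUIET`, which stops at the first sequence it cannot decode and leaves the unprocessed bytes in the source variable.

```perl
use strict;
use warnings;
use Encode ();

# Decode a stream of byte chunks as UTF-8, carrying any incomplete
# trailing sequence over to the next chunk.
sub decode_chunks {
    my @chunks  = @_;
    my $pending = '';    # undecoded bytes left over from the previous chunk
    my $text    = '';
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # With FB_QUIET, decode() returns what it could decode and
        # leaves the unprocessed remainder in $pending.
        $text .= Encode::decode('UTF-8', $pending, Encode::FB_QUIET);
    }
    return ($text, $pending);
}

# "h\xC3" ends in the middle of the two-byte sequence for U+00E9.
my ($text, $rest) = decode_chunks("h\xC3", "\xA9llo");
printf "%d characters, %d bytes left over\n", length($text), length($rest);
```

Note that genuinely invalid bytes would also stay in the pending buffer, so real code would need to decide what to do with leftovers once the input ends.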

Though, I wouldn't go this way. I'd either try to force CGI.pm to read
from STDIN and use the perl-script handler
(http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
of that you can push :utf8 then.

The other way I'd prefer over the hack above is to patch CGI.pm to
convert the data after it has read it. You can even do that in your
application. Many applications I have seen have a separate step to
sanitize the input. That would be the place to do that. However, then
you have to watch out for upload fields.
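
Converting the data after it has been read, in the application's own sanitizing step, might look roughly like this. It is a sketch under stated assumptions: `decode_param` is a hypothetical helper, and the ISO-8859-1 fallback is only an assumed choice for bytes that are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Turn raw parameter bytes into a character string. Tries strict UTF-8
# first and falls back to ISO-8859-1 (an assumed fallback; use whatever
# legacy encoding your clients really send).
sub decode_param {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;
    my $copy  = $bytes;    # decode() with a CHECK value may modify its input
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $chars if defined $chars;
    return Encode::decode('ISO-8859-1', $bytes);
}

my $from_utf8   = decode_param("caf\xC3\xA9");    # valid UTF-8 bytes
my $from_latin1 = decode_param("caf\xE9");        # not valid UTF-8
print length($from_utf8), " ", length($from_latin1), "\n";
```

Upload fields carry binary data and should be left out of a step like this, as noted above.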

So, there is no really simple solution. And I don't think this will be
fixed in modperl because $r has no such concept as an IO layer. The
closest thing httpd/modperl has to offer is an input filter. But that
won't help you here because brigades are handled mainly by httpd which
knows only about octets. You don't want to change the data itself. You
want to change the data's metadata.

Torsten

Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Randal L. Schwartz

> "Torsten" == Torsten Förtsch torsten.foert...@gmx.net writes:

Torsten> Though, I wouldn't go this way. I'd either try to force CGI.pm to read
Torsten> from STDIN and use the perl-script handler
Torsten> (http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
Torsten> pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
Torsten> of that you can push :utf8 then.

Yeah, just coded that.  In a BEGIN block in my app, I monkey-patched
read_from_client:

BEGIN {
  ## monkey-patch CGI.pm so we can get proper utf8 handling
  require CGI;
  CGI::_compile_all(qw(
    read_from_client
  ));
  # warn "defined CGI::read_from_client is ", 0 + defined &CGI::read_from_client;

  ## moose 'around' would be nice here. :)
  my $read_from_client = \&CGI::read_from_client;
  no warnings 'redefine';
  *CGI::read_from_client = sub {
    local $CGI::MOD_PERL = $CGI::MOD_PERL;
    warn "prior MOD_PERL is $CGI::MOD_PERL";
    if (our $USE_STDIN_FOR_MOD_PERL) {
      $CGI::MOD_PERL = 0;
    }
    warn "after MOD_PERL is $CGI::MOD_PERL";
    goto $read_from_client;
  }
}

And in my toplevel, I now do this:

sub activate {
  my $self = shift;

  require Carp;
  local $SIG{__DIE__} = \&Carp::confess;

  ## ensure utf8 CGI params:
  local $CGI::PARAM_UTF8 = 1;
  ## and disable mod_perl handling during read_from_client
  local our $USE_STDIN_FOR_MOD_PERL = 1;

  binmode STDIN, ":utf8";
  binmode STDOUT, ":utf8";
  binmode STDERR, ":utf8";

  return $self->SUPER::activate(@_);
}

(This is my CGI::Prototype-based code, from the CPAN...)

I'm properly getting the $CGI::MOD_PERL set to 0, which forces
read from STDIN (via $r) instead of the native STDIN.  In theory.  In
practice, even though I've done a binmode STDIN, I'm still getting raw
bytes from read(\*STDIN...), not utf8-tagged strings.

Not sure what to do next.  Still frustrated.

Why can't the world just use ASCII? :)

(I even tried binmode STDIN, ":encoding(utf8)" just now as well.)
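
One thing that may be worth checking in the read_from_client patch above, offered as a possibility rather than a diagnosis: perlfunc documents that `goto &NAME` (and `goto EXPR` when EXPR is a code reference) leaves the current subroutine first, undoing any `local` changes made in it. If that applies here, the original read_from_client would see the old `$CGI::MOD_PERL` again even though the `warn` before the `goto` prints 0. The sketch below uses hypothetical names (`$FLAG`, `target`, `wrapper_goto`, `wrapper_call`) to show the difference.

```perl
use strict;
use warnings;

our $FLAG = 1;                 # stands in for a package global such as a mode flag

sub target { return $FLAG }    # the "original" sub being wrapped

# Wrapper that hands over with goto: the local() is undone before target runs.
sub wrapper_goto {
    local $FLAG = 0;
    goto &target;
}

# Wrapper that makes an ordinary call: target runs while local() is in effect.
sub wrapper_call {
    local $FLAG = 0;
    return target(@_);
}

print "via goto: ", wrapper_goto(), "\n";
print "via call: ", wrapper_call(), "\n";
```

Here the goto version reports 1 and the plain call reports 0, so a wrapper that needs `local` to stay in effect has to call the original rather than `goto` it.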


Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Randal L. Schwartz

> "Randal" == Randal L Schwartz mer...@stonehenge.com writes:

Randal> Yeah, just coded that.  In a BEGIN block in my app, I monkey-patched
Randal> read_from_client:

And then I've also tried to monkey-patch ->read just as you said.

On the first read, an empty string is apparently returned, which fails
something higher in CGI.pm.  Ugh.

Update:

This monkey patch works:

  *Apache2::RequestRec::read = sub {
    warn "READ CALLED";
    goto $orig;
  }

Although it doesn't do any decoding.  When I replace the body of that
with your code, I'm getting these zero-byte reads.  Even this fails:

    my ($r, $buff, $len, $offset)=@_;
    # my $_buff;
    # my $rc = $r->$orig($_buff, $len);
    my $rc = $r->$orig($buff, $len, $offset);
    # warn "BEFORE: ", DBI::data_string_desc($_buff);
    # utf8::decode($_buff);
    # warn "AFTER: ", DBI::data_string_desc($_buff);
    # substr($buff, $offset, undef, $_buff);
    # warn "AFTER: ", DBI::data_string_desc($buff);
    return $rc;

which should be the same as your code, just without the utf8 decoding.
Still getting 0 bytes though.
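
A possible explanation for the empty buffer, not confirmed anywhere in this thread: `my ($r, $buff, ...) = @_` copies the arguments, so a read-style method that fills its buffer argument in place fills the copy and the caller's buffer stays untouched. The sketch below demonstrates the effect with a hypothetical `fake_read` standing in for the real method; it makes no claim about how Apache2::RequestRec::read is implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a read-style method: fills the caller's
# buffer through $_[1] and returns the number of bytes "read".
sub fake_read {
    my $data = "abc";
    $_[1] = $data;
    return length $data;
}
my $orig = \&fake_read;

# Copies @_ into lexicals, so the original fills the copy, not the caller's buffer.
sub wrap_copy {
    my ($r, $buff, $len) = @_;
    return $r->$orig($buff, $len);
}

# Passes $_[1] straight through, so the alias to the caller's buffer survives.
sub wrap_alias {
    my ($r, undef, $len) = @_;
    return $r->$orig($_[1], $len);
}

my $r = bless {}, 'FakeRequest';
my ($copy_buf, $alias_buf) = ('', '');
my $rc_copy  = wrap_copy($r, $copy_buf, 3);
my $rc_alias = wrap_alias($r, $alias_buf, 3);
printf "copy:  rc=%d buffer='%s'\n", $rc_copy,  $copy_buf;
printf "alias: rc=%d buffer='%s'\n", $rc_alias, $alias_buf;
```

Both wrappers return 3, but only the aliasing one leaves data in the caller's buffer, which would look like a zero-byte read to code further up.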


Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread André Warnier

Hi Randal.


I share your frustration, as I have been dealing for a long time with multi-lingual web
applications, using perl and mod_perl.

First a very top-level comment : the basic problem here is the incompleteness of the HTTP
RFCs, and the lack of proper support for international character sets, even still today.
When a browser is POST-ing the contents of the input elements of a form to a server,
there is a set of rules which, in principle, determine the character set in which
this content is encoded. The problem is that these rules are arcane, often
confusing, and in addition regularly flouted by different browser makes and versions (not
to even talk about umpteen non-browser proprietary HTTP client things).

For example, when a browser sends the content of a form in the multipart/form-data
enctype, the content of each form parameter is sent as a separate section, in a form
similar to the parts in a multi-part RFC-822 email. In theory, each of these parts should
have its own content-type header, and if it is text, it should also contain a charset
attribute indicating the corresponding data's encoding.
(and if it doesn't, by virtue of the HTTP RFCs, it should be ISO-8859-1, which is still
the default HTTP character set today; quite ridiculous, but so it is).

But the sad reality is that browsers don't do that, and so in practice in many cases
the server-side application is reduced to guessing.

By experience more than by definite code knowledge, I have to suppose that this kind of
confusion sometimes also hits developers of modules such as CGI.pm and mod_perl, so that
over the years, things have tended to vary from one version to another (versions of
browsers, versions of perl, versions of mod_perl, versions of CGI.pm). Maybe also because
of all the reasons above, there is just no right way of handling this, so CGI.pm always
returns bytes (and libapreq2 may do things differently).

In the end, rather than trying to follow the latest developments all the time and
continuously patch my programs because of all this, I have resorted to some defensive
programming techniques in terms of interpreting form-posted data, which have been
working fine for me for the last few years. It may well be that they are a total
overkill, but in practice they have saved me a lot of time not spent wondering why the
data in some application suddenly started to show up as A tilde followed by some bizarre
graphic sign (or, at the opposite, as a question mark embedded in a lozenge).

(Even logging this stuff and trying to figure out what is going on is a pain, because you
have to figure out first in what encoding you are logging, and second in what encoding you
are viewing your logs).
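
The two symptoms mentioned above can be reproduced with a few lines of Encode, which may help when staring at a log. This is only an illustration; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode ();

my $utf8_bytes   = "\xC3\x9C";    # U+00DC (U with diaeresis) encoded as UTF-8
my $latin1_bytes = "\xDC";        # the same character encoded as ISO-8859-1

# UTF-8 bytes misread as ISO-8859-1: two characters, the first is A tilde.
my $misread_as_latin1 = Encode::decode('ISO-8859-1', $utf8_bytes);

# ISO-8859-1 byte misread as UTF-8: decode() substitutes U+FFFD, which
# many fonts draw as a question mark inside a lozenge.
my $misread_as_utf8 = Encode::decode('UTF-8', $latin1_bytes);

printf "U+%04X U+%04X\n", map { ord } split //, $misread_as_latin1;
printf "U+%04X\n", ord $misread_as_utf8;
```

The first line of output is `U+00C3 U+009C` (A tilde plus a control character) and the second is `U+FFFD`.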

The methodology I follow is as follows :

1) all html form pages of the applications should have a tag like :
<meta content-type="text/html; charset=...">
2) all forms in the page should have the attributes
enctype="multipart/form-data"
accept-charset="..." (the same as above)

The above 2 things do not really guarantee anything, but at least they establish some
baseline which helps in interpreting the rest (and slapping users when they change their
browser settings).

3) all forms contain a hidden text input like
<input type="hidden" name="my-UTF8-check" value="AÜÖ.."> (some known sequence of
diacritics characters guaranteed to have a different byte length between ISO-8859-x and
UTF-8 encoding)
[...]
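
The hidden-field check in point 3 could be used on the server side roughly as follows. This is a sketch, not the code the author of the method actually runs: `detect_encoding` and `decode_params` are hypothetical helpers, the parameters are assumed to arrive as a hash of raw byte strings, and only UTF-8 and ISO-8859-1 are considered.

```perl
use strict;
use warnings;
use Encode ();

# The known value placed in the hidden field, as a character string.
my $SENTINEL = "A\x{DC}\x{D6}";    # A, U with diaeresis, O with diaeresis

# Guess the request encoding from the raw bytes received for the hidden
# field. Returns an Encode name, or undef if neither candidate matches.
sub detect_encoding {
    my ($sentinel_bytes) = @_;
    for my $name ('UTF-8', 'ISO-8859-1') {
        return $name if $sentinel_bytes eq Encode::encode($name, $SENTINEL);
    }
    return undef;
}

# Decode every value in a hash of raw parameters using the detected encoding.
sub decode_params {
    my ($raw, $sentinel_field) = @_;
    my $sentinel_bytes = $raw->{$sentinel_field};
    $sentinel_bytes = '' unless defined $sentinel_bytes;
    my $encoding = detect_encoding($sentinel_bytes)
        or die "cannot determine request encoding\n";
    my %decoded;
    for my $key (keys %$raw) {
        $decoded{$key} = Encode::decode($encoding, $raw->{$key});
    }
    return \%decoded;
}

my $params = decode_params(
    { 'my-UTF8-check' => "A\xC3\x9C\xC3\x96", name => "Andr\xC3\xA9" },
    'my-UTF8-check',
);
print length($params->{name}), "\n";
```

The sentinel is 5 bytes in UTF-8 and 3 bytes in ISO-8859-1, so the two cases cannot be confused; anything else is rejected rather than guessed.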

Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Cosimo Streppone

On 09/03/2014 11:17 AM, André Warnier wrote:

 3) all forms contain a hidden text input like
 <input type="hidden" name="my-UTF8-check" value="AÜÖ..">  (some known
 sequence of diacritics characters guaranteed to have a different byte
 length between ISO-8859-x and UTF-8 encoding)
 [...]
 But it's helped me sleep better for quite a while now.

This is brilliant :-)
Thanks André.

-- 
Cosimo

Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Randal L. Schwartz

> "André" == André Warnier a...@ice-sa.com writes:

André> The methodology I follow is as follows :

André> 1) all html form pages of the applications should have a tag like :
André> <meta content-type="text/html; charset=...">
André> 2) all forms in the page should have the attributes
André> enctype="multipart/form-data"
André> accept-charset="..." (the same as above)

I've pretty much got success with CGI (and CGI.pm) doing the things I
listed above.  So this isn't needed.  I'm not having problems with the
browser, Apache, or Perl, or RDBO, or PostgreSQL.  (Even that took a bit
of work to get working, and so I think none of those are the issue.)

What I need to know is what is mod_perl doing differently?  Does it not
respect binmode STDIN, ":utf8"?  Apparently not.  So if you know of a
way to get mod_perl to fix reading from the browser properly, I'm
interested in that.


Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Dr James A Smith

I encode a pound sign as a parameter, which indicates whether the
content is UTF-8, UCS or latin-1, and this seems to resolve most of the
issues... I did take a lot of effort to fix issues with utf8 and there
are a lot of these: between form -> post; between requests if storing
data in sessions; between script and database; etc...


I do however not use CGI.pm but use APR instead, which I know works (and
may be less error prone)


James

