Re: mod_perl and utf8 and CGI->param

2017-06-02 Thread Randal L. Schwartz
> "Peng" == Peng Yonghua  writes:

Peng> And, can I override any method from a class via this way? is this a
Peng> general trick? thanks.

Yes, and your downstream will hate you for it.  The ruby people do this
all the time, and it makes their code brittle.  I did this in my app,
and would never think of putting that into the core CGI::Prototype where
this gets used, even though it would solve the problem for everyone.
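
For the record, a minimal sketch of the general trick being asked about: save a reference to the original method, then assign a wrapper to the same glob. The Greeter class and its greet method are hypothetical names used only for illustration; nothing here comes from CGI.pm.

```perl
use strict;
use warnings;

# Hypothetical class, standing in for any third-party module.
package Greeter;
sub new   { return bless {}, shift }
sub greet { my ($self, $name) = @_; return "hello $name" }

package main;

{
    # Keep a reference to the original method before replacing it.
    my $orig = \&Greeter::greet;
    no warnings 'redefine';
    # Install a wrapper under the same name; every caller now gets it.
    *Greeter::greet = sub {
        my ($self, $name) = @_;
        return uc $orig->($self, $name);
    };
}

my $greeting = Greeter->new->greet('world');
print "$greeting\n";   # prints "HELLO WORLD"
```

The wrapper is global: every other user of the class in the same interpreter sees the changed behaviour, which is why it belongs in an application and not in a shared module.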

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
 
Perl/Unix consulting, Technical writing, Comedy, etc. etc.
Still trying to think of something clever for the fourth line of this .sig


Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Peng Yonghua
And, can I override any method from a class via this way? is this a 
general trick? thanks.




Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Peng Yonghua

good patch. thanks for sharing.



Re: mod_perl and utf8 and CGI->param

2017-06-01 Thread Randal L. Schwartz

> "Randal" == Randal L Schwartz  writes:
Randal> Getting really frustrated with mod_perl2's apparent inability to
Randal> properly read UTF8 input.

Randal> Here's my mod_perl2 setup:

Randal>   Apache 2.2.[something]
Randal>   mod_perl 2.0.7 (or nearly that)
Randal>   ModPerl::Registry
Randal>   Perl "script" with CGI.pm

Randal> Very early in my app:

Randal>   ## ensure utf8 CGI params:
Randal>   $CGI::PARAM_UTF8 = 1;

Randal>   binmode STDIN, ":utf8";
Randal>   binmode STDOUT, ":utf8";
Randal>   binmode STDERR, ":utf8";

Randal> This works fine in CGI mode: when I ask for $foo = $cgi->param('foo'),
Randal> DBI::data_string_desc($foo) shows a UTF8 string with the proper
Randal> discrepancy between bytes and chars.

Randal> But when I try to run it under mod_perl, the returned string appears
Randal> to be the raw ascii bytes, and definitely not utf8.  Of course, when I
Randal> store that in the database (using DBD::Pg), the "latin-1" is encoded
Randal> to "utf-8", and I get a bunch of weird chars on the output.

Randal> Has anyone managed to round-trip UTF8 from form to database and back
Randal> using a setup similar to this?

Randal> I suspect part of the problem is this in CGI.pm:

Randal> 'read_from_client' => <<'END_OF_FUNC',
Randal> # Read data from a file handle
Randal> sub read_from_client {
Randal> my($self, $buff, $len, $offset) = @_;
Randal> local $^W=0;# prevent a warning
Randal> return $MOD_PERL
Randal> ? $self->r->read($$buff, $len, $offset)
Randal> : read(\*STDIN, $$buff, $len, $offset);
Randal> }
Randal> END_OF_FUNC

Randal> Since I binmode STDIN, the non-$MOD_PERL works ok here.  What's the
Randal> equivalent of $r->read() that marks the incoming stream as UTF8, so I
Randal> get chars instead of bytes?  Or can I just read(\*STDIN) in mod_perl2
Randal> as well? (I know that was supported at one point...)

I realized that I never posted my ultimate solution.  I monkey patch
CGI.pm:

require CGI;
{
  my $orig = \&CGI::param;
  no warnings 'redefine';
  *CGI::param = sub {
    $CGI::LIST_CONTEXT_WARN = 0; # workaround for backward compatibility
    $CGI::PARAM_UTF8 = 1;
    goto &$orig;
  };
}

And this has been working just fine for both CGI and mod_perl.  Just for the
record.



Re: mod_perl and utf8 and CGI->param

2014-09-14 Thread Joe Schaefer
apreq validates anything it presents as utf8; otherwise it marks it as
ISO-8859-1, or, if that fails, as some Windows encoding whose name I don't
remember.



On Monday, September 8, 2014 3:17 PM, André Warnier a...@ice-sa.com wrote:
 



Hi.
Just an addendum to the discussion :

There are really two distinct approaches to this issue, and they work at
different levels :

1) is to fix CGI.pm so that it delivers the parameters in the way which you
expect.
As shown by the previous valuable and technical contributions, this generally
works, but it requires a certain level of expertise; and it does not
necessarily work backwards with all versions of mod_perl and CGI.pm.

2) is to take whatever CGI.pm does deliver to the calling script or module,
and use a couple of tricks and some additional code in ditto script or module,
to ensure that whatever CGI.pm delivers under whatever mod_perl version, the
receiving script or module always knows in the end what it is dealing with.
That is the method which I presented early in the discussion.
As stated in that contribution, it is not necessarily the most elegant or
efficient way to deal with the issue, but it has the advantage of working
always, no matter which version of CGI.pm and/or mod_perl are in use.

The real crux of the matter is this, in my view : as things stand today in
terms of protocol and RFCs, there is no real way for CGI.pm (or any comparable
framework) to be *sure* of the encoding of the data sent by a browser or
another HTTP client agent.  Even the RFCs do not really provide a way by which
this can be enforced. (*)

So if you are sure of what the client is sending, and the matter consists of
*forcing* CGI.pm to always communicate POST (or GET) data as UTF-8 encoded and
utf8-marked (or the opposite) to the calling script/module, then method 1 will
work, and it is more elegant and probably more efficient than method 2.

But if the matter consists of ensuring that the receiving code in the
script/module which handles the data submitted by the HTTP client, is
resilient and does the right thing whatever the submitted data really was,
then in my opinion method 2 is better.
(But that's only my opinion of the moment, and I stand ready to be corrected).

(*) and if you believe this not to be true, please send me some references
about it, because I am really interested. It might save me some code in all my
web-facing applications.

Re: mod_perl and utf8 and CGI->param

2014-09-08 Thread Michael Schout
On 9/2/14, 4:19 PM, Randal L. Schwartz wrote:

   ## ensure utf8 CGI params:
   $CGI::PARAM_UTF8 = 1;

Sorry to chime in late on this, but part of the problem with CGI.pm and
UTF-8 is that PARAM_UTF8 gets clobbered by a cleanup handler that CGI.pm
itself registers if it's running under mod_perl.

This caused major headaches for me at one time until I figured this out.

You have to make sure to set $CGI::PARAM_UTF8 early, and FOR EVERY
REQUEST, because if you just set it globally (e.g.: in a startup perl
script), then it only works for the first request.

Regards,
Michael Schout
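
The effect described in this message can be simulated without CGI.pm or mod_perl. In the sketch below, simulated_cleanup is a hypothetical stand-in for the per-request reset, and handle_request is a hypothetical handler; only the variable name $CGI::PARAM_UTF8 is taken from the thread.

```perl
use strict;
use warnings;

# Stand-in for the per-request reset described above (an assumption for
# illustration; neither CGI.pm nor mod_perl is loaded here).
sub simulated_cleanup { $CGI::PARAM_UTF8 = 0; return }

# Hypothetical request handler: optionally sets the flag itself.
sub handle_request {
    my (%opt) = @_;
    $CGI::PARAM_UTF8 = 1 if $opt{set_per_request};
    my $utf8_on = $CGI::PARAM_UTF8 ? 1 : 0;   # what param() would see
    simulated_cleanup();                      # runs at the end of each request
    return $utf8_on;
}

# Set once at "startup": only the first request sees the flag.
$CGI::PARAM_UTF8 = 1;
my @startup_only = map { handle_request() } 1 .. 3;

# Set at the top of every request: every request sees the flag.
my @per_request = map { handle_request(set_per_request => 1) } 1 .. 3;

print "startup only: @startup_only\n";   # startup only: 1 0 0
print "per request:  @per_request\n";    # per request:  1 1 1
```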


Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Torsten Förtsch
On 03/09/14 21:38, Randal L. Schwartz wrote:
> What I need to know is what is mod_perl doing differently?  Does it not
> respect binmode STDIN, ":utf8"?  Apparently not.  So if you know of a
> way to get mod_perl to fix reading from the browser properly, I'm
> interested in that.

Something along these lines:

use Apache2::RequestIO ();
use Encode ();
BEGIN {
    my $orig=\&Apache2::RequestRec::read;
    *Apache2::RequestRec::read=sub {
        my ($r, $buf, $len, $offset)=@_;
        my $_buf;
        my $rc=$r->$orig($_buf, $len);
        substr($buf, $offset, undef, Encode::decode_utf8 $_buf);
        return $rc;
    };
}

It's a bit more complicated than that because $_buf may end in the
middle of a character. But you can catch that and read a few more bytes.
Also, not sure if you expect the return value to be in octets or characters.
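
A minimal sketch of the "catch that and read a few more bytes" idea, using the documented Encode::FB_QUIET behaviour: decode() stops at the first sequence it cannot decode and leaves the unprocessed bytes in the source buffer. The decode_chunks helper and the sample data are hypothetical; chunks are passed in as plain strings rather than read from $r.

```perl
use strict;
use warnings;
use Encode ();

# Decode a stream of byte chunks, carrying incomplete trailing bytes forward.
sub decode_chunks {
    my @chunks  = @_;
    my $pending = '';   # undecoded bytes left over from the previous chunk
    my $text    = '';
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # With FB_QUIET, decode() returns what it could decode and
        # overwrites $pending with the bytes it did not process.
        $text .= Encode::decode('UTF-8', $pending, Encode::FB_QUIET);
    }
    return ($text, $pending);
}

# "café!" in UTF-8, split in the middle of the two-byte "é" (C3 A9).
my ($text, $left) = decode_chunks("caf\xC3", "\xA9!");
printf "%d characters, %d bytes left over\n", length $text, length $left;
```

Note that genuinely malformed input (as opposed to a character split across chunks) would stay in the pending buffer forever with this sketch, so real code would need a limit or an error path.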

Though, I wouldn't go this way. I'd either try to force CGI.pm to read
from STDIN and use the perl-script handler
(http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
of that you can push :utf8 then.

The other way I'd prefer over the hack above is to patch CGI.pm to
convert the data after it has read it. You can even do that in your
application. Many applications I have seen have a separate step to
sanitize the input. That would be the place to do that. However, then
you have to watch out for upload fields.
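
A minimal sketch of such a sanitizing step, assuming the parameters have already been collected into a plain hash of raw byte strings. The decode_params helper and the sample parameter names are hypothetical; upload fields are assumed to show up as references (file handles) and are passed through untouched.

```perl
use strict;
use warnings;
use Encode ();

# Decode every plain-string parameter from UTF-8 bytes to characters.
# Values that are references (e.g. upload file handles) are left alone.
sub decode_params {
    my ($params) = @_;
    my %decoded;
    for my $name (keys %$params) {
        my $value = $params->{$name};
        $decoded{$name} = ref $value
            ? $value
            : Encode::decode('UTF-8', $value);
    }
    return \%decoded;
}

my $upload  = \*STDIN;   # stand-in for an upload handle
my $decoded = decode_params({
    city   => "Z\xC3\xBCrich",   # "Zürich" as UTF-8 bytes
    upload => $upload,
});
printf "city has %d characters\n", length $decoded->{city};
```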

So, there is no really simple solution. And I don't think this will be
fixed in modperl because $r has no such concept as an IO layer. The
closest thing httpd/modperl has to offer is an input filter. But that
won't help you here because brigades are handled mainly by httpd which
knows only about octets. You don't want to change the data itself. You
want to change the data's metadata.

Torsten


Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Randal L. Schwartz
> "Torsten" == Torsten Förtsch <torsten.foert...@gmx.net> writes:

Torsten> Though, I wouldn't go this way. I'd either try to force CGI.pm to read
Torsten> from STDIN and use the perl-script handler
Torsten> (http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
Torsten> pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
Torsten> of that you can push :utf8 then.

Yeah, just coded that.  In a BEGIN block in my app, I monkey-patched
read_from_client:

BEGIN {
  ## monkey-patch CGI.pm so we can get proper utf8 handling
  require CGI;
  CGI::_compile_all(qw(
    read_from_client
  ));
  # warn "defined CGI::read_from_client is ", 0 + defined &CGI::read_from_client;

  ## moose 'around' would be nice here. :)
  my $read_from_client = \&CGI::read_from_client;
  no warnings 'redefine';
  *CGI::read_from_client = sub {
    local $CGI::MOD_PERL = $CGI::MOD_PERL;
    warn "prior MOD_PERL is $CGI::MOD_PERL";
    if (our $USE_STDIN_FOR_MOD_PERL) {
      $CGI::MOD_PERL = 0;
    }
    warn "after MOD_PERL is $CGI::MOD_PERL";
    goto $read_from_client;
  };
}

And in my toplevel, I now do this:

sub activate {
  my $self = shift;

  require Carp;
  local $SIG{__DIE__} = \&Carp::confess;

  ## ensure utf8 CGI params:
  local $CGI::PARAM_UTF8 = 1;
  ## and disable mod_perl handling during read_from_client
  local our $USE_STDIN_FOR_MOD_PERL = 1;

  binmode STDIN, ":utf8";
  binmode STDOUT, ":utf8";
  binmode STDERR, ":utf8";

  return $self->SUPER::activate(@_);
}

(This is my CGI::Prototype-based code, from the CPAN...)

I'm properly getting the $CGI::MOD_PERL set to 0, which forces
read from STDIN (via $r) instead of the native STDIN.  In theory.  In
practice, even though I've done a binmode STDIN, I'm still getting raw
bytes from read(\*STDIN...), not utf8-tagged strings.

Not sure what to do next.  Still frustrated.

Why can't the world just use ASCII? :)

(I even tried binmode STDIN, ":encoding(utf8)" just now as well.)
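
One possible contributing factor, not confirmed anywhere in this thread: perldoc -f goto says that goto &sub exits the current subroutine first, "losing any changes set by local()". If that applies here, the localized $CGI::MOD_PERL = 0 in the read_from_client wrapper would already be undone by the time the original read_from_client runs. The sketch below shows the difference with a hypothetical global $MODE and a hypothetical sub reader; it does not load CGI.pm.

```perl
use strict;
use warnings;

# Hypothetical global, standing in for a flag such as $CGI::MOD_PERL.
our $MODE = 'mod_perl';

sub reader { return "read in $MODE mode" }

sub wrapped_with_goto {
    local $MODE = 'stdin';
    goto &reader;          # leaves this sub first, so the local is undone
}

sub wrapped_with_call {
    local $MODE = 'stdin';
    return reader(@_);     # still inside the dynamic scope of the local
}

my $via_goto = wrapped_with_goto();
my $via_call = wrapped_with_call();
print "$via_goto\n";       # read in mod_perl mode
print "$via_call\n";       # read in stdin mode
```

A plain (non-local) assignment before the goto, as in the later $CGI::PARAM_UTF8 patch, is not affected by this.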



Re: mod_perl and utf8 and CGI->param

2014-09-04 Thread Randal L. Schwartz
> "Randal" == Randal L Schwartz <mer...@stonehenge.com> writes:

Randal> Yeah, just coded that.  In a BEGIN block in my app, I monkey-patched
Randal> read_from_client:

And then I've also tried to monkey-patch ->read just as you said.

On the first read, an empty string is apparently returned, which fails
something higher in CGI.pm.  Ugh.

Update:

This monkey patch works:

  *Apache2::RequestRec::read = sub {
    warn "READ CALLED";
    goto $orig;
  };

Although it doesn't do any decoding.  When I replace the body of that
with your code, I'm getting these zero-byte reads.  Even this fails:

    my ($r, $buff, $len, $offset)=@_;
    # my $_buff;
    # my $rc = $r->$orig($_buff, $len);
    my $rc = $r->$orig($buff, $len, $offset);
    # warn "BEFORE: ", DBI::data_string_desc($_buff);
    # utf8::decode($_buff);
    # warn "AFTER: ", DBI::data_string_desc($_buff);
    # substr($buff, $offset, undef, $_buff);
    # warn "AFTER: ", DBI::data_string_desc($buff);
    return $rc;

which should be the same as your code without the utf8 encoding still.
Still getting 0 bytes though.
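
A likely explanation for the empty reads, offered as an assumption since the thread never confirms it: the elements of @_ are aliases to the caller's variables, but my ($r, $buff, ...) = @_ copies them. A read that fills its buffer argument then fills the copy, and the caller's buffer stays empty, whereas goto $orig passes @_ through unchanged. The sketch uses a hypothetical fill_buffer sub in place of the real $r->read.

```perl
use strict;
use warnings;

# Stand-in for a read() that fills the caller's buffer through $_[1]
# (assumed here to be how the real method delivers its data).
sub fill_buffer {
    my $len = $_[2];
    $_[1] = 'x' x $len;
    return $len;
}

my $orig = \&fill_buffer;

sub read_via_copy {
    my ($r, $buff, $len) = @_;         # $buff is a copy of the caller's scalar
    return $orig->($r, $buff, $len);   # fills the copy only
}

sub read_via_alias {
    return $orig->($_[0], $_[1], $_[2]);   # $_[1] is still the caller's scalar
}

my ($copy_buf, $alias_buf) = ('', '');
my $rc_copy  = read_via_copy(undef, $copy_buf, 5);
my $rc_alias = read_via_alias(undef, $alias_buf, 5);
printf "copy:  rc=%d, caller sees %d bytes\n", $rc_copy,  length $copy_buf;
printf "alias: rc=%d, caller sees %d bytes\n", $rc_alias, length $alias_buf;
```

A wrapper that needs to post-process the data would therefore read into its own scratch variable and then assign the result to $_[1] directly.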



Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread André Warnier

Hi Randal.






I share your frustration, as I have been dealing for a long time with multi-lingual web 
applications, using perl and mod_perl.


First, a very top-level comment : the basic problem here is the incompleteness of the HTTP 
RFCs, and the lack of proper support for international character sets, even still today.
When a browser is POST-ing the contents of the input elements of a form to a server, 
there is a set of arcane rules which, in principle, determine the character set in which 
this content is encoded.  The problem is that these rules are often confusing, and in 
addition regularly flouted by different browser makes and versions (not to even talk 
about umpteen non-browser proprietary HTTP client things).


For example, when a browser sends the content of a form in the multipart/form-data 
enctype, the content of each form parameter is sent as a separate section, in a form 
similar to the parts in a multi-part RFC-822 email.  In theory, each of these parts should 
have its own content-type header, and if it is text, it should also contain a charset 
attribute indicating the corresponding data's encoding.
(And if it doesn't, by virtue of the HTTP RFCs, it should be ISO-8859-1, which is still 
the default HTTP character set today; quite ridiculous, but so it is.)


But the sad reality is that browsers don't do that, and so in practice in many cases 
the server-side application is reduced to guessing.


From experience more than from definite code knowledge, I have to suppose that this kind of 
confusion sometimes also hits developers of modules such as CGI.pm and mod_perl, so that 
over the years, things have tended to vary from one version to another (versions of 
browsers, versions of perl, versions of mod_perl, versions of CGI.pm).  Maybe also because 
of all the reasons above, there is just no right way of handling this, so CGI.pm always 
returns bytes (and libapreq2 may do things differently).


In the end, rather than trying to follow the latest developments all the time and 
continuously patch my programs because of all this, I have resorted to some defensive 
programming techniques in terms of interpreting form-posted data, which have been 
working fine for me for the last few years.  It may well be that they are total 
overkill, but in practice they have saved me a lot of time not spent wondering why the 
data in some application suddenly started to show up as A tilde followed by some bizarre 
graphic sign (or, at the opposite, as a question mark embedded in a lozenge).


(Even logging this stuff and trying to figure out what is going on is a pain, because you 
have to figure out first in what encoding you are logging, and second in what encoding you 
are viewing your logs).


The methodology I follow is as follows :

1) all HTML form pages of the applications should have a tag like :
<meta content-type="text/html; charset=...">
2) all forms in the page should have the attributes
enctype="multipart/form-data"
accept-charset="..." (the same as above)

The above 2 things do not really guarantee anything, but at least they establish some 
baseline which helps in interpreting the rest (and slapping users when they change their 
browser settings).


3) all forms contain a hidden text input like
<input type="hidden" name="my-UTF8-check" value="AÜÖ..">  (some known sequence of 
diacritic characters guaranteed to have a different byte length between ISO-8859-x 
and UTF-8 encoding)

Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Cosimo Streppone
On 09/03/2014 11:17 AM, André Warnier wrote:

> 3) all forms contain a hidden text input like
> <input type="hidden" name="my-UTF8-check" value="AÜÖ..">  (some known
> sequence of diacritics characters guaranteed to have a different byte
> length between ISO-8859-x and UTF-8 encoding)
> [...]
> But it's helped me sleep better for quite a while now.

This is brilliant :-)
Thanks André.

-- 
Cosimo
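
A minimal sketch of how the server side of that hidden-field check might look, assuming the raw bytes of the check field are available as received. The guess_form_encoding helper and the three-character check value are hypothetical; the quoted message does not show its own implementation.

```perl
use strict;
use warnings;
use Encode ();

# Known check value "AÜÖ", written with escapes so that the encoding of
# this source file does not matter.
my $CHECK = "A\x{DC}\x{D6}";

# Compare the raw bytes received for the hidden field against the known
# value encoded both ways. The byte lengths differ (5 vs 3), so the two
# cases cannot be confused.
sub guess_form_encoding {
    my ($raw_check) = @_;
    return 'UTF-8'      if $raw_check eq Encode::encode('UTF-8', $CHECK);
    return 'ISO-8859-1' if $raw_check eq Encode::encode('ISO-8859-1', $CHECK);
    return 'unknown';
}

my $from_utf8_form   = guess_form_encoding("A\xC3\x9C\xC3\x96");
my $from_latin1_form = guess_form_encoding("A\xDC\xD6");
my $from_other       = guess_form_encoding("A??");
print "$from_utf8_form, $from_latin1_form, $from_other\n";
```

Once the encoding of the check field is known, the same encoding name can be passed to Encode::decode for the other text fields of the same submission.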


Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Randal L. Schwartz
> "André" == André Warnier <a...@ice-sa.com> writes:


André> The methodology I follow is as follows :

André> 1) all html form pages of the applications should have a tag like :
André> <meta content-type="text/html; charset=...">
André> 2) all forms in the page should have the attributes
André> enctype="multipart/form-data"
André> accept-charset="..." (the same as above)

I've pretty much got success with CGI (and CGI.pm) doing the things I
listed above.  So this isn't needed.  I'm not having problems with the
browser, Apache, or Perl, or RDBO, or PostgreSQL.  (Even that took a bit
of work to get working, and so I think none of those are the issue.)

What I need to know is what is mod_perl doing differently?  Does it not
respect binmode STDIN, ":utf8"?  Apparently not.  So if you know of a
way to get mod_perl to fix reading from the browser properly, I'm
interested in that.



Re: mod_perl and utf8 and CGI->param

2014-09-03 Thread Dr James A Smith
I encode a pound sign as a parameter, which indicates whether the 
content is UTF-8, UCS or latin-1, and this seems to resolve most of the 
issues... I did put a lot of effort into fixing issues with utf8, and there 
are a lot of these: between form and post; between requests if storing 
data in sessions; between script and database; etc...


I do however not use CGI.pm but use APR instead, which I know works (and 
may be less error prone).


James
