On Thu, Jun 4, 2009 at 3:39 PM, Gisle Aas <gi...@aas.no> wrote:
> On Thu, Jun 4, 2009 at 11:46, Alex Kapranoff <k...@nadoby.ru> wrote:
> > Oh, I see. I suppose this can break old scripts that expect 8 bit
> > characters without UTF-8 flag to come through unaltered.
>
> They would then have to set $form->accept_charset("latin-1") as a
> workaround. I think that's acceptable. I would hate to make this the
> default for backwards compatibility.

This will also affect people who use koi8-r, cp1251 or any of the hundreds
of 8-bit encodings that were in wide use :)

I think this point should at least be documented in POD and Changes, then.
Something like this:

  HTML::Form now always encodes data from its inputs into the destination
  encoding when generating HTTP::Request objects. To specify the
  destination encoding, use the accept_charset() method, or the standard
  "accept-charset" attribute of the <form> tag if you create the
  HTML::Form instance using the parse() constructor. The destination
  encoding defaults to UTF-8 (imitating modern browsers).

  If you want your 8-bit data to come through unchanged, you have two
  choices:

  1) decode it from $your_charset into the internal Unicode
     representation using Encode::decode() before feeding it into
     HTML::Form, and then call accept_charset($your_charset), or
  2) call accept_charset("latin1").

  The latter method is not recommended unless you really use latin1.

> We should also find a way to propagate the original charset of the
> HTML document that's parsed. This should then be the default
> accept_charset(), what you get when the attribute is still 'UNKNOWN'.
> For the HTML::Form->parse($response) case this should happen
> automatically.

Yes, totally! That was my original intention behind the patch :) We do
exactly that in our WWW::Mechanize scripts now -- that is, we manually set
accept_charset() on forms from the document charset. It would be awesome
to have some automatic propagation.

> --Gisle
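
P.S. For the record, here is roughly what the first choice from the proposed
POD text would look like in a script. This is only a sketch: it assumes an
HTML::Form version that has accept_charset() and encodes values on submit
as described above, and the koi8-r sample bytes, the example.com URL and
the field name "q" are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Form;

# Hypothetical input: four raw koi8-r bytes (a short Russian word),
# i.e. an 8-bit string without the UTF-8 flag.
my $koi8_bytes = "\xD4\xC5\xD3\xD4";

# Step 1: decode the bytes into Perl's internal Unicode representation.
my $text = decode("koi8-r", $koi8_bytes);

# A made-up form, parsed from a string; base is needed to resolve the action.
my $form = HTML::Form->parse(<<'HTML', base => "http://example.com/");
<form action="/search" method="post">
  <input type="text" name="q">
</form>
HTML

# Step 2: feed the decoded character string into the form.
$form->value(q => $text);

# Step 3: name the destination encoding the server expects.
$form->accept_charset("koi8-r");

# Build the HTTP::Request; the value should be encoded back to koi8-r here.
my $request = $form->click;
print $request->content, "\n";
```

With WWW::Mechanize the same accept_charset() call can be made on the form
object before submitting, which is what we do by hand today.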