Hans:

On Wed, Nov 26, 2014 at 8:49 AM, Hans Ginzel <h...@matfyz.cz> wrote:
> Hello!

Hi,

> Consider a small perl code below.
> It should copy text file with removing leading and trailing spaces.
>
> while (<>) {
>   s/^\s+//; s/\s+$//;
>   print $_, "\n";   # say;
> }
>
> I run it with "shell" redirection
> perl copy.pl <src.txt >dst.txt
>
> It works well for Windows ansi and utf-8 text files. But when I have
> tried an unicode (ucs-2le) source file containing
> "a\nb" this is
> FF FE 61 00 0D 00 0A 00 62 00 0D 00 0A 00
> in hex (with Byte Order Mark) I get characters in hex
> FF FE 61 00 0D 00 0D 0A 00 62 00 0D 00 0D 0A 00 0D 0A
> .
> I have attached these files but I am not sure what Mail Agents do with
> them.
>
> Variable PERL_UNICODE is not set.
>
> I have tried add -CS to the command line, but got info about Malformed
> UTF-8 character.
> I have tried adding each of these pragmas to the beginning
> use open ':encoding(UCS-2LE)';
> use open IO => ':encoding(UCS-2LE)';
> use open ':std' => ':encoding(UCS-2LE)';
> but without desired goal. I tried to combine the pragma with -CS
> option.
> I have tried use feature qw/say/; say; instead of print $_, "\n"; but
> without correct results.
>
> perl --version
> This is perl 5, version 18, subversion 2 (v5.18.2)
> built for MSWin32-x86-multi-thread-64int
>
> What is the correct way to set stdin/out to UCS-2LE, please?
> What is the correct way to print "encoding independent" new line
> character, please?
> What is the correct way to say that \s should match the "UCS-2LE way",
> please?

You can generally pass an :encoding() LAYER to binmode to specify the text encoding (see perldoc -f binmode). For file handles that you create yourself with open, you can pass the layers directly in the MODE argument (see perldoc -f open). You should of course be using the 3-or-more-argument version of open regardless, but this matters especially if you intend to use tainted (perldoc perlsec) data to alter the IO layers.
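To make that concrete, here is a minimal sketch of passing layers in the MODE argument. The helper name trim_copy and the in-memory sample data are made up for illustration, and I'm assuming that putting :raw first is what keeps a platform default layer such as :crlf from sitting underneath the encoding layer.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not from the original script): copy lines from $in
# to $out, trimming leading and trailing whitespace from each one.
sub trim_copy {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        print {$out} $line, "\n";
    }
    return;
}

# Illustrative sample input, encoded as UCS-2LE and held in memory so the
# sketch runs without any files on disk.
my $source = encode('UCS-2LE', "  a \r\n b\r\n");
my $result = '';

# The layers go in the MODE argument of three-argument open. ':raw' comes
# first so that the encoding layer works on untranslated bytes.
open(my $in,  '<:raw:encoding(UCS-2LE)', \$source) or die "open input: $!";
open(my $out, '>:raw:encoding(UCS-2LE)', \$result) or die "open output: $!";

trim_copy($in, $out);
close($in);
close($out) or die "close output: $!";

# For the standard streams, the same layers can be applied with binmode:
#   binmode(STDIN,  ':raw:encoding(UCS-2LE)') or die "binmode STDIN: $!";
#   binmode(STDOUT, ':raw:encoding(UCS-2LE)') or die "binmode STDOUT: $!";

# Decode the written bytes again so they can be inspected as text.
my $text = decode('UCS-2LE', $result);
```

With :raw in place the "\n" is written as a bare LF, so if the output needs CR LF line endings you would print "\r\n" explicitly.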
I wasn't familiar with the open pragma, but according to perldoc open, for it to affect the standard streams (as opposed to new file handles that you create yourself) you must also include :std, which is a subpragma rather than a layer. It looks like you tried that. Going by the documentation, :std is passed as its own item in the import list, next to the encoding, so I would guess the incantation you need is:

    use open ':std', ':encoding(UCS-2LE)';

Or variations thereof, or thereabouts. That is effectively your third attempt, so the pragma on its own may not be enough here. The extra 0D bytes in your hex dump suggest that the Windows CRLF translation is being applied to the already encoded bytes, so the order of the layers on the stream probably matters as well.

I recommend you familiarize yourself with open(), binmode(), and the Encode module to rid yourself of text encoding doubts. Since you're using the open pragma you should also read its documentation, but there isn't much to that perldoc, so I'm guessing you already have. Perl actually makes this pretty easy, but you do need a basic understanding of the system to use it properly. You can also look into PerlIO, which provides the mechanisms for doing this automatically on a stream.

> In addition, is there a standardised way to auto-detect input encoding
> (legacy(8bit)/utf-8/ucs-2), please?

Unfortunately there's no perfect way to detect character encodings. Many encodings use the same byte values and have no identifying features. In general, the encoding needs to be declared in the headers or content of a stream (if being sent by machine) or in user-specified options (if being sent by humans). For example, there are ways to specify text encoding for HTTP messages in the headers. There are ways to specify text encoding in an HTML or XML document.

Of course, if the declaration is itself written in that encoding, how are you supposed to read it? I'm not sure. I guess you can try candidate encodings until the text makes sense, hobble along until you find the declaration, and then reinterpret the text properly.

Search metacpan.org for "guess encoding". I know that there are several modules that attempt to solve this problem. Note that they aren't flawless because they can't be. It's not a problem with a perfect solution.
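As one example of such a module, here is a hedged sketch using Encode::Guess, which ships with the Encode distribution. By default it only considers ASCII, UTF-8, and BOM-marked UTF-16/32, so it can handle the sample from this thread only because of the BOM. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# The sample from this thread: BOM, "a", CR LF, "b", CR LF.
my $octets = "\xFF\xFE"
           . "a\x00\x0D\x00\x0A\x00"
           . "b\x00\x0D\x00\x0A\x00";

# guess_encoding() returns an encoding object on success, or a plain
# error string when the guess fails or is ambiguous.
my $decoder = guess_encoding($octets);

my ($name, $text);
if (ref $decoder) {
    $name = $decoder->name;
    $text = $decoder->decode($octets);    # decoded characters, BOM consumed
    print "Guessed encoding: $name\n";
}
else {
    warn "Could not guess the encoding: $decoder\n";
}
```

Additional suspects can be passed as extra arguments to guess_encoding(), but 8-bit legacy encodings generally cannot be told apart this way, which is the limitation described above.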
The best option is to have the machine or user that is giving you the data also tell you its format and text encoding. A simple way to do this is to implement a command-line option in your program (e.g., see Getopt::Long) that overrides a sane default encoding (e.g., UTF-8). If you control the software you can also define that the data *must* be in a particular encoding (e.g., UTF-8) and just assume that it is.

Regards,

--
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/