RE: Some UTF-8-related questions

2012-01-11 Thread Hamann, T.D. (Thomas)
>>Hi,

>>Thanks for the answers on my last question. I have since then dug a bit 
>>further in the UTF-8-related error message I got, and after some reading 
>>have a few questions with regards to UTF-8 handling in perl:

>>(Please bear in mind that I am not an IT guy)

> Worry not -- Basically no IT person gets this right anyway : )
 
>>1a) My use statements are the following:

>>use warnings;
>>use strict;
>>use utf8;
>>use open ':encoding(utf8)';

>I would add

>use feature qw(unicode_strings);

>or even

>use if $^V ge v5.12, feature => qw(unicode_strings);

>and replace :encoding(utf8) with :encoding(UTF-8), but see below.
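
A small sketch of what the suggested feature changes, assuming perl 5.12 or later. The helper name upper_e_acute is made up for illustration. With unicode_strings in effect, string functions such as uc() use Unicode rules even on a string that happens to be stored as single bytes:

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # needs perl 5.12 or later

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE, stored as a single byte.
# Under unicode_strings, uc() applies Unicode case rules to it.
sub upper_e_acute {
    my $e_acute = "\xE9";
    return uc $e_acute;
}

# Prints the code point of the uppercased character.
printf "U+%04X\n", ord upper_e_acute();
```

Without the feature, the same uc() call may leave the character unchanged, which is one of the surprises the feature is meant to remove.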
 
Thanks. That looks very useful. Would it also be a good idea to upgrade perl to 
5.14 instead of 5.12?

>>Now if I understand it correctly, there's two ways of encoding UTF-8 in perl: 
>>One liberal (utf8) and one strict 
>>(UTF-8). For my purpose, I need correctly encoded UTF-8 files. However, I 
>>cannot be sure whether the files I 
>>start with are properly encoded in UTF-8.

>That's mostly right, but I think you are mistaken about the usage of the 
>lax version, utf8. The latter is only useful when reading something 
>produced by another Perl process that used the lax encoding and output 
>illegal UTF-8.

>For example:

>use feature 'say';
>use Devel::Peek;
>use warnings;

># Write with the lax layer, which does not reject illegal code points.
>open my $out_fh, ">:utf8", "invalid_UTF-8.txt" or die $!;
>say { $out_fh } "This here: [\x{_}] is illegal UTF-8, but valid in Perl's lax internal encoding";
>close $out_fh or die $!;

># Read the same file back, once with each layer, and dump the result.
>for my $encoding ( qw< utf8 encoding(UTF-8) > ) {
>    say "Encoding: [$encoding]";
>    open my $in_fh, "<:$encoding", "invalid_UTF-8.txt" or die $!;
>    my $line = <$in_fh>;
>    Dump $line;
>    close $in_fh;
>}

>What you get depends on whether $encoding is utf8 or encoding(UTF-8), though 
>the difference is a bit hard to spot. For the former, you'll get back the 
>string that you originally printed, but for the latter, Encode will complain 
>about \x{_} not being in Unicode, and give you a string with a literal 
>\x{}, as if you had written it in single quotes!

>The bottom line is that you scarcely ever want the lax, internal form. More so 
>because it's subject to change in upcoming Perl versions, since what it 
>currently does is whack.
 
>>So is it possible to open a file using the liberal interpretation, and write 
>>to a new file using the strict interpretation? Are there any issues 
>>regarding this, like characters that might not be re-encoded properly?

>See the above example. Should be entirely fine as long as the contents of the 
>file are all legal UTF-8.
 
So basically I could use the strict version of UTF-8 encoding for all of my 
scripts, as long as the original file is valid UTF-8 and I use valid UTF-8 
characters in my scripts when I need them.
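
For the "as long as the original file is valid UTF-8" part, here is a minimal sketch of one way to check a file, using the Encode module that ships with perl. The helper names is_strict_utf8 and file_is_strict_utf8 are made up for illustration; the idea is to read the raw bytes and let the strict decoder die on the first malformed sequence:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns 1 if $bytes is well-formed strict UTF-8, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when a CHECK value is given
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Reads a whole file as raw bytes and checks it.
sub file_is_strict_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't read $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    return is_strict_utf8($bytes);
}
```

A file saved as latin-1 or Windows-1252 that contains accented characters will normally fail this check, while a pure-ASCII file passes, since ASCII is a subset of UTF-8.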


>>2b) Do scripts themselves have to be encoded in UTF-8 to be able to process 
>>UTF-8-files?

>Nope.
 
>>If not, when should you encode the scripts in UTF-8 and when not?

>When you are using UTF-8 literals in your code, for example

>say "In katakana, [ni] is [ニ]";

>or

>my $león = "Simba";

>In which case the file needs to have a "use utf8;" on top, as well as being 
>properly encoded in UTF-8.
 
Alright. I had the "use utf8;" in the scripts, but they weren't encoded in 
UTF-8.

>>Most of my scripts add text to UTF-8 encoded text files. I've noticed that 
>>this sometimes seems to change the
>> encoding or give error messages when e.g. accented characters are involved. 
>> Am I right in assuming that only
>> scripts that remove text or extract certain parts do not need to be encoded 
>> in UTF-8?

>The encoding of the source has basically no relevance whatsoever [*], unless 
>you are using "use encoding", which you shouldn't. Errors with accented 
>characters are probably due to using latin-1 and mistakenly assuming 
>that you are using UTF-8, or the reverse.
>The likely culprits for this sort of thing are that you forgot to "use 
>utf8", or your editor isn't outputting UTF-8 (maybe latin-1?), or you are 
>using the wrong encoding for reading/writing.

>[*] Nitpick: Unless you are reading things from a __DATA__ section, which 
>inherits the UTF8-ness of the file in 
>which it was found.

See later for the script I am having problems with.


>>2c) Not really a perl question: Does anyone know of a monospaced font for 
>>Windows that handles most UTF-8 
>>characters gracefully? I would like one for use in Notepad++ to make it 
>>easier to write scripts containing 
>>special characters not normally displayable in Windows.

>Symbola. It's awesome. \N{DROMEDARY CAMEL}

Thanks! Looks like it has the characters I need, too.
 
>>3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM.

>Nope. Windows uses UTF-16, which requires a BOM to distinguish between 
>UTF-16LE and UTF-16BE. Most Unices use UTF-8, which doesn't require a BOM 
>and, in fact, using one is against Unicode's recommendation. If you 
>spot a file with a UTF

Some UTF-8-related questions

2012-01-11 Thread Hamann, T.D. (Thomas)
Hi,

Thanks for the answers on my last question. I have since then dug a bit further 
in the UTF-8-related error message I got, and after some reading have a few 
questions with regards to UTF-8 handling in perl:

(Please bear in mind that I am not an IT guy)

1a) My use statements are the following:

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

Now if I understand it correctly, there's two ways of encoding UTF-8 in perl: 
One liberal (utf8) and one strict (UTF-8). For my purpose, I need correctly 
encoded UTF-8 files. However, I cannot be sure whether the files I start with 
are properly encoded in UTF-8. 
So is it possible to open a file using the liberal interpretation, and write to 
a new file using the strict interpretation? Are there any issues regarding 
this, like characters that might not be re-encoded properly?

1b) How can I check whether a file is properly encoded UTF-8?


2a) As I understand it, Windows has a somewhat limited ability to display 
certain UTF-8 characters, although some fonts can display more of them. The 
characters do exist in the file, even if Windows can't display them (besides 
showing a square). Is this correct? If not, does that impact perl's ability to 
handle Unicode? 

2b) Do scripts themselves have to be encoded in UTF-8 to be able to process 
UTF-8-files? If not, when should you encode the scripts in UTF-8 and when not? 
Most of my scripts add text to UTF-8 encoded text files. I've noticed that this 
sometimes seems to change the encoding or give error messages when e.g. 
accented characters are involved. Am I right in assuming that only scripts that 
remove text or extract certain parts do not need to be encoded in UTF-8?

2c) Not really a perl question: Does anyone know of a monospaced font for 
Windows that handles most UTF-8 characters gracefully? I would like one for use 
in Notepad++ to make it easier to write scripts containing special characters 
not normally displayable in Windows.


3) Windows uses UTF-8 with BOM, Unix and Unix-likes UTF-8 without BOM. A 
particular script of mine prepends a piece of text to UTF-8 encoded text files 
created with MS Word on Windows (saved as .txt with UTF-8 encoding). 
Unfortunately, this appears to break the encoding, which changes from "UTF-8 
with BOM" to "UTF-8 without BOM", probably because the text is inserted 
*before* the BOM at the start of the file. How do I prevent this? How can my 
script recognize the BOM at the start of the file?

Thanks for reading.

Regards,
Thomas
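
On question 3, a minimal sketch of how a script could recognize the BOM and keep it in front of prepended text. It assumes the file content has already been read through an :encoding(UTF-8) layer, in which case the BOM shows up as the character U+FEFF at the very start of the string. The helper name prepend_keeping_bom is made up for illustration:

```perl
use strict;
use warnings;

# Prepends $text to already-decoded $content. If $content starts with a
# BOM (U+FEFF), the BOM is moved so that it stays the first character.
sub prepend_keeping_bom {
    my ($text, $content) = @_;
    my $had_bom = ($content =~ s/\A\x{FEFF}//) ? 1 : 0;
    return ($had_bom ? "\x{FEFF}" : '') . $text . $content;
}
```

The result can then be written out through the same :encoding(UTF-8) layer. Whether the BOM should be kept at all depends on the programs that read the file afterwards.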










--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




RE: Bizarre problem: Known good script (in 2011) fails to work in 2012

2012-01-04 Thread Hamann, T.D. (Thomas)
Okay, some further testing using a family member's Windows XP PC and a fresh 
install of ActivePerl seems to have revealed the culprit:

Changing 
s/\s+$//;
to:
s/(\s+$)(\n)/$2/;

fixed the issue. 

Since the script worked fine until about 3 weeks ago and I copied the original 
code from http://www.perlmonks.org/?node_id=2258, I can only surmise that 
Microsoft must have changed the way Windows XP deals with newlines in a very 
recent update. Which they could have communicated with the outer world. :( 

(oh well, another reason to dislike Microsoft, I guess).

Now for another question: How much code will this change break? 

Thomas
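
A short sketch of the difference between the two substitutions, independent of the operating system. In Perl regexes \s also matches "\n", so s/\s+$// removes the line's newline together with the trailing spaces. The helper names are made up for illustration, and the character class [^\S\n] ("whitespace except newline") is an alternative spelling, not the fix used above:

```perl
use strict;
use warnings;

# The original substitution: \s also matches "\n", so the newline goes too.
sub strip_all_trailing {
    my ($line) = @_;
    $line =~ s/\s+$//;
    return $line;
}

# Alternative: [^\S\n] matches whitespace other than "\n", and $ can match
# just before a final newline, so that newline survives.
sub strip_trailing_blanks {
    my ($line) = @_;
    $line =~ s/[^\S\n]+$//;
    return $line;
}
```

Another common approach is to chomp each line, clean it, and print it with an explicit "\n". Nothing in this sketch can confirm or rule out a change on the Windows side.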



From: Hamann, T.D. (Thomas) [ham...@nhn.leidenuniv.nl]
Sent: Wednesday, 4 January 2012 12:27
To: beginners@perl.org
Subject: Bizarre problem: Known good script (in 2011) fails to work in 2012

Hi,

I am having a rather unusual problem with a script that I wrote last year to 
clean unwanted contents out of UTF-8 encoded text files. It worked fine in the 
past, but when I try to run it now I get an error message and somehow all 
newlines are removed from the resulting file. Nothing was changed between 2011 
and 2012 in the script, which I give below:

#!/usr/bin/perl
# filecleaner.plx

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

my $source = shift @ARGV;
my $destination = shift @ARGV;

open IN, $source or die "Can't read source file $source: $!\n";
open OUT, ">$destination" or die "can't write on file $destination: $!\n";

while (<IN>) {
    # Replaces all tab-characters with spaces:
    s/\t/ /g;
    # Replaces all hyphens that are both preceded and trailed by a space by
    # long dashes preceded and trailed by a space:
    s/ - / — /g;
    # Removes the leading space(s) from a variety of unwanted combinations:
    s/( +)( |\.|,|:|;|\!|\]|\)|\n)/$2/g;
    # Removes multiple dots:
    s/\.+/./g;
    # Removes multiple commas:
    s/,+/,/g;
    # Removes multiple colons:
    s/:+/:/g;
    # Removes multiple semi-colons:
    s/;+/;/g;
    # Removes commas before dots:
    s/(,+)(\.)/$2/g;
    # Removes the trailing spaces and dots behind two types of brackets:
    s/(\(|\[)( +|\.+)/$1/g;
    # Removes empty sets of brackets:
    s/(\(|\[)(\)|\])//g;
    # Removes whitespace at beginning of line:
    s/^\s+//;
    # Removes whitespace at end of line:
    s/\s+$//;
    # Prints all non-empty lines to file:
    if (!/^\s*$/) {
        print OUT $_;
    }
}

close IN;
close OUT;

The error message ("Malformed UTF-8 character (unexpected continuation byte 
0x97, with no preceding start byte) at filecleaner.plx line 23") seems to refer 
to the long dash in line 23. This was copied out of a UTF-8 encoded file in 
2011. If I change that to another UTF-8 long dash copied from another UTF-8 
file downloaded off
the internet, the error message goes away. However, if I copy the dash out of a 
supposedly UTF-8 encoded file made in Word I get the error message.
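
One plausible reading of the error: 0x97 is the byte for the long dash in Windows-1252, whereas in UTF-8 the same character (U+2014) is the three bytes E2 80 94. A script that says "use utf8" but contains the single byte 0x97 would therefore be reported as malformed. A minimal sketch that sidesteps the question by writing the dash as an ASCII-only escape; the helper name hyphen_to_em_dash is made up for illustration:

```perl
use strict;
use warnings;

# Replaces " - " with an em dash (U+2014) surrounded by spaces.
# The \x{2014} escape is plain ASCII, so the result does not depend on
# the encoding in which the editor saved the script.
sub hyphen_to_em_dash {
    my ($line) = @_;
    $line =~ s/ - / \x{2014} /g;
    return $line;
}
```

The output still has to be written through an encoding layer such as :encoding(UTF-8) for the dash to end up as valid UTF-8 in the file.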

With the dash fixed, however, the newlines still get stripped out of the file, 
which leaves me at a complete loss, since nothing in the code ought to chomp 
off newline characters.

What could cause such behaviour? Corrupt script file? Corrupted perl 
installation? Some stupid recent Windows update that screwed up UTF-8 and/or 
file handling in Windows XP? Before I went on Christmas holidays
things were fine...

Any ideas?

Thanks,

Thomas








RE: How to put an AND in a regex?

2011-10-13 Thread Hamann, T.D. (Thomas)
Thanks!

That was really simple (so simple I did not think about it ;)

Thomas


From: Igor Dovgiy [ivd.pri...@gmail.com]
Sent: Thursday, 13 October 2011 14:10
To: beginners@perl.org
Subject: Re: How to put an AND in a regex?

Hmm, probably you should. To use two of them in AND combination, just... use
two of them. )

/^(?![[:upper:]][[:upper:]])(?!\d)/

And it gets even better: you may mix any number of look-aheads in a single
regex this way. )

-- iD
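
A small sketch of the combined look-aheads in use. Each (?!...) is a zero-width test at the same position, so the match succeeds only when all of them hold, which gives the AND behaviour. The helper name starts_plain is made up for illustration:

```perl
use strict;
use warnings;

# Returns 1 when the line starts with neither two uppercase letters
# nor a digit, 0 otherwise.
sub starts_plain {
    my ($line) = @_;
    return $line =~ /^(?![[:upper:]][[:upper:]])(?!\d)/ ? 1 : 0;
}

for my $line ("Hello world", "HELLO WORLD", "1 item") {
    printf "%-12s => %d\n", $line, starts_plain($line);
}
```

Further conditions can be added as extra (?!...) groups after the ^ without nesting any if blocks.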

2011/10/13 Hamann, T.D. (Thomas) 

>
> Hi,
>
> I am trying to write a regex that should only match when certain patterns
> are not present, e.g. when a line does not start with either a digit or
> ALL-CAPS text. I figured I could use negative look-aheads for this.
>
> I can write it as:
>
> if (/^(?![[:upper:]][[:upper:]])/) {
>     if (/^(?!\d)/) {
>         s/^//;
>     }
>     else {
>     }
> }
> else {
> }
>
> However, I was wondering whether there was a way of writing this as a
> single if loop, because there are much more than two situations that should
> not be matched.
>
> I tried to write it as:
>
> if (/^(?![[:upper:]][[:upper:]])|^(?!\d)/) {
> s/^//;
> }
> else {
> }
>
> but this means if one option is not matched the other one is matched, which
> is not what I want. So I need something that does the equivalent of "Don't
> match this AND don't match this". Is this possible in a if loop, or should I
> use something else?
>
> Thanks,
>
> Regards,
> Thomas Hamann








RE: Matching Greek letters in UTF-8 file

2011-10-10 Thread Hamann, T.D. (Thomas)
Many thanks for the replies. Reading the documentation, it looks like it's a 
bit more complicated than I had hoped.

On the other hand, I realized that for my purpose (removing unwanted hyphens 
from an OCR'ed document), I don't actually need to match the Greek letters, 
because they occur in two unique formats throughout the whole document (which 
should match \w- and -\w- ).

Thomas



From: Brian Fraser [frase...@gmail.com]
Sent: Thursday, 29 September 2011 16:59
To: John Delacour
CC: beginners@perl.org
Subject: Re: Matching Greek letters in UTF-8 file

On Thu, Sep 29, 2011 at 10:58 AM, John Delacour wrote:

> use encoding 'utf-8';
>
>

Nitpick: Please don't use this, as encoding is broken. use utf8; and use
open qw< :std :encoding(UTF-8) >; should do as a replacement.

To the original poster, please note that there's a bit of a difference in
case-insensitive matching (i.e. using /i) -- newer versions of Perl do full
casefolding (so \N{GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI}
matches \N{GREEK SMALL LETTER ALPHA WITH PSILI}\N{GREEK SMALL LETTER IOTA}),
whereas older versions don't. So if you need to do that, I'd recommend
giving the docs a thorough read. Also this:
http://98.245.80.27/tcpc/OSCON2011/upr.html




Matching Greek letters in UTF-8 file

2011-09-29 Thread Hamann, T.D. (Thomas)
Hi,
 
I need to write a regex that matches any single Greek letter followed by a 
hyphen in a UTF-8 text file that is otherwise in English.
 
How can I match the Greek alphabet (lower and upper case)?
 
Thanks,
Thomas
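
A minimal sketch of one way to match this, using the Unicode script property \p{Greek}. It assumes the text has already been decoded into characters (for example by reading the file through :encoding(UTF-8)). The helper name has_greek_letter_hyphen is made up for illustration:

```perl
use strict;
use warnings;

# Returns 1 when $text contains a single Greek letter followed by a hyphen,
# 0 otherwise. (?<!\p{L}) makes sure no other letter comes directly before
# it, and (?=\p{L}) makes sure the Greek character is a letter.
sub has_greek_letter_hyphen {
    my ($text) = @_;
    return $text =~ /(?<!\p{L})(?=\p{L})\p{Greek}-/ ? 1 : 0;
}
```

\p{Greek} covers both lower and upper case, so no /i flag is needed here. The test strings below use \x{03B1} (alpha) and \x{03B2} (beta) escapes so that the example does not depend on the encoding of the script itself.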