Re: regex & utf8

Tom Allison Sat, 12 May 2007 05:26:14 -0700

Rather than going through the somewhat buggy process of trying to determine which of the many character sets there are, is there some way that I can just universally convert everything into UTF8?

I can open a file with a :utf8 declaration when creating the filehandle. But do I need to do this on a utf8 file, or will perl just "know"? If it doesn't, can I just open everything in utf8 mode and not lose any data?



On May 12, 2007, at 5:04 AM, Dr.Ruud wrote:

Tom Allison schreef:

Under perl version 5.8, does /(\w+)/ match UTF-8 characters without
calling any special pragma?

Yes, but only if your data is proper. Mind that any ASCII character is a
UTF-8 character too (U+0000 .. U+007F).

So I'm trying to see if I can just use /(\w+)/ without worrying about
all this character encoding?


Only if your data is proper. A file is just a string of bytes. If you
use the proper IO-layer while reading in the file, then you'll end up
with proper data (a string of characters, not of bytes) to work with.
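
To make the IO-layer point concrete, here is a minimal sketch. It assumes a scratch file holding the UTF-8 bytes for "café"; the helper name first_word is made up for this illustration. The same bytes are read once undecoded and once through a decoding layer, and /(\w+)/ is applied to each result.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "café" (é is the two bytes 0xC3 0xA9) to a scratch file.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "caf\xC3\xA9\n";
close $tmp or die "Cannot close $path: $!";

# Hypothetical helper: read the first line through the given layer
# and return the first /(\w+)/ match, or undef if there is none.
sub first_word {
    my ($layer) = @_;
    open my $fh, "<$layer", $path or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line =~ /(\w+)/ ? $1 : undef;
}

my $as_bytes = first_word(':raw');              # a string of bytes
my $as_chars = first_word(':encoding(UTF-8)');  # a string of characters

printf "raw layer:      matched %d unit(s)\n", length $as_bytes;
printf "decoding layer: matched %d unit(s)\n", length $as_chars;
```

With the decoding layer the match is the four-character word "café". With the raw layer the regex sees individual bytes, so the match is not that word.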

A UTF-8 encoded file can't tell you that it is UTF-8 encoded. For
example, a UTF-8 BOM at the start (as Windows Notepad uses) is not proof.
So you need to know beforehand.
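
One thing that can be checked is whether a byte string is at least well-formed UTF-8, by decoding it strictly with the Encode module. The sketch below is only an illustration; the helper name looks_like_utf8 and the sample strings are made up. A failed check rules UTF-8 out, but a passed check proves nothing: plain ASCII passes too, and so can text in other encodings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes decodes as strict UTF-8, else 0.
# $bytes is a private copy, so decode() may modify it freely.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $utf8   = "caf\xC3\xA9 au lait";  # "café au lait" encoded as UTF-8
my $latin1 = "caf\xE9 au lait";      # the same text encoded as ISO-8859-1
my $ascii  = "cafe au lait";         # plain ASCII

printf "%-7s %d\n", 'utf8',   looks_like_utf8($utf8);
printf "%-7s %d\n", 'latin1', looks_like_utf8($latin1);
printf "%-7s %d\n", 'ascii',  looks_like_utf8($ascii);
```

The ISO-8859-1 sample fails because the byte 0xE9 starts a multi-byte sequence that the following space does not continue.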

--
Affijn, Ruud

"Gewoon is een tijger."


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/



