Re: Pattern matching with Unicode (5.6.1)

Nicholas Clark Thu, 15 Aug 2002 15:07:16 -0700

On Thu, Aug 15, 2002 at 05:28:43PM -0400, David Gray wrote:
> > I'm having a bit of a problem getting Unicode pattern 
> > matching to do what I would like it to.
> 
> I guess my question wasn't entirely clear. I'm reading in the attached
> file and trying to split it on "\n\n".
> 
> When I'm looping over the file,
> 
> > I've (sort of) made it work by doing:
> > 
> >  # strip BOM and trailing nulls and carriage returns
> >  s/^..// if $. == 1 and s/\0//g;
> >  s/[\0\r]//g;
> 
> The two-byte BOM has me thinking it's probably UTF-16. Is there an easy
> way to tell what encoding a file uses?


Not that I know of, but all the 0 bytes make me think it is UTF-16.

> > But I'm sure there must be a more elegant way to do this. 
> > Honestly, I'm not even sure where to start. Any ideas?

I find that this:

perl5.6.1 -we 'undef $/; $_=<STDIN>; $_ = pack "U*", unpack "v*", $_; substr ($_, 0, 1) = ""; print $_' </tmp/unicode.txt

gives me this:

fdn "grp1",55,"","",0

fdn "grp2",55,"","",0

fdn "grp3",55,"","",0

fdn "grp4",55,"","",0

fdn "grp5",55,"","",0

fdn "TEMP",55,"","",0


The substr takes out the byte order mark.

I guess a better conversion script would read the first two bytes, and
if they look like a UTF-16 byte order mark, choose whether to use v
(little-endian) or n (big-endian) in the unpack based on the endianness.
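
A minimal sketch of that idea, assuming the whole file has already been
slurped into a byte string. The name utf16_to_chars is hypothetical, and
surrogate pairs (characters above U+FFFF) are not handled:

```perl
use strict;
use warnings;

# Hypothetical helper: convert a UTF-16 byte string to Perl characters,
# choosing the unpack template from the byte order mark.
sub utf16_to_chars {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);
    my $template;
    if    ($bom eq "\xFF\xFE") { $template = "v*" }   # little-endian
    elsif ($bom eq "\xFE\xFF") { $template = "n*" }   # big-endian
    else  { die "No UTF-16 byte order mark found\n" }
    # Skip the two BOM bytes, then turn each 16-bit unit into one character.
    return pack "U*", unpack $template, substr($bytes, 2);
}
```

For the little-endian file discussed here this should give the same text
as the one-liner above, without needing the separate substr to drop the BOM.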

You will get more sane regexp behaviour if you use 5.8.0 rather than 5.6.1.
In 5.6.1 being in the scope of a "use utf8;" will make your regexps properly
Unicode, even if they don't contain obvious Unicode features.

(Otherwise matches involving . and similar metachars cause the regexp to think
in ASCII, and Unicode scalars are treated as a series of bytes.
5.8.0 fixes this problem: regexps "just work" there, modulo unknown bugs.)
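
On 5.8.0 and later the whole job could probably be done with the Encode
module rather than pack/unpack. The following is only a sketch under that
assumption: split_records is a hypothetical name, and it expects the input
to start with a byte order mark, because decoding as "UTF-16" relies on one
to pick the endianness:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode BOM-prefixed UTF-16 bytes, drop carriage
# returns, and split the text into records on blank lines.
sub split_records {
    my ($bytes) = @_;
    my $text = decode("UTF-16", $bytes);   # consumes the BOM
    $text =~ s/\r//g;                      # CRLF -> LF
    return grep { length } split /\n\n+/, $text;
}
```

After decoding, the string holds characters rather than bytes, so the
split on "\n\n" from the original question works without stripping nulls.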

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/
