On Thu, Aug 15, 2002 at 05:28:43PM -0400, David Gray wrote:
> > I'm having a bit of a problem getting Unicode pattern
> > matching to do what I would like it to.
>
> I guess my question wasn't entirely clear. I'm reading in the attached
> file and trying to split it on "\n\n".
>
> When I'm looping over the file,
>
> > I've (sort of) made it work by doing:
> >
> > # strip BOM and trailing nulls and carriage returns
> > s/^..// if $. == 1 and s/\0//g;
> > s/[\0\r]//g;
>
> The two-byte BOM has me thinking it's probably UTF-16. Is there an easy
> way to tell what encoding a file uses?

Not that I know of, but all the 0 bytes make me think it is.

> > But I'm sure there must be a more elegant way to do this.
> > Honestly, I'm not even sure where to start. Any ideas?

I find that this:

perl5.6.1 -we 'undef $/; $_=<STDIN>; $_ = pack "U*", unpack "v*", $_; substr ($_, 0, 1) = ""; print $_' </tmp/unicode.txt

gives me this:

fdn "grp1",55,"","",0
fdn "grp2",55,"","",0
fdn "grp3",55,"","",0
fdn "grp4",55,"","",0
fdn "grp5",55,"","",0
fdn "TEMP",55,"","",0

The substr takes out the byte order mark. I guess a better conversion
script would read the first two bytes and, if they look like a UTF-16
byte order mark, choose whether to use v or n in the unpack based on the
endianness.

You will get more sane regexp behaviour if you use 5.8.0 rather than 5.6.1.

In 5.6.1, being in the scope of a "use utf8;" will make your regexps
properly Unicode, even if they don't contain obvious Unicode features.
(Otherwise matches involving . and similar metachars cause the regexp to
think in ASCII, and Unicode scalars are treated as a series of bytes.)
5.8.0 fixes this problem - regexps "just work" there, modulo unknown bugs.

Nicholas Clark
-- 
Even better than the real thing: http://nms-cgi.sourceforge.net/
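
For what it's worth, the BOM-aware conversion suggested above (picking v or n for the unpack from the first two bytes) might look roughly like this. It is only a sketch: the helper name utf16_to_chars and the sample data are made up for illustration, it gives up rather than guessing when there is no BOM, and like the one-liner it does not handle surrogate pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a UTF-16 byte string into Perl characters,
# choosing the unpack template from the byte order mark.
# Assumes every character fits in one 16-bit unit (no surrogate pairs).
sub utf16_to_chars {
    my ($bytes) = @_;
    my $bom = substr($bytes, 0, 2);
    my $template;
    if    ($bom eq "\xFF\xFE") { $template = 'v*' }   # little-endian
    elsif ($bom eq "\xFE\xFF") { $template = 'n*' }   # big-endian
    else                       { return }             # no BOM: don't guess
    # Skip the two BOM bytes, then repack the 16-bit units as characters.
    return pack 'U*', unpack $template, substr($bytes, 2);
}

# Made-up sample: two records separated by a blank line, CRLF line
# endings, encoded as little-endian UTF-16 with a BOM.
my $sample = "\xFF\xFE" . pack 'v*', map { ord } split //, "fdn\r\n\r\nTEMP";

my $text = utf16_to_chars($sample);
$text =~ s/\r//g;                      # drop carriage returns
my @records = split /\n\n/, $text;     # split on blank lines
print join('|', @records), "\n";
```

For a real file you would binmode the filehandle and slurp it (undef $/) before passing the contents to the helper, since the BOM check works on raw bytes.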