Michael,

So, basically, you either need prior knowledge about the actual character encoding used, or you have to test. Testing for UTF-8 is fairly straightforward...

How are you testing for UTF-8?

There's a handy Perl regexp on the W3C web site at:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to change the ASCII part of the regexp to something like:

   [\x01-\x7e]

This will more than accommodate the various control characters you
can find in MARC records (don't forget Esc as the lead-in to Greek,
Cyrillic, etc.)
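
As a rough sketch, here is the W3C pattern with the widened ASCII
range, transcribed to Python's re module working on raw bytes. The
names UTF8_STRICT and is_valid_utf8 are made up for this example, and
Python is only a stand-in for whatever regexp engine you actually use.

```python
import re

# W3C well-formedness pattern, with the ASCII branch widened to
# [\x01-\x7e] so that MARC control characters (e.g. Esc, 0x1b) pass.
UTF8_STRICT = re.compile(rb"""
    (?:
        [\x01-\x7E]                        # ASCII, widened for MARC
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
""", re.VERBOSE)


def is_valid_utf8(data: bytes) -> bool:
    """True if every byte of data belongs to a well-formed sequence."""
    return UTF8_STRICT.fullmatch(data) is not None
```

Note that pure ASCII data also passes this test, so a match only
tells you the data is consistent with UTF-8.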

The W3C regexp tests the whole string -- which may be inefficient
if you are testing lots of data. Depending on what sort of accuracy
you want and whether or not overlong UTF-8 sequences are a concern,
you could just test for the following:

   [\xc2-\xf4][\x80-\xbf]
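
A minimal sketch of that quick test, again in Python on raw bytes.
The name has_utf8_pair is hypothetical. It only looks for one lead
byte followed by one continuation byte anywhere in the data, so it
is a hint rather than a validation.

```python
import re

# A UTF-8 lead byte (0xC2-0xF4) followed by a continuation byte.
UTF8_PAIR = re.compile(rb"[\xc2-\xf4][\x80-\xbf]")


def has_utf8_pair(data: bytes) -> bool:
    """True if data contains at least one lead+continuation pair."""
    return UTF8_PAIR.search(data) is not None


if __name__ == "__main__":
    print(has_utf8_pair("é".encode("utf-8")))  # UTF-8 bytes c3 a9
    print(has_utf8_pair(b"caf\xe9"))           # Latin-1 e-acute
    print(has_utf8_pair(b"Caf\xe2e"))          # MARC-8 acute, then e
```

Two adjacent Latin-1 accented characters (for example the bytes
c3 a9) will also match, which is the accuracy trade-off mentioned
above.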

The Wikipedia page on UTF-8 is worth a read.

Distinguishing Latin-1 from MARC-8 is a bit more like guesswork.
As a test for MARC-8 I look for the common combining diacritics
followed by a vowel.

Do you have a programmatic way to do that test, or are you "eye-balling" the
records?

I use a simple regexp:

  ([\xe1-\xe3][aeiouAEIOU]|\xf0[cC])

which may be rather too simple. For a critical application I'd come up
with something a bit better (after first eye-balling a load of records.)
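
For illustration, the same regexp as a Python sketch on raw bytes.
The name looks_like_marc8 is invented here. The bytes e1-e3 are the
MARC-8 combining grave, acute and circumflex, and f0 is the cedilla.
In MARC-8 the diacritic comes before the letter it modifies.

```python
import re

# Combining grave/acute/circumflex before a vowel, or cedilla before c.
MARC8_HINT = re.compile(rb"[\xe1-\xe3][aeiouAEIOU]|\xf0[cC]")


def looks_like_marc8(data: bytes) -> bool:
    """Heuristic only: True if a MARC-8 style diacritic pair is found."""
    return MARC8_HINT.search(data) is not None


if __name__ == "__main__":
    print(looks_like_marc8(b"Caf\xe2e"))      # MARC-8: acute, then e
    print(looks_like_marc8(b"fran\xf0cais"))  # MARC-8: cedilla, then c
    print(looks_like_marc8(b"Caf\xe9"))       # Latin-1 e-acute
    print(looks_like_marc8(b"S\xe3o"))        # Latin-1 a-tilde, then o
```

The last case shows how the test can misfire: in Latin-1 the byte e3
is a-tilde, so a word like "S\xe3o" is reported as MARC-8. Counting
hits over many records, rather than trusting a single match, would
be one way to reduce that.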

Just as an aside, I'm not using Perl -- I'm using the Boost.Regex
library for C++ (which is a good implementation of Perl regexps.)

Regards,

Ashley.
--
Ashley Sanders               [EMAIL PROTECTED]
Copac http://copac.ac.uk A MIMAS Service funded by JISC
