In Perl, something like this might do the trick:
# Fix non-UTF-8 characters with two highest bits set (we assume they are actually ISO-8859-1)
# Rule: there can't be a single byte with the high bits set followed by a byte in range 00-7F or C0-FF
$str =~ s/([\xC0-\xFF])(?=[\x00-\x7f\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
No wrapping there to keep it single-line. :)
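To make the effect of that substitution concrete, here is a minimal sketch that wraps it in a sub and runs it on a byte string containing a Latin-1 e-acute (0xE9). The sub name latin1_high_bytes_to_utf8 is made up for this illustration. Note that the lookahead needs a following byte, so a high byte at the very end of the string is left alone, and stray bytes in the 80-BF range are not touched either.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name invented for this example) around the
# substitution above. Works on byte strings, not decoded character strings.
sub latin1_high_bytes_to_utf8 {
    my ($str) = @_;
    # A byte C0-FF that is NOT followed by a continuation byte (80-BF)
    # cannot be a UTF-8 lead byte, so treat it as ISO-8859-1 and expand
    # it into the equivalent two-byte UTF-8 sequence.
    $str =~ s/([\xC0-\xFF])(?=[\x00-\x7F\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
    return $str;
}

# Latin-1 "Musée": the lone 0xE9 becomes the UTF-8 pair C3 A9.
my $fixed = latin1_high_bytes_to_utf8("Mus\xE9e du Louvre");

# Already valid UTF-8 is left as it was, because C3 is followed by A9.
my $same = latin1_high_bytes_to_utf8("Mus\xC3\xA9e du Louvre");

print "converted and already-valid inputs agree\n" if $fixed eq $same;
```

Since the replacement is computed arithmetically from the byte value, the same expression covers the whole C0-FF range without a lookup table.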
--Ere
On 7.10.2010 14:56, Cowles, Esme wrote:
Eric-
I don't know the original source of those MARC files, but I've worked
with files from an III system where diacritics had to be entered as
character code escapes like "Muse{226}e du Louvre" (where 226 is the
ANSEL code for a combining acute accent). So if somebody made a typo
and entered something like "Muse{22}6e du Louvre" instead, you'd get
some bogus invalid character. I was working with MARCXML files in
Java, so I wrote a FilterReader class that removed any characters
that were invalid in UTF-8 XML. I assume you could do something
similar in Perl (probably with a fancy one-line regex).
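A rough Perl counterpart to that kind of filter might look like the sketch below. It is not the Java FilterReader itself; the sub name strip_invalid_xml_chars is invented here, and the sample input assumes the {22} escape is a decimal code, which would yield the control character 0x16. The character class follows the XML 1.0 definition of a legal character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (name invented for this example): decode a byte
# string as UTF-8 and drop every character XML 1.0 does not allow.
sub strip_invalid_xml_chars {
    my ($bytes) = @_;

    # With no CHECK argument, malformed byte sequences are replaced by
    # U+FFFD instead of raising an error.
    my $text = decode('UTF-8', $bytes);

    # Keep only tab, LF, CR, and the ranges XML 1.0 permits.
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

    return $text;
}

# Assumed result of the typo "Muse{22}6e": a stray 0x16 control character.
my $clean = strip_invalid_xml_chars("Muse\x166e du Louvre");
print "$clean\n";
```

This only removes characters that are illegal in XML; it cannot restore the accent the typo destroyed.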
-Esme
--
Esme Cowles <[email protected]>
"We've all heard that a million monkeys banging on a million
typewriters will eventually reproduce the works of Shakespeare. Now,
thanks to the Internet, we know this is not true." -- Robert
Wilensky
On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:
How do I trap for unwanted (bogus) characters in MARC records?
I have a set of Internet Archive identifiers, and have written the
following Perl loop to get the MARC records associated with each
one:
# process each identifier
my $ua = LWP::UserAgent->new( agent => AGENT );
while ( <DATA> ) {

  # get the identifier
  chop;
  my $identifier = $_;
  print $identifier, "\n";

  # get its corresponding MARC record
  my $response = $ua->get( ROOT . "$identifier/$identifier" . "_meta.mrc" );
  if ( ! $response->is_success ) {
    warn $response->status_line;
    next;
  }

  # save it
  open MARC, "> $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
  binmode MARC, ":utf8";
  print MARC $response->content;
  close MARC;

}
I then use the venerable marcdump to see the fruits of my labors:
marcdump *.mrc. Unfortunately, marcdump returns the following error
against (at least) one of my files:

  bienfaitsducatho00pina.mrc
  utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
What is going on here? Am I saving my files incorrectly? Is the
original MARC data inherently incorrect? Is there some way I can
fix the MARC record in question?
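On the saving question, one thing that can be checked in isolation: $response->content returns raw bytes, and printing raw bytes through a ":utf8" layer re-encodes every byte above 0x7F as two bytes. The sketch below shows that effect with an in-memory file. The helper name bytes_written_via is made up for the demonstration, and whether this is what triggers the marcdump error for this particular record is an assumption, not something the error message confirms.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): write $data to
# an in-memory file through the given PerlIO layer and return the bytes
# that actually ended up in the "file".
sub bytes_written_via {
    my ($layer, $data) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "Can't open in-memory file: $!";
    print {$fh} $data;
    close $fh;
    return $buffer;
}

# Two bytes that are already the UTF-8 encoding of e-acute.
my $downloaded = "\xC3\xA9";

my $raw  = bytes_written_via(':raw',  $downloaded);
my $utf8 = bytes_written_via(':utf8', $downloaded);

printf "raw layer wrote %d bytes, utf8 layer wrote %d bytes\n",
    length $raw, length $utf8;
```

Because MARC leaders and directories store byte lengths and offsets, any layer that changes the byte count on the way to disk can leave a record that no longer parses cleanly, so writing with ":raw" is the more conservative choice when the content is to be saved exactly as downloaded.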
-- Eric Lease Morgan
--
Ere Maijala
Kansalliskirjasto