Ed,

On Fri, Jun 17, 2011 at 10:53:00AM +0100, Edmund Chamberlain wrote:
> Firstly, hello! It's my first time posting and possibly somewhat
> predictably with a call for help with Unicode stuff.

Ah, yes...

> I've just checked the archive and seen this thread and am having a
> similar problem, a badly encoded character is causing a while loop
> through MARC::Batch->next to crash out with:
>
> utf8 "\x87" does not map to Unicode at
> /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173.
>
> I've tried pasting Al's modified decode subroutine and package to the
> script, but it is still failing. One of the offending records is
> isolated and attached.
>
> Any suggestions welcome with regards to further modifying the sub or
> alternatives to MARC::Batch->next would be welcome. For the scope of the
> project, I'm limited to large batch files of Marc21.

You could try MARC::Loop, which doesn't care about (or even detect)
character (mis)codings in any way -- it just sees everything as raw
bytes.  Then you can go in and wipe out (or fix) bad byte sequences,
something like this:

    use strict;
    use warnings;
    use MARC::Loop qw(marcloop marcbuild);

    marcloop {
        # This code is called once for each record read from standard input
        my ($leader, $fields, $rawref) = @_;
        if ($$rawref =~ /[^\x00-\x7f]/) {
            # The record contains one or more non-ASCII bytes
            foreach my $field (@$fields) {
                my ($tag, $valref) = @$field;
                $$valref =~ s{
                    (
                        # Valid byte sequences for a single character:
                        [\x{00}-\x{7f}]
                        |
                        [\x{c2}-\x{df}][\x{80}-\x{bf}]
                        |
                        \x{e0}         [\x{a0}-\x{bf}][\x{80}-\x{bf}]
                        |
                        [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |
                        \x{ed}         [\x{80}-\x{9f}][\x{80}-\x{bf}]
                        |
                        [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |
                        \x{f0}         [\x{90}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |
                        [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |
                        \x{f4}         [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                    )
                    |
                    (
                        # Oops! -- invalid byte sequence, we assume all
                        # non-ASCII bytes starting here are bad
                        [^\x00-\x7f]+
                    )
                }{
                    if (defined $2) {
                        # Substitute a "fixed" version of the bad byte sequence
                        print STDERR "Fixing bad byte sequence in field $tag of record $.\n";
                        fixed($2);
                    }
                    else {
                        # Leave it unchanged
                        $1;
                    }
                }xeg;
            }
            print marcbuild($leader, $fields);
        }
        else {
            print $$rawref;
        }
    } \*STDIN;

    sub fixed {
        my ($str) = @_;
        # There are any number of actions you might take to "fix" invalid byte
        # sequences; this is just one
        return '?';
    }

The downside is that MARC::Loop is a separate Perl module you'll have to
download from CPAN and install manually.  I wrote it, so I'm biased, but
I think it's good for people who prefer to (or have to) work closer to
the raw MARC record.  (Fixing miscoded records that MARC::Record et al.
couldn't handle was what motivated me to write it in the first place.)

Paul.

-- 
Paul Hoffman <nkui...@nkuitse.com>
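
P.S. If you'd like to check the substitution on its own before running
it over a whole batch, here is a minimal sketch that applies the same
kind of pattern to a plain byte string, with no MARC::Loop involved.
The name scrub_bytes is a hypothetical helper (it isn't part of any
module), and replacing each bad run with '?' is only one possible
policy.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: replace every run of bytes
# that is not well-formed UTF-8 with a single '?'.
sub scrub_bytes {
    my ($bytes) = @_;
    $bytes =~ s{
        (
            # Well-formed byte sequences for a single character
            [\x00-\x7f]
          | [\xc2-\xdf][\x80-\xbf]
          | \xe0[\xa0-\xbf][\x80-\xbf]
          | [\xe1-\xec][\x80-\xbf]{2}
          | \xed[\x80-\x9f][\x80-\xbf]
          | [\xee-\xef][\x80-\xbf]{2}
          | \xf0[\x90-\xbf][\x80-\xbf]{2}
          | [\xf1-\xf3][\x80-\xbf]{3}
          | \xf4[\x80-\x8f][\x80-\xbf]{2}
        )
      |
        (
            # Anything else: treat the whole non-ASCII run as bad
            [^\x00-\x7f]+
        )
    }{ defined $2 ? '?' : $1 }xeg;
    return $bytes;
}

# The stray \x87 byte is replaced; the valid two-byte sequence
# \xc3\xa9 (e-acute) passes through untouched.
print scrub_bytes("caf\xc3\xa9 \x87 ok"), "\n";
```

One thing this makes easy to see: because the second branch grabs every
non-ASCII byte that follows a bad one, a valid character directly after
a bad byte is swallowed too (so "\x87\xc3\xa9" comes back as a single
'?').  If that is too aggressive for your data, matching one bad byte
at a time would be a gentler variant.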