Re: Invalid UTF-8 characters causing MARC::Record crash.

Paul Hoffman Fri, 17 Jun 2011 11:27:09 -0700

Ed,

On Fri, Jun 17, 2011 at 10:53:00AM +0100, Edmund Chamberlain wrote:
> Firstly, hello! It's my first time posting and, somewhat
> predictably, it's a call for help with Unicode stuff.


Ah, yes...

> I've just checked the archive and seen this thread and am having a
> similar problem: a badly encoded character is causing a while loop
> through MARC::Batch->next to crash out with:
> 
> utf8 "\x87" does not map to Unicode at 
> /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173.
> 
> I've tried pasting Al's modified decode subroutine and package to the 
> script, but it is still failing. One of the offending records is 
> isolated and attached.
> 
> Any suggestions for further modifying the sub, or for alternatives
> to MARC::Batch->next, would be welcome. For the scope of the
> project, I'm limited to large batch files of Marc21.

You could try MARC::Loop, which doesn't care about (or even detect)
character (mis)codings in any way -- it just sees everything as raw
bytes.  Then you can go in and wipe out (or fix) bad byte sequences,
something like this:

    use MARC::Loop qw(marcloop marcbuild);
    marcloop {
        # This code is called once for each record read from standard input
        my ($leader, $fields, $rawref) = @_;
        if ($$rawref =~ /[^\x00-\x7f]/) {
            # The record contains one or more non-ASCII bytes
            foreach my $field (@$fields) {
                my ($tag, $valref) = @$field;
                $$valref =~ s{
                    (   # Valid byte sequences for a single character:
                          [\x{00}-\x{7f}]
                        | [\x{c2}-\x{df}][\x{80}-\x{bf}]
                        |         \x{e0} [\x{a0}-\x{bf}][\x{80}-\x{bf}]
                        | [\x{e1}-\x{ec}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |         \x{ed} [\x{80}-\x{9f}][\x{80}-\x{bf}]
                        | [\x{ee}-\x{ef}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |         \x{f0} [\x{90}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        | [\x{f1}-\x{f3}][\x{80}-\x{bf}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                        |         \x{f4} [\x{80}-\x{8f}][\x{80}-\x{bf}][\x{80}-\x{bf}]
                    )
                    |
                    (
                        # Oops! -- invalid byte sequence, we assume all
                        # non-ASCII bytes starting here are bad
                        [^\x00-\x7f]+
                    )
                }{
                    if (defined $2) {
                        # Substitute a "fixed" version of the bad byte sequence
                        print STDERR "Fixing bad byte sequence in field $tag "
                                   . "of record $.\n";
                        fixed($2);
                    }
                    else {
                        # Leave it unchanged
                        $1;
                    }
                }xeg;
            }
            print marcbuild($leader, $fields);
        }
        else {
            print $$rawref;
        }
    } \*STDIN;
    sub fixed {
        my ($str) = @_;
        # There are any number of actions you might take to "fix" invalid byte
        # sequences; this is just one
        return '?';
    }
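
To try the substitution on its own before running it over a whole batch, here is a minimal sketch that applies the same byte ranges to a plain byte string, using core Perl only and no MARC::Loop.  The names scrub_utf8 and $valid_char are made up for this illustration, and replacing each bad run with '?' is just one possible policy.

```perl
use strict;
use warnings;

# Well-formed UTF-8 byte sequences for a single character
# (hypothetical name; same ranges as in the loop above).
my $valid_char = qr{
      [\x00-\x7f]
    | [\xc2-\xdf][\x80-\xbf]
    |  \xe0      [\xa0-\xbf][\x80-\xbf]
    | [\xe1-\xec][\x80-\xbf][\x80-\xbf]
    |  \xed      [\x80-\x9f][\x80-\xbf]
    | [\xee-\xef][\x80-\xbf][\x80-\xbf]
    |  \xf0      [\x90-\xbf][\x80-\xbf][\x80-\xbf]
    | [\xf1-\xf3][\x80-\xbf][\x80-\xbf][\x80-\xbf]
    |  \xf4      [\x80-\x8f][\x80-\xbf][\x80-\xbf]
}x;

# Hypothetical helper: returns a copy of the byte string in which each
# run of bytes that does not start a well-formed character is replaced
# by a single '?'.
sub scrub_utf8 {
    my ($bytes) = @_;
    $bytes =~ s{ ($valid_char) | ([^\x00-\x7f]+) }
               { defined $2 ? '?' : $1 }xeg;
    return $bytes;
}

# "\xc3\xa9" is a well-formed two-byte character; the lone \x87 is not.
my $in  = "caf\xc3\xa9 \x87 end";
my $out = scrub_utf8($in);
# $out is now "caf\xc3\xa9 ? end"
print "$out\n";
```

Note the trade-off in the second alternative: because it swallows every non-ASCII byte from the first bad one onward, a well-formed character that directly follows a bad byte (with no ASCII byte in between) gets replaced along with it.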

The downside is that MARC::Loop is a separate Perl module you'll have to
download from CPAN and install manually.  I wrote it, so I'm biased, but
I think it's good for people who prefer to (or have to) work closer to
the raw MARC record.  (Fixing miscoded records that MARC::Record et al.
couldn't handle was what motivated me to write it in the first place.)

Paul.

-- 
Paul Hoffman <[email protected]>
