This has nothing to do with Perl versions.

MARC::Record 1.38 and earlier do not show this problem.
MARC::Record 2.0.0, the so-called Unicode version, introduced the problem you
describe.
It appears when writing records, causing an incorrect leader length and
corrupted UTF-8.
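
A minimal sketch of the symptom, using only core Perl and no MARC modules. It assumes that MARC::Record 2.0.0 ends up holding decoded character strings for records flagged as UTF-8 (not verified against every code path here); written_hex() is a made-up helper for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: hex dump of what lands on disk when a string is
# printed to a handle that has no encoding layer.
sub written_hex {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:raw', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return unpack 'H*', $buffer;
}

my $octets = "\xC3\xA1";                 # a-acute as stored in the MARC file
my $chars  = decode('UTF-8', $octets);   # the same letter as one character

printf "octets: length %d, written as %s\n", length($octets), written_hex($octets);
printf "chars:  length %d, written as %s\n", length($chars),  written_hex($chars);
```

The character string is one unit shorter than the octets that belong on disk, which would account for a wrong leader length, and printing it without an encoding layer yields the lone \xE1 from the original report.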

There are different ways to deal with this.
I have patched one of the modules myself.

MARC::File::USMARC
It has a function called encode() around line 315.
I added a "use bytes;" just before the final return, like this:

use bytes;
return join("", $marc->leader, @$directory, END_OF_FIELD, @$fields, END_OF_RECORD);

Changing installed code directly like this is a definite no-no for many
programmers.
If you feel uncomfortable with it, there are other ways to get the same
result.
You could write a package:

package MARC_Record_hack;
use MARC::File::USMARC;
no warnings 'redefine';
sub MARC::File::USMARC::encode() {
    my $marc = shift;
    $marc = shift if (ref($marc)||$marc) =~ /^MARC::File/;
    my ($fields, $directory, $reclen, $baseaddress) =
        MARC::File::USMARC::_build_tag_directory($marc);
    $marc->set_leader_lengths( $reclen, $baseaddress );
    # Glomp it all together
    use bytes;
    return join("",$marc->leader, @$directory, "\x1E", @$fields, "\x1D");
}
use warnings;
1;
__END__

With this package loaded, your original code should work fine, I would
guess.
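
The mechanism the package relies on, replacing a sub inside another package's namespace, can be sketched without MARC::Record. Demo::Codec and its encode() below are hypothetical stand-ins for the real module, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module being patched.
package Demo::Codec;
sub encode { return "original" }

package main;

{
    # Silence the "Subroutine redefined" warning for this block only.
    no warnings 'redefine';
    # The fully qualified name installs the sub in Demo::Codec, not in main.
    sub Demo::Codec::encode { return "patched" }
}

# Named subs are installed at compile time, so the later definition is the
# one every caller sees, including method-style calls.
my $result = Demo::Codec->encode();
print "$result\n";
```

Because the replacement lives in a separate file, the installed module stays untouched and the patch disappears as soon as the hack package is no longer loaded.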


use MARC::Batch;
use MARC_Record_hack;
my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();
$batch->warnings_off();
#binmode( STDOUT, ':raw' );
#binmode STDOUT;
my $record = $batch->next;
print $record->as_usmarc;


As a habit I use
binmode FH;
when I write records to a file.
It is not strictly needed, but it keeps me from being tempted to make any
other assumptions about character encodings.
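
A small sketch of that habit, again with core modules only. The $record_octets string is a hypothetical stand-in for what as_usmarc returns, and the temporary file is used purely for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a serialized record: plain octets, including a
# UTF-8 pair and the MARC field and record terminators.
my $record_octets = "caf\xC3\xA9\x1E\x1D";

my ($fh, $filename) = tempfile(UNLINK => 1);
binmode $fh;    # no layers: no CRLF translation, no encoding step
print {$fh} $record_octets;
close $fh or die "cannot close $filename: $!";

# The file should match the string octet for octet.
printf "%d octets in memory, %d on disk\n",
    length($record_octets), (-s $filename);
```

With the handle in binary mode the writer never has to decide on an encoding; whatever octets the record contains are what reach the file.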

/Leif Andersson
Stockholm University Library

________________________________________
From: Al [ra...@berkeley.edu]
Sent: 12 October 2010 00:03
To: perl4lib@perl.org
Subject: MARC-perl: different versions yield different results

Example marc record is here:
http://www.mediafire.com/file/u5cxkrfwh9ew09z/example.zip

When I process the record above in perl 5.8, MARC::Record version 1.38, and
Encode.pm version 2.12, the record comes out fine.

When I use perl 5.10, MARC::Record version 2.0.0, and Encode.pm 2.40 the
record comes out corrupted and MARC::Record will no longer read the result.

The problem is with a Unicode character (big surprise). The earlier version
leaves the \xC3\xA1 byte pair intact, the later version changes it to a lone
\xE1, which is invalid UTF-8. I've read as many of the perl4lib messages on
the subject
of UTF-8 as I could but my eyes are spinning. I'm hoping by including a
complete but simple perl program and making a MARC record available that
somebody can explain to me in detail what is going on. My inclination is to
simply revert to the earlier version of perl but perhaps if I really
understood the issue that may not be necessary.

Here is the test program I use:

use MARC::Batch;
my $batch = new MARC::Batch('USMARC', $ARGV[0]);
$batch->strict_off ();
$batch->warnings_off ();
#binmode( STDOUT, ':utf8' );
my $record = $batch->next;
print $record->as_usmarc;

Run the program on the record, then run it again on the output and the
second time perl quits with an error:

utf8 "\xE1" does not map to Unicode at Encode.pm line 174.

That should not happen.

Why the different behavior with the different versions? I can't see
anything wrong with the original record - it's valid UTF8 as far as I can
tell. Leader byte 9 is correctly set to 'a'. Uncommenting the binmode line
seems to work - the character is output unchanged as is supposed to happen.
The problem is my record batches are a mixture of UTF8 and MARC8 and
explicitly setting binmode screws things up. I need a solution that
transparently handles a mix of record encodings.

I rather suspect the problem is with Encode.pm and not MARC perl but I
can't be sure. It also may be due to the way perl handles IO between
version 5.8 and 5.10. BTW the problem happens on Windows and Unix.

Thanks for any advice you can give me,

Al
