This has nothing to do with Perl versions. MARC::Record 1.38 and earlier do not show this problem. MARC::Record 2.0.0, the so-called Unicode version, introduced the problem you describe. It appears when writing records, producing an incorrect leader length and corrupted UTF-8.
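To make the two symptoms concrete, here is a minimal sketch of the general Perl behaviour that is consistent with them. It is not a trace through MARC::Record itself. It uses only the core Encode module; the helper hex_dump and the in-memory handle standing in for STDOUT are my own illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "á" as stored in a UTF-8 MARC record: two bytes, C3 A1.
my $bytes = "\xC3\xA1";

# After decoding, Perl holds one character, U+00E1.
my $chars = decode('UTF-8', $bytes);

# length() counts characters on a decoded string, so a record length
# computed from it comes out one short for every such character.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Printing the decoded string to a handle that has no encoding layer
# writes U+00E1 as the single byte E1. An in-memory handle stands in
# for STDOUT here (illustration only).
my $buf = '';
open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";
print {$fh} $chars;
close $fh;
printf "written without a layer: %s\n", hex_dump($buf);

# Encoding explicitly before printing restores the two-byte form.
my $out = encode('UTF-8', $chars);
printf "written after encode():  %s\n", hex_dump($out);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}
```

The lone E1 byte is not valid UTF-8 on its own, which matches the error reported when the output is read back in.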
There are different ways to deal with this. I have changed one of the modules myself, MARC::File::USMARC. It has a function called encode() around line 315, and I have added a "use bytes;" just before the final return. Like this:

    use bytes;
    return join("",$marc->leader, @$directory, END_OF_FIELD, @$fields, END_OF_RECORD);

Changing the installed code directly like this is a total "no-no" to many programmers. If you feel uncomfortable with it, there are other ways of doing the same thing. You could write a package:

    package MARC_Record_hack;
    use MARC::File::USMARC;

    no warnings 'redefine';
    sub MARC::File::USMARC::encode() {
        my $marc = shift;
        $marc = shift if (ref($marc)||$marc) =~ /^MARC::File/;

        my ($fields,$directory,$reclen,$baseaddress) = MARC::File::USMARC::_build_tag_directory($marc);
        $marc->set_leader_lengths( $reclen, $baseaddress );

        # Glomp it all together
        use bytes;
        return join("",$marc->leader, @$directory, "\x1E", @$fields, "\x1D");
    }
    use warnings;
    1;
    __END__

With this package included, your original code should work fine, I'd guess.

    use MARC::Batch;
    use MARC_Record_hack;

    my $batch = new MARC::Batch('USMARC', $ARGV[0]);
    $batch->strict_off ();
    $batch->warnings_off ();
    #binmode( STDOUT, ':raw' );
    #binmode STDOUT;
    my $record = $batch->next;
    print $record->as_usmarc;

As a habit I use

    binmode FH;

when I write records to file. It is not needed, but it keeps me from the temptation of making any other assumptions about character encodings.

/Leif Andersson
Stockholm University Library

________________________________________
From: Al [ra...@berkeley.edu]
Sent: 12 October 2010 00:03
To: perl4lib@perl.org
Subject: MARC-perl: different versions yield different results

Example marc record is here:
http://www.mediafire.com/file/u5cxkrfwh9ew09z/example.zip

When I process the record above in perl 5.8, MARC::Record version 1.38, and Encode.pm version 2.12, the record comes out fine.
When I use perl 5.10, MARC::Record version 2.0.0, and Encode.pm 2.40, the record comes out corrupted and MARC::Record will no longer read the result.

The problem is with a Unicode character (big surprise). The earlier version leaves the \xC3\xA1 character intact; the later version changes it to \xE1, which is invalid.

I've read as many of the perl4lib messages on the subject of UTF-8 as I could, but my eyes are spinning. I'm hoping that by including a complete but simple perl program and making a MARC record available, somebody can explain to me in detail what is going on. My inclination is to simply revert to the earlier version of perl, but perhaps if I really understood the issue that would not be necessary.

Here is the test program I use:

    use MARC::Batch;

    my $batch = new MARC::Batch('USMARC', $ARGV[0]);
    $batch->strict_off ();
    $batch->warnings_off ();
    #binmode( STDOUT, ':utf8' );
    my $record = $batch->next;
    print $record->as_usmarc;

Run the program on the record, then run it again on the output, and the second time perl quits with an error:

    utf8 "\xE1" does not map to Unicode at Encode.pm line 174.

That should not happen. Why the different behavior with the different versions? I can't see anything wrong with the original record; it's valid UTF8 as far as I can tell. Leader byte 9 is correctly set to 'a'.

Uncommenting the binmode line seems to work: the character is output unchanged, as is supposed to happen. The problem is that my record batches are a mixture of UTF8 and MARC8, and explicitly setting binmode screws things up. I need a solution that transparently handles a mix of record encodings.

I rather suspect the problem is with Encode.pm and not MARC perl, but I can't be sure. It may also be due to a difference in the way perl handles IO between versions 5.8 and 5.10.

BTW the problem happens on both Windows and Unix.

Thanks for any advice you can give me,
Al