Re: printing UTF-8 encoded MARC records with as_usmarc
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack wrote:

> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> sounds like once I set binmode to UTF-8 everything will be interpreted as
> such, even when the record is in MARC-8. Is that right? So this means that
> I can only use my script with a file of records where all of them are
> encoded in UTF-8. If I want to run the script against a file with all
> MARC-8 encoding, then I'd need to remove the binmode line.

Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8
conversion (it's much faster):

    yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in >marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore
Re: printing UTF-8 encoded MARC records with as_usmarc
The problem was I wasn't telling perl to output UTF-8. Now that I added
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
sounds like once I set binmode to UTF-8 everything will be interpreted as
such, even when the record is in MARC-8. Is that right? So this means that
I can only use my script with a file of records where all of them are
encoded in UTF-8. If I want to run the script against a file with all
MARC-8 encoding, then I'd need to remove the binmode line.

It doesn't seem possible to say:

    if ($record->encoding() eq 'UTF-8') {
        binmode(FILE, ':utf8');
        print FILE $record->as_usmarc();
    } else {
        print FILE $record->as_usmarc();
    }

This will result in messing up the diacritics if a file has a mixture of
records in MARC-8 and UTF-8. Is that correct?

Thanks,
Shelley

----- Original Message -----
> From: "William Dueber"
> To: "Shelley Doljack"
> Cc: perl4lib@perl.org
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
>
> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.
>
> You need to tell perl that you'll be outputting UTF-8 using binmode:
>
>     binmode(FILE, ':utf8');
>
> In general, you'll want to do this to basically every file you open
> for reading or writing.
>
> A great overview of Perl and UTF-8 can be found at:
>
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
>
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <sdolj...@stanford.edu> wrote:
>
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error: "utf8
> > \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
> >
> > The basic outline of my script is:
> >
> >     my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> >     while (my $record = $batch->next()) {
> >         #do some checks
> >         #if checks ok then
> >         print FILE $record->as_usmarc();
> >     }
> >
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
>
> --
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan
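For a file that mixes MARC-8 and UTF-8 records, one option is to leave the output handle in raw mode and encode each record explicitly, rather than switching layers with binmode partway through. The following is only a sketch in core Perl, not how MARC::Record itself works. `leader_encoding` and `record_bytes` are hypothetical helpers, and the two "records" are made-up stand-in strings (a 24-character leader plus a name), not valid MARC. It relies on leader position 09 being 'a' for UCS/Unicode and blank for MARC-8, and it assumes that a UTF-8 record reaches the print statement as a Perl character string (which the "Wide character in print" warning suggests) while a MARC-8 record reaches it as plain bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: MARC 21 leader position 09 is 'a' for UCS/Unicode
# records and blank for MARC-8 records.
sub leader_encoding {
    my ($leader) = @_;
    return substr($leader, 9, 1) eq 'a' ? 'UTF-8' : 'MARC-8';
}

# Hypothetical helper: return the bytes to write for one serialized record.
# Character strings from UTF-8 records are encoded; everything else is
# passed through untouched so MARC-8 bytes are never re-encoded.
sub record_bytes {
    my ($serialized) = @_;
    my $encoding = leader_encoding(substr($serialized, 0, 24));
    if ($encoding eq 'UTF-8' && utf8::is_utf8($serialized)) {
        return encode('UTF-8', $serialized);
    }
    return $serialized;
}

# Stand-in data (assumption): leader plus the name "Dvorak" with diacritics.
my $utf8_record  = "00000nam a2200000 a 4500" . "Dvo\x{159}\x{E1}k";   # characters
my $marc8_record = "00000nam  2200000 a 4500" . "Dvo\x{E9}r\x{E2}ak";  # MARC-8 bytes

# An in-memory file keeps the sketch self-contained; a real script would
# open the output file with the same '>:raw' mode.
my $buffer = '';
open(my $out, '>:raw', \$buffer) or die "Cannot open in-memory file: $!";
print {$out} record_bytes($_) for ($utf8_record, $marc8_record);
close($out) or die "Cannot close in-memory file: $!";

printf "wrote %d bytes\n", length($buffer);
```

Because the handle never carries an encoding layer, the order of UTF-8 and MARC-8 records in the input does not matter; each record is written in the encoding its own leader declares.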
RE: printing UTF-8 encoded MARC records with as_usmarc
I recently came across a nasty issue with MARC::Record to do with output of
MARC-8 encoded records. I was converting XML (which was in UTF-8) into MARC
records using MARC::Record and had initially, and successfully, got good
UTF-8 encoded MARC records out at the end. However, I then could not load
them into our LMS, and realised it was going to be easier at the LMS end if
the records were presented in MARC-8.

While the Perl modules largely worked and I got the right MARC-8
representation out at the end, the record length and the field offsets and
lengths in the directory got in a real mess, because the top-bit-set
characters in MARC-8 got counted as though they were code-points 0x80 to
0xFF encoded as two bytes of UTF-8.

I found a solution by hackily recalculating the lengths when needed, but I
thought I'd mention it as the thread has touched on this area.

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941

> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack wrote:
>
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is correctly
> > encoded in UTF-8 and I run my script from windows command prompt, this
> > warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print to a
> > new file using as_usmarc). I compared the extracted record before and
> > after in MarcEdit and the diacritic was changed. I tried marcdump
> > newfile.mrc to see what happens and I get this error: "utf8 \xF4 does
> > not map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> > extraction script again with MARC-8 encoded data then I don't have the
> > same problem.
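The length miscount described above can be reproduced in a few lines of core Perl. This is a sketch of the general effect, not a diagnosis of what MARC::Record did in this particular case. The two-byte field is a made-up stand-in (MARC-8 combining acute 0xE2 followed by "e"), and using `utf8::downgrade` before measuring is an assumption about one possible fix, not the recalculation Matthew actually used.

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length() without switching on byte semantics

# Stand-in MARC-8 field data (assumption): combining acute (0xE2) + "e".
my $field = "\x{E2}" . "e";

# If the string gets upgraded to Perl's internal UTF-8 form, the 0xE2
# character is stored as two bytes even though it is still one character.
my $upgraded = $field;
utf8::upgrade($upgraded);

my $chars    = length($upgraded);          # counts characters
my $internal = bytes::length($upgraded);   # counts internal storage bytes

# Downgrading restores one byte per character, which is what a MARC-8
# file on disk contains and what the directory lengths must describe.
my $fixed = $upgraded;
utf8::downgrade($fixed);
my $on_disk = bytes::length($fixed);

print "characters: $chars, internal bytes: $internal, bytes on disk: $on_disk\n";
# prints "characters: 2, internal bytes: 3, bytes on disk: 2"
```

Any length written into the leader or directory from the internal byte count would be one too large for every top-bit-set character in the field, which matches the symptom reported.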
RE: printing UTF-8 encoded MARC records with as_usmarc
I just recently came across this presentation which lays out pretty much
all the issues with Unicode in perl, and makes some recommendations for
best practices. You may find some general insight into the whole situation
by going over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

-----Original Message-----
From: William Dueber [mailto:dueb...@umich.edu]
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell perl that you'll be outputting UTF-8 using binmode:

    binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: "Wide character in print at
> record_extraction.pl line 99" (the line in my script where I print to a
> new file using as_usmarc). I compared the extracted record before and
> after in MarcEdit and the diacritic was changed. I tried marcdump
> newfile.mrc to see what happens and I get this error: "utf8 \xF4 does not
> map to Unicode at C:/Perl64/lib/Encode.pm line 176." When I run my
> extraction script again with MARC-8 encoded data then I don't have the
> same problem.
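The binmode advice in this thread comes down to telling perl how to turn characters into bytes on output. Here is a small core-Perl sketch of that step, with the layer set at open time. The sample string is made up, and `:encoding(UTF-8)` is used here as the stricter relative of the `:utf8` layer recommended in the thread; that substitution is this sketch's choice and does not come from the original posts.

```perl
use strict;
use warnings;

# Made-up sample data: "Dvorak" with a caron on the r and an acute on the a.
my $title = "Dvo\x{159}\x{E1}k";

# Printing $title to a handle with no encoding layer is what triggers
# "Wide character in print". Declaring the layer up front avoids that.
# An in-memory file stands in for the real output file.
my $bytes = '';
open(my $out, '>:encoding(UTF-8)', \$bytes)
    or die "Cannot open in-memory file: $!";
print {$out} $title;
close($out) or die "Cannot close in-memory file: $!";

printf "%d characters became %d bytes\n", length($title), length($bytes);
# prints "6 characters became 8 bytes"
```

For an already-open bareword handle such as FILE, `binmode(FILE, ':encoding(UTF-8)')` applies the same layer after the fact.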