Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Dr. Saiful Amin
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack wrote:

> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> sounds like once I set binmode to UTF-8 everything will be interpreted as
> such, even when the record is in MARC-8. Is that right? So this means that
> I can only use my script with a file of records where all of them are
> encoded in UTF-8. If I want to run the script against a file with all
> MARC-8 encoding, then I'd need to remove the binmode line.
>

Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8
conversion (it's much faster):
yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in >marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Shelley Doljack
The problem was I wasn't telling perl to output UTF-8. Now that I added 
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds 
like once I set binmode to UTF-8 everything will be interpreted as such, even 
when the record is in MARC-8. Is that right? So this means that I can only use 
my script with a file of records where all of them are encoded in UTF-8. If I 
want to run the script against a file with all MARC-8 encoding, then I'd need 
to remove the binmode line. 

It doesn't seem possible to say: 

if ($record->encoding() eq 'UTF-8') {
    binmode(FILE, ':utf8');
    print FILE $record->as_usmarc();
}
else {
    print FILE $record->as_usmarc();
}

This will result in messing up the diacritics if a file has a mixture of 
records in MARC-8 and UTF-8. Is that correct? 
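
A note on the snippet above: binmode changes the filehandle itself, so once ':utf8' is set it stays in effect for every later print, including the MARC-8 records. One possible way around that, offered only as a sketch, is to leave the handle raw and encode each record explicitly. This assumes that as_usmarc() hands back a character string for UTF-8 records and a plain byte string for MARC-8 records; marc_bytes is a made-up helper, not part of MARC::Record, and the two sample strings are stand-ins for real records.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of MARC::Record): convert the string
# returned by as_usmarc() into the bytes that should land on disk.
# $encoding is assumed to be the value of $record->encoding().
sub marc_bytes {
    my ($marc_string, $encoding) = @_;
    return $encoding eq 'UTF-8'
        ? encode('UTF-8', $marc_string)   # characters -> UTF-8 bytes
        : $marc_string;                   # MARC-8: assumed to be bytes already
}

# Stand-ins for two as_usmarc() results; real records would come
# from $batch->next() as in the script outline.
my @records = (
    [ "Erd\x{0151}s", 'UTF-8'  ],   # character string with a wide character
    [ "caf\xE2e",     'MARC-8' ],   # byte string with a top-bit-set byte
);

# An in-memory handle keeps the sketch self-contained; a real script
# would open the output file with '>:raw' instead.
my $output = '';
open my $out, '>:raw', \$output or die "Cannot open in-memory handle: $!";
print {$out} marc_bytes(@$_) for @records;
close $out;

printf "%d bytes written\n", length $output;
```

In the real script the call would presumably look like print FILE marc_bytes($record->as_usmarc(), $record->encoding()), with no binmode ':utf8' on FILE at all.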

Thanks, 
Shelley 

- Original Message -

> From: "William Dueber" 
> To: "Shelley Doljack" 
> Cc: perl4lib@perl.org
> Sent: Monday, July 30, 2012 5:13:41 PM
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

> First off, it's entirely possible that you have bad UTF-8 (perhaps
> rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
> have plenty of that crap.

> You need to tell perl that you'll be outputting UTF-8 using 'binmode':

> binmode(FILE, ':utf8');

> In general, you'll want to do this to basically every file you open
> for reading or writing.

> A great overview of Perl and UTF-8 can be found at:

> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <
> sdolj...@stanford.edu > wrote:

> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is
> > correctly encoded in UTF-8 and I run my script from windows command
> > prompt, this warning message appears: "Wide character in print at
> > record_extraction.pl line 99" (the line in my script where I print
> > to a new file using as_usmarc). I compared the extracted record
> > before and after in MarcEdit and the diacritic was changed. I tried
> > marcdump newfile.mrc to see what happens and I get this error:
> > "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176."
> > When I run my extraction script again with MARC-8 encoded data then
> > I don't have the same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >     #do some checks
> >     #if checks ok then
> >     print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as
> > UTF-8? Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167

> --

> Bill Dueber
> Programmer -- Library Systems
> University of Michigan


RE: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread PHILLIPS M.E.
I recently came across a nasty issue with MARC::Record to do with output of 
MARC-8 encoded records.  I was converting XML (which was in UTF-8) into MARC 
records using MARC::Record and had initially succeeded in getting good UTF-8 
encoded MARC records out at the end.

However, I then could not load them into our LMS, and realised it was going to 
be easier at the LMS end if the records were presented in MARC-8.  While the 
Perl modules largely worked and I got the right MARC-8 representation out at 
the end, the record length and the field offsets and lengths in the directory 
got in a real mess, because the top-bit-set characters in MARC-8 got counted as 
though they were code-points 0x80 to 0xFF encoded as two bytes of UTF-8.  I 
found a solution by hackily recalculating the lengths when needed, but I 
thought I'd mention it as the thread has touched on this area.
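
The kind of miscount described above can be reproduced with core Perl alone. This is only an illustration of the arithmetic, not a claim about where exactly it happens inside the MARC modules; the field content is made up, with 0xE2 standing in for any top-bit-set MARC-8 byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative MARC-8 field data: 0xE2 is a top-bit-set byte.
my $marc8_field = "caf\xE2e";

# Correct length for the directory: one count per MARC-8 byte.
my $byte_length = length $marc8_field;

# What happens if the same string is treated as characters
# U+0000..U+00FF and measured after UTF-8 encoding: 0xE2 becomes
# the two bytes 0xC3 0xA2, so the count is one too high.
my $miscounted_length = length encode('UTF-8', $marc8_field);

printf "MARC-8 bytes: %d, counted as UTF-8: %d\n",
    $byte_length, $miscounted_length;
```

Every such byte in a field adds one to the miscounted length, which is enough to throw off the directory offsets of all the fields that follow it.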

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941





RE: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Smith,Devon
I just recently came across this presentation which lays out pretty much all 
the issues with Unicode in perl, and makes some recommendations for best 
practices. You may find some general insight into the whole situation by going 
over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev
-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm


-Original Message-
From: William Dueber [mailto:dueb...@umich.edu] 
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.
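
As a rough sketch of what that looks like in practice: the same effect can be had by naming the layer when the file is opened rather than calling binmode afterwards. The example below uses ':encoding(UTF-8)', the stricter relative of ':utf8' (for output both should write the same bytes), and an in-memory handle in place of a real file. The sample string is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $title = "Erd\x{0151}s";   # contains U+0151, a character above 0xFF

# In-memory handle stands in for a real file; the layer is what matters.
my $written = '';
open my $fh, '>:encoding(UTF-8)', \$written
    or die "Cannot open in-memory handle: $!";
print {$fh} $title;           # no "Wide character" warning with the layer set
close $fh;

# Reading back: decode the bytes to get the original characters again.
my $round_trip = decode('UTF-8', $written);
printf "%d characters became %d bytes\n", length $title, length $written;
```

For reading, the matching form would be '<:encoding(UTF-8)', which also checks that the input really is valid UTF-8.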

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default







-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan