Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records

Terray, James Thu, 08 Mar 2012 10:50:39 -0800

Hi Godmar,

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9: ordinal 
not in range(128)


Having seen my fair share of these kinds of encoding errors in Python, I can 
speculate (without seeing the pymarc source code, so please don't hold me to 
this) that it's the Python code that's not set up to handle the UTF-8 strings 
from your data source. In fact, the error indicates it's using the default 
'ascii' codec rather than 'utf-8'. If it said "'utf-8' codec can't decode...", 
then I'd suspect a problem with the data.
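
To make that distinction concrete, here is a minimal sketch (Python 3, standard library only; it does not use pymarc). The byte string is taken from the 245$a shown in the quoted message, and the helper name codec_error is hypothetical. It shows that the same bytes fail under both codecs, but the message names whichever codec was actually in use:

```python
# Bytes from the 245$a in the quoted record: 0xE8 followed by 'u' (0x75).
raw = b"Die j\xe8udischer Weltpest"


def codec_error(data, codec):
    """Return the UnicodeDecodeError message for `data` under `codec`, or None."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# The message names the codec that was applied, which is the clue used above.
ascii_msg = codec_error(raw, "ascii")
utf8_msg = codec_error(raw, "utf-8")

print(ascii_msg)
print(utf8_msg)
```

An 'ascii' message therefore says the code never attempted UTF-8 at all; it does not by itself show whether the data is valid UTF-8. For these particular bytes the UTF-8 decode fails as well.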

If you were to send the full traceback (all the gobbledy-gook that Python spews 
when it encounters an error) and the version of pymarc you're using to the 
program's author(s), they may be able to help you out further.

Thanks,

Jay



________________________________________
From: Code for Libraries [[email protected]] on behalf of Godmar Back 
[[email protected]]
Sent: Thursday, March 08, 2012 1:02 PM
To: [email protected]
Subject: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III 
records

Hi,

a few days ago, I showed pymarc to a group of technical librarians to
demonstrate how easily certain tasks can be scripted/automated.

Unfortunately, it blew up at me when I tried to write a record:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
ordinal not in range(128)

Investigation revealed this culprit:

=LDR  00916nam a2200241I  4500
=001  ocm10685946
=005  19880203211447.0
=007  cr\bn||||||abp
=007  cr\bn||||||cda
=008  840503s1939\\\\gw\\\\\\\\\\\\00010\ger\d
=040  \\$aMBB$cMBB$dCRL
=049  \\$aCRLL
=100  10$aEsser, Hermann,$d1900-
=245  14$aDie j<E8>udischer Weltpest ;$bjudend<E1>ammerung auf dem
Erdball,$cvon Hermann Esser.
=260  0\$aM<E8>unchen,$bZentralverlag der N S D A P., F. Eher ahchf.,$c1939.
=300  \\$a243 [1] p.$c23 cm.
=533  \\$aAlso available as electronic reproduction.$bChicago :$cCenter for
Research Libraries,$d[2009]
=650  \0$aJewish question.
=700  12$aBierbrauer, Johann Jacob,$d1705-1760?
=710  2\$aCenter for Research Libraries (U.S.)
=856  41$uhttp://dds.crl.edu/CRLdelivery.asp?tid=10538$zOnline version
=907  \\$a.b28931622$b08-30-10$c08-30-10
=998  \\$awww$b08-30-10$cm$dz$e-$fger$ggw $h4$i0

Leader position 9 (leader[9]) is set to 'a', so the record should contain
UTF-8-encoded Unicode [1], but the bytes E8 75 in the 245$a appear to be ANSEL
(MARC-8), where 'E8' denotes the umlaut preceding the lowercase 'u' (0x75). [2]
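
The check described above can be sketched as follows, using only the Python 3 standard library. The function names and the one-entry table are hypothetical, and the table covers only the 0xE8 byte discussed here; a real MARC-8 converter (such as the MARC-8 support that ships with pymarc) handles far more. The sketch relies on the point made above that the ANSEL diacritic precedes its base letter, whereas a Unicode combining mark follows it:

```python
import unicodedata

# Assumption: only the byte discussed in the message is mapped here.
# 0xE8 is the ANSEL umlaut, i.e. U+0308 COMBINING DIAERESIS in Unicode.
ANSEL_COMBINING = {0xE8: "\u0308"}


def claims_utf8(leader):
    """Leader position 9 is 'a' when the record claims to be Unicode."""
    return leader[9] == "a"


def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def ansel_to_unicode(data):
    """Convert ASCII plus the mapped ANSEL diacritics to a Unicode string.

    ANSEL places a diacritic before its base letter; Unicode combining
    marks follow the base, so each mark is held back until the base arrives.
    """
    out = []
    pending = []
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))
            out.extend(pending)
            pending = []
        else:
            raise ValueError("unmapped byte 0x%02x" % byte)
    if pending:
        raise ValueError("diacritic without a base character")
    # NFC composes 'u' + U+0308 into the single character U+00FC.
    return unicodedata.normalize("NFC", "".join(out))


leader = "00916nam a2200241I  4500"   # from the record above
field = b"M\xe8unchen"                # from the 260$a above

misencoded = claims_utf8(leader) and not is_valid_utf8(field)
repaired = ansel_to_unicode(field)
```

A record for which `misencoded` is true claims Unicode in the leader but carries bytes that are not valid UTF-8, which matches the situation described above. Whether such records should be converted or only flagged is a separate decision.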

To me, this record looks misencoded... am I correct here? There are
thousands of such records in the data set I'm dealing with, which was
obtained using the 'Data Exchange' feature of III's Millennium system.

My question is how others, especially pymarc users dealing with III
records, deal with this issue or whatever other
experiences/hints/practices/kludges exist in this area.

Thanks.

 - Godmar

[1] http://www.loc.gov/marc/bibliographic/bdleader.html
[2] http://lcweb2.loc.gov/diglib/codetables/45.html
