a) Mis-characterized MARC character encodings are a common problem across many of our corpora and ILSs, and a very inconvenient one. It's not just MARC8 that claims to be UTF8 and vice versa, but also data that claims to be MARC8 or UTF8 and is actually neither.

b) One solution would be to have the MARC tool pass the character stream through as-is without complaining, as Godmar suggested; another would be to heuristically guess the 'real' encoding, as Gabe suggests. Personally, I favor a different solution:

The component that encodes to Unicode on the way out? Instead of raising on an invalid character, it should have the option of silently eating it, replacing it with either the empty string or the Unicode "replacement character" ("used to replace an incoming character whose value is unknown or unrepresentable in Unicode" [http://www.fileformat.info/info/unicode/char/fffd/index.htm]).

I have worked with character encoding libraries before that offer this option: replace mangled bytes with the Unicode replacement character. I don't know what's available in Python, though.
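If Python's codecs work anything like the libraries I've used, I'd expect something along these lines (untested sketch; 0xe8 here is the ANSEL umlaut prefix Godmar mentions below, which is not valid UTF-8 in this position):

    raw = b'Z\xe8urich'    # ANSEL-ish bytes mislabeled as UTF-8

    raw.decode('utf-8', errors='replace')   # bad byte becomes U+FFFD
    raw.decode('utf-8', errors='ignore')    # bad byte silently dropped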

Jonathan

On 3/8/2012 3:19 PM, Gabriel Farrell wrote:
Sounds like what you do, Terry, and what we need in PyMARC, is
something like UnicodeDammit [0]. Actually handling all of these
esoteric encodings would be quite the chore, though.
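(If memory serves, usage with the newer bs4 incarnation goes about like this -- double-check me, I'm going from memory:

    from bs4 import UnicodeDammit

    dammit = UnicodeDammit(raw_bytes)    # raw_bytes: your mystery record
    text = dammit.unicode_markup         # best-effort Unicode
    print(dammit.original_encoding)      # what it guessed

It only knows the usual web encodings, though -- MARC8 isn't on the menu.)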

I also used to think it would be cool if we could get MARC8
encoding/decoding into the Python standard library, but then I
realized I'd rather work on other stuff while MARC8 withers and dies.


[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry
<[email protected]>  wrote:
This is one of the reasons you really can't trust the information found in position 9 of the leader. It's also why, when I wrote MarcEdit, I used a mixed process for determining character set -- one that reads this byte and takes the information under advisement, but in the end treats it as a suggestion, just one part of a larger heuristic analysis of the record data to determine whether the information is in UTF8 or not. Fortunately, determining whether a set of data is UTF8 or something else is a fairly easy process. Determining what the something else is, is much more difficult, but generally not necessary.
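In Python terms, the easy half of that check looks something like this (a sketch of the idea, not MarcEdit's actual code -- MarcEdit isn't Python):

    def looks_like_utf8(raw):
        # True if the bytes decode cleanly as UTF-8 (pure ASCII does too,
        # which is harmless since ASCII is a subset of UTF-8)
        try:
            raw.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    def guess_encoding(raw_record):
        # leader[9] == b'a' claims UTF-8; take it under advisement only
        if looks_like_utf8(raw_record):
            return 'utf-8'
        return 'marc8'   # not valid UTF-8, whatever the leader says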

For that reason, if I were advising other people working on MARC processing libraries, I'd advocate having a process for recognizing that certain informational data may not be set correctly, and essentially using a compatibility process to read and correct it. Unfortunately, while the number of vendors and systems that set this encoding byte correctly has increased dramatically (it used to be pretty much no one), it's still so uneven that I generally consider this information unreliable.
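The "correct it" half can be as blunt as patching the leader to match reality (again just a sketch, building on guess_encoding() above; blank in position 9 means MARC-8):

    def fix_leader(raw_record):
        byte = b'a' if guess_encoding(raw_record) == 'utf-8' else b' '
        return raw_record[:9] + byte + raw_record[10:]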

--TR

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Godmar 
Back
Sent: Thursday, March 08, 2012 11:01 AM
To: [email protected]
Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded 
III records

On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <[email protected]> wrote:

Hi Godmar,

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
ordinal not in range(128)

Having seen my fair share of these kinds of encoding errors in Python,
I can speculate (without seeing the pymarc source code, so please
don't hold me to this) that it's the Python code that's not set up to
handle the UTF-8 strings from your data source. In fact, the error
indicates it's using the default 'ascii' codec rather than 'utf-8'. If
it said "'utf-8' codec can't decode...", then I'd suspect a problem with the 
data.
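(Fwiw, Python 2 produces exactly that kind of message whenever a byte string containing non-ASCII bytes gets .encode()d -- it implicitly decodes with the default 'ascii' codec first:

    >>> s = 'Z\xe8urich'     # byte string, 0xe8 is not ASCII
    >>> s.encode('utf-8')    # implicit ascii decode happens first
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1:
    ordinal not in range(128)

so the 'ascii' in the message doesn't necessarily mean anyone asked for ascii.)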

If you were to send the full traceback (all the gobbledy-gook that
Python spews when it encounters an error) and the version of pymarc
you're using to the program's author(s), they may be able to help you out 
further.


My question is less about the Python error, which I understand, than about the MARC record causing the error, and about how others deal with this issue (if it's a common issue, which I do not know).

But here's the long story from pymarc's perspective.

The record has leader[9] == 'a', but really, truly contains ANSEL-encoded data. When reading the record with a MARCReader(to_unicode=False) instance, the record reads fine since no decoding is attempted, but attempts at writing the record fail with the above error: pymarc tries to utf8-encode the ANSEL-encoded string, which contains non-ascii bytes such as 0xe8 (the ANSEL umlaut prefix). It does so because leader[9] == 'a' (see [1]).
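In code, roughly (simplified; 'iii_records.mrc' is a stand-in for the real file):

    from pymarc import MARCReader

    with open('iii_records.mrc', 'rb') as fh:
        reader = MARCReader(fh, to_unicode=False)
        for record in reader:    # reads fine; nothing gets decoded
            record.as_marc()     # fails here: leader[9] == 'a', so pymarc
                                 # tries to utf8-encode raw ANSEL bytes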

When reading the record with a MARCReader(to_unicode=True) instance, it'll 
throw an exception during marc_decode when trying to utf8-decode the 
ANSEL-encoded string. Rightly so.
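That failure mode, in the same terms (same stand-in filename):

    with open('iii_records.mrc', 'rb') as fh:
        reader = MARCReader(fh, to_unicode=True)
        for record in reader:    # raises UnicodeDecodeError during the
            pass                 # utf8 decode of the ANSEL field data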

I don't blame pymarc for this behavior; to me, the record looks wrong.

  - Godmar

(ps: that said, what pymarc does fails in other circumstances too. From what I can see, pymarc shouldn't assume it's OK to utf8-encode the field data whenever leader[9] is 'a'. For instance, that would double-encode correctly encoded MARC/Unicode records that were read with a MARCReader(to_unicode=False) instance. But that's a separate issue and not my immediate concern. pymarc should probably remember whether a record needs encoding when it is written, rather than consulting leader[9].)
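Something like this, I mean (a hypothetical shape, not pymarc's actual classes):

    class Record(object):
        def __init__(self, fields, was_decoded):
            self.fields = fields              # unicode strings or raw bytes
            self.was_decoded = was_decoded    # set once, by the reader

        def as_marc(self):
            if self.was_decoded:              # consult the flag, not leader[9]
                return b''.join(f.encode('utf-8') for f in self.fields)
            return b''.join(self.fields)      # pass raw bytes through untouched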


[1]
https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
