Sounds like what you do, Terry, and what we need in PyMARC, is something like UnicodeDammit [0]. Actually handling all of these esoteric encodings would be quite the chore, though.
I also used to think it would be cool if we could get MARC8 encoding/decoding into the Python standard library, but then I realized I'd rather work on other stuff while MARC8 withers and dies.

[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <terry.re...@oregonstate.edu> wrote:
> This is one of the reasons you really can't trust the information found in
> position 9, and why, when I wrote MarcEdit, I used a mixed process when
> working with data and determining the character set -- a process that reads
> this byte and takes the information under advisement, but in the end treats
> it more as a suggestion and one part of a larger heuristic analysis of the
> record data to determine whether the information is in UTF8 or not.
> Fortunately, determining if a set of data is in UTF8 or something else is a
> fairly easy process. Determining the something else is much more difficult,
> but generally not necessary.
>
> For that reason, if I were advising other people working on MARC processing
> libraries, I'd advocate having a process for recognizing that certain
> informational data may not be set correctly, and essentially utilizing a
> compatibility process to read and correct it. Unfortunately, while the
> number of vendors and systems that set this encoding byte correctly has
> increased dramatically (it used to be pretty much no one), it's still so
> uneven that I generally consider this information unreliable.
>
> --TR
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Godmar Back
> Sent: Thursday, March 08, 2012 11:01 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
>
> On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <james.ter...@yale.edu> wrote:
>
>> Hi Godmar,
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
>> ordinal not in range(128)
>>
>> Having seen my fair share of these kinds of encoding errors in Python, I
>> can speculate (without seeing the pymarc source code, so please don't
>> hold me to this) that it's the Python code that's not set up to handle
>> the UTF-8 strings from your data source. In fact, the error indicates
>> it's using the default 'ascii' codec rather than 'utf-8'. If it said
>> "'utf-8' codec can't decode...", then I'd suspect a problem with the
>> data.
>>
>> If you were to send the full traceback (all the gobbledy-gook that
>> Python spews when it encounters an error) and the version of pymarc
>> you're using to the program's author(s), they may be able to help you
>> out further.
>>
> My question is less about the Python error, which I understand, than
> about the MARC record causing the error, and about how others deal with
> this issue (if it's a common issue, which I don't know).
>
> But here's the long story from pymarc's perspective.
>
> The record has leader[9] == 'a', but really, truly contains ANSEL-encoded
> data. When reading the record with a MARCReader(to_unicode=False)
> instance, the record reads OK since no decoding is attempted, but
> attempts at writing the record fail with the above error, since pymarc
> attempts to utf8-encode the ANSEL-encoded string, which contains
> non-ascii chars such as 0xe8 (the ANSEL umlaut prefix). It does so
> because leader[9] == 'a' (see [1]).
>
> When reading the record with a MARCReader(to_unicode=True) instance,
> it'll throw an exception during decode_marc when trying to utf8-decode
> the ANSEL-encoded string. Rightly so.
>
> I don't blame pymarc for this behavior; to me, the record looks wrong.
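Terry's point that the "is it UTF8?" half of the problem is easy can be shown in a few lines of Python. This is just an illustrative sketch, not MarcEdit's or pymarc's actual code, and the sample field bytes are made up; note that pure-ASCII ANSEL data would still pass the check, which is why this can only be one input to a larger heuristic:

```python
def is_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# ANSEL places the 0xe8 umlaut prefix *before* the base letter.
ansel_field = b"Z\xe8urich"
utf8_field = "Z\u00fcrich".encode("utf-8")

# 0xe8 starts a 3-byte UTF-8 sequence, but 'u' (0x75) is not a valid
# continuation byte, so the decode fails -- ANSEL data is flagged reliably.
print(is_utf8(ansel_field))  # False
print(is_utf8(utf8_field))   # True
```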
>
> - Godmar
>
> (ps: that said, what pymarc does fails in different circumstances -- from
> what I can see, pymarc shouldn't assume that it's OK to utf8-encode the
> field data if leader[9] is 'a'. For instance, this would double-encode
> correctly encoded MARC/Unicode records that were read with a
> MARCReader(to_unicode=False) instance. But that's a separate issue that
> is not my immediate concern. pymarc should probably remember whether a
> record needs encoding when writing it, rather than consulting the
> leader[9] field.)
>
> [1]
> https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
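For what it's worth, Godmar's closing suggestion -- remember at read time whether a record still needs encoding, instead of consulting leader[9] at write time -- could look roughly like this. A hypothetical sketch only; RecordSketch and its attributes are invented for illustration and are not pymarc's actual classes:

```python
class RecordSketch:
    """Toy record that carries its decoding state from reader to writer."""

    def __init__(self, fields, already_unicode: bool):
        self.fields = fields                    # str if decoded, bytes otherwise
        self.already_unicode = already_unicode  # set once by the reader

    def as_marc(self) -> bytes:
        out = []
        for field in self.fields:
            if self.already_unicode:
                # Safe: the reader really did decode these to unicode.
                out.append(field.encode("utf-8"))
            else:
                # Pass raw bytes through untouched, whatever leader[9] says.
                out.append(field)
        return b"".join(out)

# A reader with to_unicode=True would set already_unicode=True; one that
# left the bytes alone sets it False, so writing never tries to utf8-encode
# ANSEL bytes just because leader[9] happens to claim 'a'.
raw = RecordSketch([b"Z\xe8urich"], already_unicode=False)
print(raw.as_marc())  # the 0xe8 byte survives the round trip unchanged
```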