Sounds like what you do, Terry, and what we need in PyMARC, is something like UnicodeDammit [0]. Actually handling all of these esoteric encodings would be quite the chore, though.
I also used to think it would be cool if we could get MARC8 encoding/decoding into the Python standard library, but then I realized I'd rather work on other stuff while MARC8 withers and dies.

[0] https://github.com/bdoms/beautifulsoup/blob/master/BeautifulSoup.py#L1753

On Thu, Mar 8, 2012 at 2:36 PM, Reese, Terry <terry.re...@oregonstate.edu> wrote:
> This is one of the reasons you really can't trust the information found in
> position 9, and why, when I wrote MarcEdit, I used a mixed process when
> working with data and determining the character set -- a process that reads
> this byte and takes the information under advisement, but in the end treats
> it more as a suggestion and one part of a larger heuristic analysis of the
> record data to determine whether the information is in UTF8 or not.
> Fortunately, determining if a set of data is in UTF8 or something else is a
> fairly easy process. Determining the something else is much more difficult,
> but generally not necessary.
>
> For that reason, if I were advising other people working on MARC processing
> libraries, I'd advocate having a process for recognizing that certain
> informational data may not be set correctly, and essentially utilizing a
> compatibility process to read and correct it. Unfortunately, while the
> number of vendors and systems that set this encoding byte correctly has
> increased dramatically (it used to be pretty much no one), it's still so
> uneven that I generally consider this information unreliable.
>
> --TR
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Godmar Back
> Sent: Thursday, March 08, 2012 11:01 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Q.: MARC8 vs. MARC/Unicode and pymarc and misencoded III records
>
> On Thu, Mar 8, 2012 at 1:46 PM, Terray, James <james.ter...@yale.edu> wrote:
>
>> Hi Godmar,
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 9:
>> ordinal not in range(128)
>>
>> Having seen my fair share of these kinds of encoding errors in Python, I
>> can speculate (without seeing the pymarc source code, so please don't
>> hold me to this) that it's the Python code that's not set up to handle
>> the UTF-8 strings from your data source. In fact, the error indicates
>> it's using the default 'ascii' codec rather than 'utf-8'. If it said
>> "'utf-8' codec can't decode...", then I'd suspect a problem with the
>> data.
>>
>> If you were to send the full traceback (all the gobbledy-gook that
>> Python spews when it encounters an error) and the version of pymarc
>> you're using to the program's author(s), they may be able to help you
>> out further.
>>
> My question is less about the Python error, which I understand, than
> about the MARC record causing the error, and about how others deal with
> this issue (if it's a common issue, which I don't know).
>
> But here's the long story from pymarc's perspective.
>
> The record has leader[9] == 'a', but really, truly contains ANSEL-encoded
> data. When reading the record with a MARCReader(to_unicode=False)
> instance, the record reads OK since no decoding is attempted, but
> attempts at writing the record fail with the above error, since pymarc
> attempts to utf8-encode the ANSEL-encoded string, which contains
> non-ascii chars such as 0xe8 (the ANSEL umlaut prefix). It does so
> because leader[9] == 'a' (see [1]).
>
> When reading the record with a MARCReader(to_unicode=True) instance,
> it'll throw an exception during decode_marc when trying to utf8-decode
> the ANSEL-encoded string. Rightly so.
>
> I don't blame pymarc for this behavior; to me, the record looks wrong.
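Terry's point that the "is it UTF8?" half of the problem is easy can be shown in a few lines of Python. This is just an illustrative sketch, not MarcEdit's or pymarc's actual code, and the sample field bytes are made up; note that pure-ASCII ANSEL data would still pass the check, which is why this can only be one input to a larger heuristic:

```python
def is_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# ANSEL places the 0xe8 umlaut prefix *before* the base letter.
ansel_field = b"Z\xe8urich"
utf8_field = "Z\u00fcrich".encode("utf-8")

# 0xe8 starts a 3-byte UTF-8 sequence, but 'u' (0x75) is not a valid
# continuation byte, so the decode fails -- ANSEL data is flagged reliably.
print(is_utf8(ansel_field))  # False
print(is_utf8(utf8_field))   # True
```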
>
> - Godmar
>
> (ps: that said, what pymarc does fails in different circumstances -- from
> what I can see, pymarc shouldn't assume that it's OK to utf8-encode the
> field data if leader[9] is 'a'. For instance, this would double-encode
> correctly encoded MARC/Unicode records that were read with a
> MARCReader(to_unicode=False) instance. But that's a separate issue that
> is not my immediate concern. pymarc should probably remember whether a
> record needs encoding when writing it, rather than consulting the
> leader[9] field.)
>
> [1]
> https://github.com/mbklein/pymarc/commit/ff312861096ecaa527d210836dbef904c24baee6
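For what it's worth, Godmar's closing suggestion -- remember at read time whether a record still needs encoding, instead of consulting leader[9] at write time -- could look roughly like this. A hypothetical sketch only; RecordSketch and its attributes are invented for illustration and are not pymarc's actual classes:

```python
class RecordSketch:
    """Toy record that carries its decoding state from reader to writer."""

    def __init__(self, fields, already_unicode: bool):
        self.fields = fields                    # str if decoded, bytes otherwise
        self.already_unicode = already_unicode  # set once by the reader

    def as_marc(self) -> bytes:
        out = []
        for field in self.fields:
            if self.already_unicode:
                # Safe: the reader really did decode these to unicode.
                out.append(field.encode("utf-8"))
            else:
                # Pass raw bytes through untouched, whatever leader[9] says.
                out.append(field)
        return b"".join(out)

# A reader with to_unicode=True would set already_unicode=True; one that
# left the bytes alone sets it False, so writing never tries to utf8-encode
# ANSEL bytes just because leader[9] happens to claim 'a'.
raw = RecordSketch([b"Z\xe8urich"], already_unicode=False)
print(raw.as_marc())  # the 0xe8 byte survives the round trip unchanged
```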