> I'm making headway on my MARC records, but only through the use of brute
> force.
>
> I used wget to retrieve the MARC records (as well as associated PDF and
> text files) from the Internet Archive.
I know IA has some bad MARC records (and also records with bad encoding) from my experience with them in the past. I'm also not sure what the web server / wget will do to the files.

> I did play a bit with yaz-marcdump to seemingly convert things from marc-8
> to utf-8, but I'm not so sure it does what is expected. Does it actually
> convert characters, or does it simply change a value in the leader of each
> record? If the former, then how do I know it is not double-encoding
> things? If the latter, then my resulting data set is still broken.

There was a bug I seem to remember with yaz-marcdump where it was just toggling the leader. (Or a design flaw where you had to specify a character conversion as well.) But I thought that was fixed a while ago. It's probably one of the better tools out there for this type of work.

> If MARC records are not well-formed and do not validate according to the
> standard, then just like XML processors, they should be used. Garbage in.
> Garbage out.

I'm guessing you meant "they shouldn't be used"? ;) XML processors aren't really known for flexibility in this regard.

Unfortunately there are a lot of issues here. Not least, some of the worst problems I've seen are introduced by well-meaning folks who dump a file out into MARCXML or a marc-breaker format, twiddle with bits, and start using tools to dump Unicode text into what is really a MARC-8 file. Then, somewhere along the pipeline, enough character-encoding conversions happen that the file ends up mangled.

And then there's always the legacy data that got bungled in an encoding transfer. I know we've got some bad CJK characters due to this: at some point in converting our MARC-8 records, one or two characters got mapped to something that isn't in the Unicode spec at all. We'll clean up those records eventually, you know, when we've got some spare time :P
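For what it's worth, "just toggling the leader" refers to leader position 09, the MARC 21 character-coding flag (blank = MARC-8, 'a' = UCS/Unicode). A converter that only flips that byte leaves the field data untouched, which is exactly the broken case above. Here's a rough sketch of checking the flag and sniffing for double-encoding (the helper names are mine, not from any real library):

```python
# Sketch only. In MARC 21, leader position 09 is the character-coding
# flag: ' ' means MARC-8, 'a' means UCS/Unicode. A tool that only flips
# this byte "converts" nothing.

def coding_flag(record: bytes) -> str:
    """Return the character-coding flag from leader position 09."""
    return chr(record[9])

def looks_double_encoded(field_data: bytes) -> bool:
    """Crude heuristic: UTF-8 bytes that were re-encoded as UTF-8 via
    Latin-1 produce telltale sequences like 'Ã©' where 'é' belongs."""
    text = field_data.decode("utf-8", errors="replace")
    return any(marker in text for marker in ("Ã", "Â"))

# 'é' correctly encoded once, and the same character double-encoded:
ok = "é".encode("utf-8")                                   # b'\xc3\xa9'
bad = "é".encode("utf-8").decode("latin-1").encode("utf-8")
```

The double-encoding check is only a heuristic -- 'Ã' and 'Â' do occur legitimately in some records -- but it catches the common UTF-8-reencoded-as-Latin-1 damage.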
The problem here has been that the tools pass whatever internal validations are enforced. Probably more stages need to check for validity, but there are a lot of records that would fail if they did. (I don't even want to think about how many people disable validation, or use the same software stack that generated the MARC in the first place, or about changes within the MARC spec itself over time that make validation even more difficult.)

Jon Gorman
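To make "check for validity" a little more concrete, here's a minimal structural check of a binary MARC (ISO 2709) record -- purely a sketch, and only of the framing (declared length, base address, directory shape, terminators), not of content designators or character encoding:

```python
# A minimal structural sanity check for an ISO 2709 record -- the kind
# of cheap check more pipeline stages could run before trusting a file.

LEADER_LEN = 24
FIELD_TERM = 0x1E   # field terminator
RECORD_TERM = 0x1D  # record terminator

def structurally_valid(record: bytes) -> bool:
    if len(record) < LEADER_LEN or record[-1] != RECORD_TERM:
        return False
    try:
        declared_len = int(record[0:5])   # leader/00-04: record length
        base = int(record[12:17])         # leader/12-16: base address of data
    except ValueError:
        return False
    if declared_len != len(record):
        return False
    # The directory runs from the end of the leader to the base address,
    # ends with a field terminator, and is made of 12-byte entries.
    directory = record[LEADER_LEN:base]
    if not directory or directory[-1] != FIELD_TERM:
        return False
    return (len(directory) - 1) % 12 == 0
```

Real validators (yaz-marcdump, MARCEdit, and friends) check far more than this, but even this level of framing check would catch a lot of truncated or concatenation-damaged files early.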