> I'm making headway on my MARC records, but only through the use of brute
> force.
>
> I used wget to retrieve the MARC records (as well as associated PDF and
> text files) from the Internet Archive.
I know IA has some bad MARC records (and also records with bad encoding) from my experience with them in the past. I'm also not sure what the web server / wget will do to the files.

> I did play a bit with yaz-marcdump to seemingly convert things from marc-8
> to utf-8, but I'm not so sure it does what is expected. Does it actually
> convert characters, or does it simply change a value in the leader of each
> record? If the former, then how do I know it is not double-encoding
> things? If the latter, then my resulting data set is still broken.

There was a bug I seem to remember with yaz-marcdump where it was just toggling the leader. (Or a design flaw where you had to specify a character conversion as well.) But I thought that was fixed a while ago. It's probably one of the better tools out there for this type of work.

> If MARC records are not well-formed and do not validate according to the
> standard, then just like XML processors, they should be used. Garbage in.
> Garbage out.

I'm guessing you meant "they shouldn't be used"? ;) XML processors aren't really known for flexibility in this regard.

Unfortunately there are a lot of issues here. Not least, some of the worst problems I've seen are introduced by well-meaning folks who dump a file out into MARCXML or a marc-breaker format, twiddle with bits, and start using tools to dump Unicode text into what is really a MARC-8 file. Then, somewhere along the pipeline, enough character-encoding conversions happen that the file ends up mangled.

And then there's always the legacy data that got bungled in an encoding transfer. I know we've got some bad CJK characters due to this: at some point in converting our MARC-8 records, one or two characters got mapped to something that isn't in the Unicode spec at all. We'll clean up those records eventually, you know, when we've got some spare time :P
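For what it's worth, "just toggling the leader" refers to leader position 09, the MARC 21 character-coding flag (blank = MARC-8, 'a' = UCS/Unicode). A converter that only flips that byte leaves the field data untouched, which is exactly the broken case above. Here's a rough sketch of checking the flag and sniffing for double-encoding (the helper names are mine, not from any real library):

```python
# Sketch only. In MARC 21, leader position 09 is the character-coding
# flag: ' ' means MARC-8, 'a' means UCS/Unicode. A tool that only flips
# this byte "converts" nothing.

def coding_flag(record: bytes) -> str:
    """Return the character-coding flag from leader position 09."""
    return chr(record[9])

def looks_double_encoded(field_data: bytes) -> bool:
    """Crude heuristic: UTF-8 bytes that were re-encoded as UTF-8 via
    Latin-1 produce telltale sequences like 'Ã©' where 'é' belongs."""
    text = field_data.decode("utf-8", errors="replace")
    return any(marker in text for marker in ("Ã", "Â"))

# 'é' correctly encoded once, and the same character double-encoded:
ok = "é".encode("utf-8")                                   # b'\xc3\xa9'
bad = "é".encode("utf-8").decode("latin-1").encode("utf-8")
```

The double-encoding check is only a heuristic -- 'Ã' and 'Â' do occur legitimately in some records -- but it catches the common UTF-8-reencoded-as-Latin-1 damage.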
The problem here has been that the tools pass whatever internal validations are enforced. Probably more stages need to check for validity, but there are a lot of records that would fail if they did. (I don't even want to think about how many people disable validation, or use the same software stack that generated the MARC in the first place, or about changes within the MARC spec itself over time that make validation even more difficult.)

Jon Gorman
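To make "check for validity" a little more concrete, here's a minimal structural check of a binary MARC (ISO 2709) record -- purely a sketch, and only of the framing (declared length, base address, directory shape, terminators), not of content designators or character encoding:

```python
# A minimal structural sanity check for an ISO 2709 record -- the kind
# of cheap check more pipeline stages could run before trusting a file.

LEADER_LEN = 24
FIELD_TERM = 0x1E   # field terminator
RECORD_TERM = 0x1D  # record terminator

def structurally_valid(record: bytes) -> bool:
    if len(record) < LEADER_LEN or record[-1] != RECORD_TERM:
        return False
    try:
        declared_len = int(record[0:5])   # leader/00-04: record length
        base = int(record[12:17])         # leader/12-16: base address of data
    except ValueError:
        return False
    if declared_len != len(record):
        return False
    # The directory runs from the end of the leader to the base address,
    # ends with a field terminator, and is made of 12-byte entries.
    directory = record[LEADER_LEN:base]
    if not directory or directory[-1] != FIELD_TERM:
        return False
    return (len(directory) - 1) % 12 == 0
```

Real validators (yaz-marcdump, MARCEdit, and friends) check far more than this, but even this level of framing check would catch a lot of truncated or concatenation-damaged files early.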