On Wed, Jan 11, 2012 at 3:35 PM, arvinport...@lycos.com <arvinport...@lycos.com> wrote:
> I've been converting MARC XML records into USMARC and recently had a slew of
> bad records which MARCEdit reported as having invalid leaders. After a few
> days of puzzling over this and blaming it all on Unicode I noticed they were
> all records which contained newlines (0D 0A) in their datafields. As far as I
> know newlines aren't illegal in USMARC, but when I replaced them with spaces,
> sure enough the problem disappeared.
>
A guess? (I haven't looked closely at the test record.) A space takes up one less byte than a two-byte newline (20 vs. 0D 0A). The leader records how many bytes the record has. If you replace a two-byte sequence with a one-byte character and it suddenly works, the byte count in the leader might be off.

So imagine that once, in the distant past, those records lived on a system that uses just a line feed for "newline", so just 0A. Then someone does something like download them with an sftp client that changes line endings, or runs them through some program that changes the line endings of files (some unixy tools have this as an annoying side effect, which causes weird things in cygwin). Suddenly your one-byte newline is a two-byte newline, and these tools don't know to alter the leader to reflect this.

I've seen something similar where someone tried to use some sort of character set conversion tool, one that didn't "grok" MARC, on raw binary MARC files to get them to utf-8. Suddenly what had been one character / one byte became one character / two bytes. Since this didn't happen in every record, it passed a simple test they did, and they started trying to run it on a massive number of records. Of course, since I'd be surprised if that tool grokked marc-8 either, I'm surprised the characters looked correct in the new records, but that's not exactly relevant.

Jon Gorman
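
For anyone who wants to test the guess above, here is a rough sketch in Python. It compares the record length declared in the first five bytes of the leader with the number of bytes actually present. The helper names (build_toy_record, leader_mismatch) and the toy record are made up for illustration and come from no MARC library; a real file would also need to be split into records on the 1D terminator first.

```python
FT = b"\x1e"  # MARC field terminator
RT = b"\x1d"  # MARC record terminator


def build_toy_record(note: bytes) -> bytes:
    """Build a minimal one-field (500 $a) record with a correct leader.

    Hypothetical helper, only here so the example has something to check.
    """
    field = b"  \x1fa" + note + FT  # two blank indicators, subfield $a
    directory = b"500" + b"%04d" % len(field) + b"00000" + FT
    base = 24 + len(directory)      # leader is always 24 bytes
    total = base + len(field) + len(RT)
    leader = b"%05d" % total + b"nam a22" + b"%05d" % base + b" a 4500"
    return leader + directory + field + RT


def leader_mismatch(record: bytes) -> int:
    """Return actual length minus the length declared in leader bytes 0-4."""
    declared = int(record[:5].decode("ascii"))
    return len(record) - declared


# A record whose data field contains a bare line feed (0A).
good = build_toy_record(b"line one\nline two")

# Simulate a tool that rewrites line endings without touching the leader.
bad = good.replace(b"\n", b"\r\n")

print(leader_mismatch(good))  # leader and byte count agree
print(leader_mismatch(bad))   # one extra byte per converted newline
```

If the mismatch for a bad record equals the number of 0D 0A pairs in it, that would fit the line-ending theory; it would not prove it.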