On Wed, Jan 11, 2012 at 3:35 PM, arvinport...@lycos.com
<arvinport...@lycos.com> wrote:
> I've been converting MARC XML records into USMARC and recently had a slew of 
> bad records which MARCEdit reported as having invalid leaders. After a few 
> days of puzzling over this and blaming it all on Unicode I noticed they were 
> all records which contained newlines (0D 0A) in their datafields. As far as I 
> know newlines aren't illegal in USMARC, but when I replaced them with spaces, 
> sure enough the problem disappeared.
>


A guess?  (I haven't looked closely at the test record)

A space takes up one less byte than a CR LF newline pair (20 vs 0D
0A).  The leader records how many bytes the record has.  If you
replace a two-byte sequence with a one-byte character and it suddenly
works, the byte count in the leader might be off.
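
A rough sketch of that check in Python, for anyone who wants to test the
guess.  It assumes the usual layout where leader positions 00-04 hold the
record length as five ASCII digits.  The names declared_length,
length_drift and make_toy_record are made up for illustration, and the toy
record has no directory, so it is not a complete MARC record.

```python
LEADER_LEN = 24  # a MARC leader is always 24 bytes


def declared_length(record: bytes) -> int:
    """Record length as stated in leader positions 00-04 (five ASCII digits)."""
    return int(record[:5].decode("ascii"))


def length_drift(record: bytes) -> int:
    """Actual byte count minus declared byte count; 0 means they agree."""
    return len(record) - declared_length(record)


def make_toy_record(body: bytes) -> bytes:
    """Hypothetical helper: 24-byte leader + body + record terminator (0x1D).

    Assumption: this skips the directory, so it only illustrates the
    length bookkeeping, not a full record.
    """
    total = LEADER_LEN + len(body) + 1
    leader = f"{total:05d}".encode("ascii") + b" " * (LEADER_LEN - 5)
    return leader + body + b"\x1d"


record = make_toy_record(b"some field data")
# Grow the body by two bytes without touching the leader.
stale = record.replace(b"field", b"field!!")

print(length_drift(record))  # 0
print(length_drift(stale))   # 2
```

A non-zero drift on the bad records (and zero on the good ones) would
support the guess.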

So imagine that once, in the distant past, those records lived on a
system that uses just a line feed for "newline".  So just 0A.  Then
someone does something like download them with an sftp client that
changes line endings.  Or runs them through some program that changes
the line endings of files (some unixy tools have this as an annoying
side effect, which causes weird things in cygwin).

Suddenly your one byte newline is a two byte newline.  And these tools
don't know to alter the leader to reflect this.
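
To make the arithmetic concrete, here is a small Python sketch of what
such a line-ending conversion does to the byte count.  The function
lf_to_crlf is a stand-in for whatever the transfer tool did, not any
particular tool's behaviour, and the field content is invented.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Mimic a text-mode transfer that turns LF line endings into CR LF."""
    # Normalise first so existing CR LF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


field = b"first line\nsecond line\nthird line"  # made-up field content
converted = lf_to_crlf(field)

# Each newline gained one byte, so the data grew by the number of newlines.
growth = len(converted) - len(field)
print(growth)  # 2

# Swapping each CR LF for a single space brings the length back to what
# it was before the conversion.
respaced = converted.replace(b"\r\n", b" ")
print(len(respaced) == len(field))  # True
```

In a real record the directory's field lengths and starting positions
would go stale the same way, not just the length in the leader.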

I've seen something similar where someone tried to use some sort of
character set conversion tool that didn't "grok" MARC on raw binary
MARC files to get them to utf-8.  Suddenly what had been one character
/ one byte became one character / two bytes.  Since this didn't happen
in every record, it passed a simple test they did and they started
trying to do it on a massive number of records.  Of course, since I'd
be surprised if that tool grokked marc-8 either, I'm surprised the
characters looked correct in the new records, but that's not exactly
relevant.
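
A sketch of why a spot check can pass, again in Python.  Latin-1 stands in
for the source encoding here purely as an assumption, since Python's
standard library has no marc-8 codec; naive_to_utf8 is a hypothetical
name for a converter that knows nothing about MARC.

```python
def naive_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes as UTF-8, assuming (for illustration) Latin-1 input."""
    return raw.decode("latin-1").encode("utf-8")


ascii_only = b"Moby Dick"
accented = "Les Mis\u00e9rables".encode("latin-1")  # contains one e-acute

# Pure ASCII data comes out byte-for-byte identical, so a test on such
# a record passes.
print(len(naive_to_utf8(ascii_only)) - len(ascii_only))  # 0

# One accented character goes from one byte to two, so the data grows
# by one byte and any length stored elsewhere is now wrong.
print(len(naive_to_utf8(accented)) - len(accented))  # 1
```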


Jon Gorman
