On 4/6/2011 2:02 PM, Kyle Banerjee wrote:
I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which are never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
to do with a page because it lacks a document type declaration.

Well, the problem is when the original Marc4J author took the spec at its word, and actually _acted upon_ the '4' and the '5', changing file semantics if they were different, and throwing an exception if either was a non-digit.

This actually happened, I'm not making this up!  Took me a while to debug.
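
To make the two behaviors concrete, here is a minimal sketch in Python. It is not Marc4J's actual code; the names `EntryMap`, `entry_map_strict` and `entry_map_lenient` are made up for illustration. The only thing taken from the spec is that leader positions 20 and 21 are supposed to hold the digits '4' and '5' in Marc21.

```python
from typing import NamedTuple


class EntryMap(NamedTuple):
    """Hypothetical container for the two widths declared in the leader."""
    length_of_field_width: int  # leader position 20, "4" in Marc21
    start_position_width: int   # leader position 21, "5" in Marc21


MARC21_ENTRY_MAP = EntryMap(4, 5)


def entry_map_strict(leader: str) -> EntryMap:
    """Take the spec at its word: read the widths from the leader,
    and raise if either position is not a digit."""
    if len(leader) != 24:
        raise ValueError("a leader must be exactly 24 characters")
    raw = leader[20:22]
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"non-digit entry map in leader: {raw!r}")
    return EntryMap(int(raw[0]), int(raw[1]))


def entry_map_lenient(leader: str) -> EntryMap:
    """Ignore leader positions 20-21 and assume the Marc21 constants."""
    return MARC21_ENTRY_MAP
```

A record whose leader has blanks in those positions blows up under the first function and sails through the second. Neither function tells the caller which policy it implements unless somebody documents it.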

So do you think he got it wrong? How was he supposed to know he got it wrong? He wrote to the spec and took it at its word. Are you SURE there aren't any Marc formats other than Marc21 out there that actually do use these bytes with their intended meaning, instead of treating them as fixed values? How was the Marc4J author supposed to be sure of that, or even guess it might be the case, and know he'd be serving users better by ignoring the spec here instead of following it? What documents, other than the actual specifications, should he have been looking at to determine that he ought not to be taking those bytes at their word, but just ignoring them?

To realize that we have so much non-conformant data out there that we're better off ignoring these bytes is something you can really only learn through experience -- and something you can then later realize you're wrong on too:

I.e.: I _thought_ I was writing only for Marc21, but then it turns out I've got to accept records from Outer Weirdistan that are a kind of legal Marc that actually uses those bytes for their intended meaning -- better go back and fix my entire software stack, involving various proprietary and open source products from multiple sources, each of which has undocumented behavior when it comes to these bytes. Maybe they follow the spec or maybe they follow Kyle's advice, but they don't tell me. This is a mess.
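
For what "using those bytes for their intended meaning" would do to a parser, here is a rough sketch in Python. It assumes each directory entry is a 3-character tag followed by a length-of-field portion and a starting-character-position portion whose widths come from the leader, and it assumes the implementation-defined portion (leader position 22) is zero. The function name `parse_directory` and the sample entries are invented for illustration.

```python
from typing import List, Tuple


def parse_directory(directory: str,
                    len_width: int = 4,
                    pos_width: int = 5) -> List[Tuple[str, int, int]]:
    """Split a directory string into (tag, field_length, start_position).

    The defaults are the fixed Marc21 values; a format that really varied
    leader positions 20-21 would pass different widths.
    """
    entry_width = 3 + len_width + pos_width
    if len(directory) % entry_width != 0:
        raise ValueError("directory length is not a multiple of entry width")
    entries = []
    for i in range(0, len(directory), entry_width):
        entry = directory[i:i + entry_width]
        tag = entry[:3]
        field_length = int(entry[3:3 + len_width])
        start_position = int(entry[3 + len_width:])
        entries.append((tag, field_length, start_position))
    return entries


# The same 12 characters mean different things under different widths.
as_written = parse_directory("245000120000", len_width=5, pos_width=4)
as_assumed = parse_directory("245000120000")  # Marc21 defaults
```

Here `as_written` is `[("245", 12, 0)]` and `as_assumed` is `[("245", 1, 20000)]`. The entry parses without any error either way, so software that guesses wrong does not fail loudly; it just reads the wrong bytes.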

Maybe this scenario is impossible, maybe there ARE NOT and NEVER HAVE BEEN any Marc variants that actually use leader bytes 20-22 in this way -- how can I determine that? I've just got to guess and hope for the best. The point of specifications in the first place is interoperability, so we know that if all software and data conforms to the spec, then all software and data will interact in expected ways. Once we start guessing at which parts of the spec we really ought to be ignoring....

Again, I realize in the actual environment we've got, this is not a luxury we have. But it's a fault, not a benefit, to have lots of software everywhere behaving in non-compliant ways and creating invalid (according to the spec!) data.
