Jonathan,

Marc4j does handle this case. It is not implemented in the AnselToUnicode class; instead it exists in the MarcPermissiveStreamReader (and is only enabled when permissive reading is enabled). I'm not sure there is a good reason that it is done there instead of in the AnselToUnicode class. It will even match some slightly broader patterns (such as &#x200f%x; or &#200f ), both of which I have encountered in data in our system.
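
For anyone wanting to try that path, here's a minimal sketch of reading MARC8 data through the permissive reader -- check the constructor arguments against your marc4j version, and the file name is just a placeholder:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.marc4j.MarcPermissiveStreamReader;
    import org.marc4j.marc.Record;

    public class PermissiveReadExample {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("records.mrc")) {
                // second arg enables permissive reading, third converts MARC8 -> UTF-8;
                // the character-reference un-escaping described above only happens on this path
                MarcPermissiveStreamReader reader =
                        new MarcPermissiveStreamReader(in, true, true);
                while (reader.hasNext()) {
                    Record record = reader.next();
                    System.out.println(record.getControlNumber());
                }
            }
        }
    }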

The corresponding UnicodeToAnsel converter relies on this feature to ensure data is round-trippable.

-Bob Haschart


On 11/5/2013 4:04 PM, Jonathan Rochkind wrote:
Do you sometimes deal with MARC in the MARC8 character encoding? Do you deal with software that converts from MARC8 to UTF8?

Maybe sometimes you've seen weird escape sequences that look like HTML or XML "character references", like, say "&#x200F;".

You, like me, might wonder what the heck that is about -- is it cataloger error, did a cataloger manually enter this or something in error? Is it a software error, did some software accidentally stick this in at some point in the pipeline?

You can't, after all, just put HTML/XML character references wherever you want -- there's no reason "&#x200F;" would mean anything other than &, #, x, 2, etc., when embedded in MARC ISO 2709 binary, right?

Wrong, it turns out!

There is actually a standard that says you _can_ embed XML/HTML-style character references in MARC8, for glyphs that can't otherwise be represented in MARC8. "Lossless conversion [from unicode] to MARC-8 encoding."

http://www.loc.gov/marc/specifications/speccharconversion.html#lossless
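
To make that concrete, the idea is that a code point MARC-8 has no mapping for gets written out as a plain ASCII numeric character reference. A hypothetical little helper (my sketch, not code from the spec or from Marc4J) would look like:

    // hypothetical sketch: a code point MARC-8 can't encode is emitted
    // as an ASCII numeric character reference
    static String toNcr(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }
    // toNcr(0x200F) -> "&#x200F;"  (RIGHT-TO-LEFT MARK, the character in the examples above)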

Phew, who knew?!

Software that converts from MARC8 to UTF-8 may or may not properly un-escape these character references though. For instance, the Marc4J "AnselToUnicode" class, which converts from Marc8 to UTF8 (or other Unicode serializations), won't touch these "lossless conversions" (i.e., HTML/XML character references); it leaves them alone in the output, as-is.

yaz-marcdump also will NOT un-escape these entities when converting from Marc8 to UTF8.

So the system you then import your UTF8 records into will just display the literal HTML/XML-style character reference; it won't know to un-escape them either, since those literals in UTF8 really _do_ just mean & followed by a # followed by an x, etc. They only mean something special as literals in HTML, or in XML -- or, it turns out, in MARC8, as a 'lossless character conversion'.

So, for instance, in my own Traject software that uses Marc4J to convert from Marc8 to UTF8 -- I'm going to have to add another pass that converts HTML/XML character references to actual UTF8 serializations. Phew.
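
(Traject is Ruby, so the real thing won't look like this, but here's a rough Java sketch of the kind of pass I mean; the helper name and regex are just mine, not anything from Marc4J:)

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NcrUnescape {
        // matches &#xHHHH; (hex) and &#DDDD; (decimal) numeric character references
        private static final Pattern NCR =
                Pattern.compile("&#[xX]([0-9A-Fa-f]{1,6});|&#([0-9]{1,7});");

        static String unescapeNcrs(String s) {
            Matcher m = NCR.matcher(s);
            StringBuffer out = new StringBuffer();
            while (m.find()) {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                m.appendReplacement(out,
                        Matcher.quoteReplacement(new String(Character.toChars(cp))));
            }
            m.appendTail(out);
            return out.toString();
        }

        public static void main(String[] args) {
            String fixed = unescapeNcrs("some title &#x200F;");
            // prints 12: the 8-character reference collapsed to a single code point
            System.out.println(fixed.length());
        }
    }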

So be warned, you may need to add this to your software too.
