Jonathan,
Marc4j does handle this case. It is not implemented in the
AnselToUnicode class; instead it lives in the
MarcPermissiveStreamReader (and is only active when permissive
reading is enabled). I'm not sure there is a good reason that it is done
there instead of in the AnselToUnicode class. It will even match some
slightly broader patterns (such as &#x200F%x; or Èf ), both of which
I have encountered in data in our system.
The corresponding UnicodeToAnsel converter relies on this feature to
ensure data is roundtrip-able.
-Bob Haschart
On 11/5/2013 4:04 PM, Jonathan Rochkind wrote:
Do you sometimes deal with MARC in the MARC8 character encoding?
Do you deal with software that converts from MARC8 to UTF8?
Maybe sometimes you've seen weird escape sequences that look like HTML
or XML "character references", like, say "&#x200F;".
You, like me, might wonder what the heck that is about -- is it
cataloger error, a cataloger manually entering this or something in
error? Is it a software error, some software accidentally sticking this
in at some point in the pipeline?
You can't, after all, just put HTML/XML character references wherever
you want -- there's no reason "&#x200F;" would mean anything other
than &, #, x, 2, etc, when embedded in MARC ISO 2709 binary, right?
Wrong, it turns out!
There is actually a standard that says you _can_ embed XML/HTML-style
character references in MARC8, for glyphs that can't otherwise be
represented in MARC8. "Lossless conversion [from unicode] to MARC-8
encoding."
http://www.loc.gov/marc/specifications/speccharconversion.html#lossless
Phew, who knew?!
Software that converts from MARC8 to UTF-8 may or may not properly
un-escape these character references though. For instance, the Marc4J
"AnselToUnicode" class, which converts from Marc8 to UTF8 (or other
unicode serializations), won't touch these "lossless conversions" (i.e.,
HTML/XML character references); it will leave them alone in the
output, as is.
yaz-marcdump also will NOT un-escape these entities when converting
from Marc8 to UTF8.
So the system you then import your UTF8 records into will now
just display the literal HTML/XML-style character references. It won't
know to un-escape them either, since those literals in UTF8 really
_do_ just mean & followed by a # followed by an x, etc. They only mean
something special as literals in HTML, or in XML -- or, it turns out,
in MARC8, as a 'lossless character conversion'.
So, for instance, in my own Traject software that uses Marc4J to
convert from Marc8 to UTF8 -- I'm going to have to go add another
pass that converts HTML/XML character references to actual UTF8
serializations. Phew.
So be warned, you may need to add this to your software too.
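
For anyone wondering what that extra pass might look like, here is a
rough sketch in Python. It is not what Marc4J or Traject actually do.
The function name unescape_marc8_ncrs is made up for illustration, and
the pattern assumes only the hexadecimal form with 4 to 6 digits and a
trailing semicolon counts as a lossless escape. Check that assumption
against the LoC spec and your own data before relying on it.

```python
import re

# Hex numeric character references such as "&#x200F;".
# Assumption: 4 to 6 hex digits, terminated by a semicolon.
_NCR = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def unescape_marc8_ncrs(text: str) -> str:
    """Replace hex character references in already-converted text
    with the Unicode characters they stand for."""

    def _replace(match: "re.Match[str]") -> str:
        code_point = int(match.group(1), 16)
        # Leave anything that is not a valid Unicode scalar value alone:
        # values past U+10FFFF and the surrogate range U+D800..U+DFFF.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _NCR.sub(_replace, text)


if __name__ == "__main__":
    sample = "Title with a mark &#x200F; inside"
    print(repr(unescape_marc8_ncrs(sample)))
```

Run this only on text that came out of a MARC8-to-UTF8 conversion. In a
record that was UTF8 to begin with, the same character sequence really
is just literal text, and a pass like this would corrupt it.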