Hi Sophie,

> To better understand the character encoding issue, can anybody
> point me to some resources or list like UTF8 encoded data but
> not in the MARC8 character set?
That question doesn't lend itself to an easy answer. The full MARC-8
repertoire (when you include all of the alternate character sets) has over
16,000 characters. The latest version of Unicode consists of a repertoire
of more than 110,000 characters. So a list of UTF-8 encoded data not in
the MARC-8 character set would be a pretty long list.

For a more *general* understanding of character encoding issues, I would
recommend the following resources:

For a quick library-centric overview, the "Coded Character Sets: A
Technical Primer for Librarians" web page [1]. Included is a page on
"Resources on the Web", which has an emphasis on library automation and
the internet environment [2].

For a good explanation of how character sets work in relational databases
(as part of the more general topic of globalization/i18n), the Oracle
"Globalization Support Guide" [3].

For all the ins and outs of Unicode, the book "Unicode Explained" by
Jukka Korpela [4].

-- Michael

[1] http://rocky.uta.edu/doran/charsets/
[2] http://rocky.uta.edu/doran/charsets/resources.html
[3] http://docs.oracle.com/cd/B19306_01/server.102/b14225/toc.htm
[4] http://www.amazon.com/gp/product/059610121X/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Deng, Sai
> Sent: Friday, April 20, 2012 8:55 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] more on MARC char encoding
>
> If a canned cleaner can be added in MarcEdit to deal with "smart
> quotes/values," that will be great! Besides the smart quotes, please
> consider other special characters, including chemistry and mathematics
> symbols (these are different types of special characters, right?)
> To better understand the character encoding issue, can anybody point me
> to some resources or list like UTF8 encoded data but not in the MARC8
> character set? Thanks a lot.
> Sophie
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Jonathan Rochkind
> Sent: Thursday, April 19, 2012 2:14 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] more on MARC char encoding
>
> Ah, thanks Terry.
>
> That canned cleaner in MarcEdit sounds potentially useful -- I'm in a
> continuing battle to keep the character encoding in our local MARC
> corpus clean.
>
> (The real blame here is on cataloger interfaces that let catalogers
> save data containing bytes that are illegal for the character set it's
> being saved as. And/or that display the data back to the cataloger
> using a translation that lets it show up as expected even though it is
> _wrong_ for the character set it's being saved as. Connexion is
> theoretically the Rolls-Royce of cataloger interfaces; does it do this?
> Gosh, I hope not.)
>
> On 4/19/2012 2:20 PM, Reese, Terry wrote:
> > Actually -- the issue isn't one of MARC8 versus UTF8 (since this data
> > is being harvested from DSpace and is UTF8 encoded). It's actually an
> > issue with user-entered data -- specifically, smart quotes and the
> > like. These values obviously are not in the MARC8 character set and
> > cause problems for many who transform user-entered data (smart quotes
> > tend to be inserted by default on Windows) from XML to MARC. If you
> > are sticking with a strictly UTF8-based system, there generally are
> > no issues, because these are valid characters. If you move them into
> > a system where the data needs to be represented in MARC -- then you
> > have more problems.
> >
> > We do a lot of harvesting, and because of that, we run into these
> > types of issues when moving data that is in UTF8, but has characters
> > not represented in MARC8, into Connexion and having some of that data
> > flattened.
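[Editorial aside: the "canned cleaner" Terry and Jonathan are discussing is easy to prototype. A minimal Python sketch of such a smart-punctuation cleaner; the mapping below is illustrative only (it mirrors the hand replacements Sophie describes elsewhere in the thread, not MarcEdit's actual table), and the function name is mine:]

```python
# Map common Windows "smart" punctuation to plain ASCII equivalents
# before the data heads toward MARC-8. Illustrative mapping only,
# not MarcEdit's actual cleaner table.
SMART_PUNCT = {
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u201C': '"',   # left double quotation mark
    '\u201D': '"',   # right double quotation mark
    '\u201F': "'",   # double high-reversed-9 quotation mark (per the
                     #   "listeners' evaluations" fix in this thread)
    '\u2013': '-',   # en dash
    '\u2014': '-',   # em dash
}

def flatten_smart_punct(text: str) -> str:
    """Replace smart punctuation with plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_PUNCT))
```

A real cleaner would need a much longer table (Sophie's examples include math and chemistry symbols that have no ASCII equivalent at all), but for quotes and dashes a translate table is enough.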
> > Given the wide range of data not in the MARC8 set that can show up in
> > UTF8, it's not a surprise that this would happen. My guess is that
> > you could add a template to your XSLT translation that attempts to
> > filter the most common forms of these "smart quotes/values" and
> > replace them with the more standard values. Likewise, if there was a
> > great enough need, I could provide a canned cleaner in MarcEdit that
> > could fix many of the most common varieties of these "smart
> > quotes/values".
> >
> > --TR
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
> > Of Jonathan Rochkind
> > Sent: Thursday, April 19, 2012 11:13 AM
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] more on MARC char encoding
> >
> > If your records are really in MARC8, not UTF8, your best bet is to
> > use a tool to convert them to UTF8 before hitting your XSLT.
> >
> > The open source 'yaz' command line tools can do it for MARC21.
> >
> > The Marc4J package can do it in Java, and will probably work for any
> > MARC variant, not just MARC21.
> >
> > Char encoding issues are tricky. You might want to first figure out
> > whether your records are really in MARC8, thus the problems, or
> > whether they instead illegally contain bad data or data in some other
> > encoding (e.g. Latin-1).
> >
> > Char encoding is a tricky topic; you might want to do some reading on
> > it in general. The Unicode docs are pretty decent.
> >
> > On 4/19/2012 11:06 AM, Deng, Sai wrote:
> >> Hi list,
> >> I am a Metadata librarian but not a programmer, sorry if my question
> >> seems naïve. We use an XSLT stylesheet to transform some harvested
> >> DC records from DSpace to MARC in MarcEdit, and then export them to
> >> OCLC.
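[Editorial aside: Jonathan's "first figure out what you actually have" advice can start with a cheap byte-level test. A rough stdlib sketch (the function name is mine); note that decoding successfully as UTF-8 doesn't prove the data is *correct*, and a failure doesn't distinguish MARC-8 from Latin-1:]

```python
def sniff_encoding(raw: bytes) -> str:
    """Crude triage: does the byte stream at least decode as UTF-8?
    A pass means the bytes are *valid* UTF-8 (not necessarily correct);
    a failure only tells you the record is not UTF-8, not whether it is
    MARC-8, Latin-1, or something else."""
    try:
        raw.decode('utf-8')
        return 'decodes as UTF-8'
    except UnicodeDecodeError:
        return 'not UTF-8 (could be MARC-8, Latin-1, ...)'
```

Running this over a whole file of records quickly shows whether a corpus is uniformly one thing or an illegal mix, which is usually the first surprise.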
> >> Some characters do not display correctly and need manual editing,
> >> for example (In MarcEditor | Transferred to OCLC | Edit in OCLC):
> >>
> >>   Bayes’ theorem | Bayes⁰́₉ theorem | Bayes' theorem
> >>   ―it won‘t happen here‖ attitude | ⁰́₅it won⁰́₈t happen here⁰́₆ attitude | "it won't happen here" attitude
> >>   “Generation Y” | ⁰́₋Generation Y⁰́₊ | "Generation Y"
> >>   listeners‟ evaluations | listeners⁰́Ÿ evaluations | listeners' evaluations
> >>   high school – from | high school ⁰́₃ from | high school – from
> >>   Co₀․₅Zn₀․₅Fe₂O₄ | Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴ | Co0.5Zn0.5Fe2O4?
> >>   μ | Îơ | μ
> >>   Nafion® | Nafion℗ʼ | Nafion®
> >>   Lévy | L©♭vy | Lévy
> >>   43±13.20 years | 43℗ł13.20 years | 43±13.20 years
> >>   12.6 ± 7.05 ft∙lbs | 12.6 ℗ł 7.05 ft⁸́₉lbs | 12.6 ± 7.05 ft•lbs
> >>   ‘Pouring on the Pounds' | ⁰́₈Pouring on the Pounds' | 'Pouring on the Pounds'
> >>   k-ε turbulence | k-Îæ turbulence | k-ε turbulence
> >>   student—neither parents | student⁰́₄neither parents | student-neither parents
> >>   Λ = M – {p1, p2,…,pκ} | Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð} | ? (won’t save)
> >>   M = (0, δ)x × Y | M = (0, Îþ)x ©₇ Y | ?
> >>   100° | 100℗ð | 100⁰
> >>   (α ≥16º) | (Îł ⁹́Æ16℗ð) | (α>=16⁰)
> >>   naïve | na©¯ve | naïve
> >>
> >> To deal with this, we normally replace a limited number of
> >> characters in MarcEditor first and then do the compiling and
> >> transfer. For example: replace ’ with ', “ with ", ” with ", and ‟
> >> with '. I am not sure about the right and efficient way to solve
> >> this problem. I see that the XSLT stylesheet specifies
> >> encoding="UTF-8". Is there a systematic way to make the characters
> >> transform and display correctly? Thank you for your suggestions and
> >> feedback!
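[Editorial aside: one slice of Sophie's table, the sub/superscript formulas, can be handled generically rather than character by character. Unicode compatibility normalization (NFKC) folds subscript digits and the one-dot leader down to plain ASCII, which is exactly the "Edit in OCLC" form of the chemical-formula row; it does nothing for smart quotes or the mojibake column. A stdlib sketch (function name is mine):]

```python
import unicodedata

def fold_compat(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC).

    Folds compatibility characters such as subscript digits
    (U+2080..U+2089) and the one-dot leader (U+2024) to their plain
    ASCII equivalents. Leaves smart quotes and accented letters
    (which have no compatibility decomposition) untouched.
    """
    return unicodedata.normalize('NFKC', text)
```

So NFKC turns Co₀․₅Zn₀․₅Fe₂O₄ into Co0.5Zn0.5Fe2O4, but Bayes’ keeps its curly quote; a full cleaner would combine this with an explicit punctuation-replacement table.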
> >>
> >> Sophie
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
> >> Of Tod Olson
> >> Sent: Tuesday, April 17, 2012 10:13 PM
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about
> >> ISO 2709 and MARC21
> >>
> >> In practice it seems to mean UTF-8. At least I've only seen UTF-8,
> >> and I can't imagine the code that processes this stuff being safe
> >> for UTF-16 or UTF-32. All of the offsets are byte-oriented, and
> >> there's too much legacy code that makes assumptions about
> >> null-terminated strings.
> >>
> >> -Tod
> >>
> >> On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:
> >>
> >>> Okay, forget XML for a moment, let's just look at MARC 'binary'.
> >>>
> >>> First, for Anglophone-centric MARC21.
> >>>
> >>> The LC docs don't actually say quite what I thought about leader
> >>> byte 09, used to advertise encoding:
> >>>
> >>>   a - UCS/Unicode
> >>>   Character coding in the record makes use of characters from the
> >>>   Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
> >>>   industry subset.
> >>>
> >>> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> >>> actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer
> >>> to what used to be called "UCS", I think)? Whatever it actually
> >>> means, do people violate it in the wild?
> >>>
> >>> Now we get to non-Anglophone-centric MARC, all of which I think is
> >>> ISO 2709? A standard which of course is not open access, so I can't
> >>> get it to see what it says.
> >>>
> >>> But leader 09 being used for encoding -- is that MARC21-specific,
> >>> or is it true of any ISO 2709? MARC8 and "Unicode" being the only
> >>> valid encodings can't be true of any ISO 2709, right?
> >>>
> >>> Is there a generic ISO 2709 way to deal with this, or not so much?