Re: [CODE4LIB] more on MARC char encoding
Hi list, I am a metadata librarian but not a programmer, so sorry if my question seems naïve. We use an XSLT stylesheet to transform some harvested DC records from DSpace to MARC in MarcEdit, and then export them to OCLC. Some characters do not display correctly and need manual editing, for example:

In MarcEditor | Transferred to OCLC | Edit in OCLC
Bayes’ theorem | Bayes⁰́₉ theorem | Bayes' theorem
―it won‘t happen here‖ attitude | ⁰́₅it won⁰́₈t happen here⁰́₆ attitude | it won't happen here attitude
“Generation Y” | ⁰́₋Generation Y⁰́₊ | Generation Y
listeners‟ evaluations | listeners⁰́ evaluations | listeners' evaluations
high school – from | high school ⁰́₃ from | high school – from
Co₀․₅Zn₀․₅Fe₂O₄ | Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴ | Co0.5Zn0.5Fe2O4?
μ | Îơ | μ
Nafion® | Nafion℗ʼ | Nafion®
Lévy | L©♭vy | Lévy
43±13.20 years | 43℗ł13.20 years | 43±13.20 years
12.6 ± 7.05 ft∙lbs | 12.6 ℗ł 7.05 ft⁸́₉lbs | 12.6 ± 7.05 ft•lbs
‘Pouring on the Pounds' | ⁰́₈Pouring on the Pounds' | 'Pouring on the Pounds'
k-ε turbulence | k-Îæ turbulence | k-ε turbulence
student—neither parents | student⁰́₄neither parents | student-neither parents
Λ = M – {p1, p2,…,pκ} | Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð} | ? (won’t save)
M = (0, δ)x × Y | M = (0, Îþ)x ©₇ Y | ?
100° | 100℗ð | 100⁰
(α ≥16º) | (Îł ⁹́Æ16℗ð) | (α=16⁰)
naïve | na©¯ve | naïve

To deal with this, we normally replace a limited number of characters in MarcEditor first and then do the compiling and transfer. For example: replace ’ with ', “ with ", ” with ", and ‟ with '. I am not sure about the right and efficient way to solve this problem. I see that the XSLT stylesheet specifies encoding="UTF-8". Is there a systematic way to make the characters transform and display correctly? Thank you for your suggestions and feedback! Sophie

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod Olson
Sent: Tuesday, April 17, 2012 10:13 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

In practice it seems to mean UTF-8.
At least I've only seen UTF-8, and I can't imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All of the offsets are byte-oriented, and there's too much legacy code that makes assumptions about null-terminated strings. -Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

Okay, forget XML for a moment, let's just look at MARC 'binary'. First, for Anglophone-centric MARC21. The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:

a - UCS/Unicode: Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.

That doesn't say UTF-8. It says UCS or Unicode. What does that actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called UCS, I think)? Whatever it actually means, do people violate it in the wild? Now we get to non-Anglophone-centric MARC, all of which is ISO 2709, I think? A standard which of course is not open access, so I can't get it to see what it says. But leader 09 being used for
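Tod's point about byte-oriented offsets can be seen directly: ISO 2709 directory entries count bytes, not characters, so any multi-byte encoding makes the two diverge. A quick illustration of the arithmetic (Python used here just to show it; not code from any MARC library):

```python
# In ISO 2709 the directory records each field's length and starting
# position in *bytes*. With a multi-byte encoding such as UTF-8,
# byte length and character length differ, so code that counts
# characters would compute wrong offsets.
field = "Lévy"                  # 4 characters
utf8 = field.encode("utf-8")    # é takes two bytes in UTF-8

print(len(field))               # character count: 4
print(len(utf8))                # byte count: 5 -> what the directory stores
```

This is also why UTF-16 would be hazardous here: every character, including plain ASCII, becomes at least two bytes, and any code assuming one-byte structural characters breaks immediately.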
Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL
Hi. Is there a reason not to attempt this instead through the CLI? Al Matthews, Software Dev, Atlanta University Center

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Rosalyn Metz [rosalynm...@gmail.com]
Sent: Wednesday, April 18, 2012 9:23 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL

Hi Everyone, I posted this over on the Archivists' Toolkit listserv and got no response (yet), so I thought I might try here as well. I have a large quantity (around 300+) of digital objects that I need to add to Archivists' Toolkit. I think I've figured out which queries I need to run in order to do this in MySQL (rather than through the interface), but I wanted to get opinions from the peanut gallery before trying it out on my test instance. It seems that there are actually two insert queries that need to be used when creating a Digital Object. They are:

insert into ArchDescriptionInstances (instanceType, resourceComponentId, resourceId, parentResourceId, instanceDescriminator, archDescriptionInstancesId)
values ('Digital object', 336673, null, 543, 'digital', 22567003)

and...

insert into DigitalObjects (version, lastUpdated, created, lastUpdatedBy, createdBy, title, dateExpression, dateBegin, dateEnd, languageCode, restrictionsApply, eadDaoActuate, eadDaoShow, metsIdentifier, objectType, label, objectOrder, componentId, parentDigitalObjectId, archDescriptionInstancesId, repositoryId)
values (0, '2012-04-17 12:05:15', '2012-04-17 12:05:15', 'username', 'username', 'title', '1938-1959', null, null, '', 0, 'onRequest', 'new', '678.1829', 'text', '', 0, '', null, 22567003, 1)

There also appear to be some update queries as well, but I'm guessing that they are less important (please correct me if I'm wrong). Has anyone tried to do this in the past? If so, do you have scripts that will create Digital Objects for you that you wouldn't mind sharing?
Is there anything you think I should know before testing this out in my test instance of AT? Any caveats for me? Any help anyone can provide would be greatly appreciated. Thanks, Rosalyn - ** The contents of this email and any attachments are confidential. They are intended for the named recipient(s) only. If you have received this email in error please notify the system manager or the sender immediately and do not disclose the contents to anyone or make copies. ** IronMail scanned this email for viruses, vandals and malicious content. ** **
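For a batch of several hundred objects, the pair of INSERTs above could be generated mechanically from a list of records. A rough sketch of that idea in Python: the column names are copied from Rosalyn's post (including Archivists' Toolkit's "instanceDescriminator" spelling), but the abbreviated DigitalObjects column list, the input-record shape, the id-allocation scheme, and the function name are all illustrative assumptions, and real code should use parameterized queries rather than naive string interpolation:

```python
# Sketch: emit paired INSERT statements (as in the message above) for a
# batch of digital objects. The DigitalObjects column list is trimmed to
# the fields being filled in per object (title, METS identifier); values
# are interpolated naively for illustration only -- real code should
# escape values or use parameterized queries.

def build_inserts(objects, start_id=22567003, repository_id=1):
    """Yield (instance_sql, digital_object_sql) for each object dict.

    start_id must be chosen so the generated archDescriptionInstancesId
    values do not collide with rows already in the database.
    """
    for offset, obj in enumerate(objects):
        adi_id = start_id + offset
        instance_sql = (
            "insert into ArchDescriptionInstances "
            "(instanceType, resourceComponentId, resourceId, parentResourceId, "
            "instanceDescriminator, archDescriptionInstancesId) values "
            f"('Digital object', {obj['component_id']}, null, "
            f"{obj['resource_id']}, 'digital', {adi_id})"
        )
        do_sql = (
            "insert into DigitalObjects (version, title, metsIdentifier, "
            "objectType, archDescriptionInstancesId, repositoryId) values "
            f"(0, '{obj['title']}', '{obj['mets_id']}', 'text', "
            f"{adi_id}, {repository_id})"
        )
        yield instance_sql, do_sql
```

The key invariant, visible in the original queries too, is that the same archDescriptionInstancesId value must appear in both statements so the instance row and the digital-object row link up.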
Re: [CODE4LIB] Archivists' Toolkit: Adding Digital Objects via MySQL
Al, Looking at the CLI quickly, it looks like it's related to batch exporting. I'm trying to quickly create 1500+ new digital object records in AT (i.e. go into a resource, click new instance, choose digital object, add the title, add the METS identifier, click save). If I misread what the CLI does, then I'm all ears on how I could use it to help. Thanks! Rosalyn

On Wed, Apr 18, 2012 at 11:13 AM, Al Matthews amatth...@auctr.edu wrote: Hi. Is there a reason not to attempt this instead through the CLI? Al Matthews, Software Dev, Atlanta University Center ...
[CODE4LIB] NYC code4lib meetup: next Weds April 25
Hello all, The code4lib-nyc chapter in conjunction with METRO is holding our somewhat-quarterly jam session: next Wednesday, April 25 10am-noon at the METRO Training Center, 57 E 11th Street, NYC. Come talk about your projects, and find out what everybody's working on! Folks without technical background are welcome to join us. No charge; please register at http://metro.org/events/144/ -- Yitzchak Schaffer Systems Manager Touro College Libraries 212.742.8770 ext. 2432 http://www.tourolib.org/ Access Problems? Contact systems.libr...@touro.edu
Re: [CODE4LIB] more on MARC char encoding
Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is being harvested from DSpace and is UTF8 encoded). It's actually an issue with user-entered data -- specifically, smart quotes and the like. These values obviously are not in the MARC8 character set, and they cause problems for many who transform user-entered data (smart quotes tend to be turned on by default on Windows) from XML to MARC. If you are sticking with a strictly UTF8-based system, there generally are no issues, because these are valid characters. If you move them into a system where the data needs to be represented in MARC8 -- then you have more problems. We do a lot of harvesting, and because of that, we run into these types of issues moving data that is in UTF8, but has characters not represented in MARC8, into Connexion, and having some of that data flattened. Given the wide range of data not in the MARC8 set that can show up in UTF8, it's not a surprise that this would happen. My guess is that you could add a template to your XSLT translation that attempts to filter the most common forms of these smart quotes/values and replace them with the more standard values. Likewise, if there was a great enough need, I could provide a canned cleaner in MarcEdit that could fix many of the most common varieties of these smart quotes/values. --TR

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Thursday, April 19, 2012 11:13 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] more on MARC char encoding

If your records are really in MARC8, not UTF8, your best bet is to use a tool to convert them to UTF8 before hitting your XSLT. The open source 'yaz' command line tools can do it for MARC21. The Marc4J package can do it in Java, and probably works for any MARC variant, not just MARC21. Char encoding issues are tricky.
You might want to first figure out if your records are really in MARC8, thus the problems, or if instead they illegally contain bad data or data in some other encoding (e.g. Latin-1). Char encoding is a tricky topic; you might want to do some reading on it in general. The Unicode docs are pretty decent.

On 4/19/2012 11:06 AM, Deng, Sai wrote: Hi list, I am a Metadata librarian but not a programmer, sorry if my question seems naïve. We use XSLT stylesheet to transform some harvested DC records from DSpace to MARC in MarcEdit, and then export them to OCLC. Some characters do not display correctly and need manual editing ...
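The template or canned cleaner Terry describes boils down to a small substitution table. A minimal sketch of the same idea in Python rather than XSLT; the mapping covers only punctuation from Sophie's examples, and the function name is made up for illustration:

```python
# Sketch: normalize common "smart" punctuation to the plain ASCII
# equivalents catalogers would otherwise substitute by hand before
# sending records to a MARC8-bound system. The mapping covers only
# characters from the examples in this thread; extend as needed.
SMART_PUNCTUATION = {
    "\u2018": "'",   # ‘ left single quotation mark
    "\u2019": "'",   # ’ right single quotation mark
    "\u201c": '"',   # “ left double quotation mark
    "\u201d": '"',   # ” right double quotation mark
    "\u201f": "'",   # ‟ double high-reversed-9 quotation mark
    "\u2013": "-",   # – en dash
    "\u2014": "-",   # — em dash
}

def normalize_punctuation(text: str) -> str:
    """Replace smart punctuation with plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_PUNCTUATION))
```

Characters with no reasonable ASCII fallback (Greek letters, ±, subscripts in chemical formulas) are a separate problem; a table like this only handles the punctuation cases.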
Re: [CODE4LIB] more on MARC char encoding
Ah, thanks Terry. That canned cleaner in MarcEdit sounds potentially useful -- I'm in a continuing battle to keep the character encoding in our local MARC corpus clean. (The real blame here is on cataloger interfaces that let catalogers save data containing bytes that are illegal for the character set it's being saved as. And/or that display the data back to the cataloger using a translation that lets it show up as expected even though it is _wrong_ for the character set being saved as. Connexion is theoretically the Rolls-Royce of cataloger interfaces; does it do this? Gosh, I hope not.)

On 4/19/2012 2:20 PM, Reese, Terry wrote: Actually -- the issue isn't one of MARC8 versus UTF8 (since this data is being harvested from DSpace and is UTF8 encoded). It's actually an issue with user entered data -- specifically, smart quotes and the like. ...
Re: [CODE4LIB] more on MARC char encoding
We see Unicode data pasted into MARC8 records all the time. It happens enough that my MARC8-to-Unicode converter takes a second look at illegal MARC8 bytes and tries a UTF-8 decoding as well. Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Thursday, April 19, 2012 3:14 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: more on MARC char encoding

Ah, thanks Terry. That canned cleaner in MarcEdit sounds potentially useful -- I'm in a continuing battle to keep the character encoding in our local marc corpus clean. ...
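Ralph's fallback strategy can be sketched roughly: when a byte sequence is illegal for the encoding you believe you have, retry it as UTF-8 before flattening or rejecting it. A toy version follows; since Python has no MARC8 codec, plain ASCII stands in for the "believed" legacy encoding, and a real converter would consult full MARC8 tables:

```python
def decode_with_utf8_fallback(raw: bytes) -> str:
    """Decode bytes believed to be in a legacy encoding, retrying as
    UTF-8 when they contain sequences the legacy set can't explain --
    a rough analogue of a converter that 'takes a second look' at
    illegal bytes. ASCII stands in for MARC8 in this sketch.
    """
    try:
        # Everything legal in the believed (legacy) character set.
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Illegal for the believed encoding: perhaps someone pasted
        # UTF-8 data into the record. Retry before giving up.
        return raw.decode("utf-8", errors="replace")
```

For example, the byte sequence b'Bayes\xe2\x80\x99 theorem' is illegal as ASCII but is valid UTF-8 for "Bayes’ theorem", so the fallback recovers the intended right single quote instead of producing mojibake.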
Re: [CODE4LIB] more on MARC char encoding
On 4/19/2012 3:23 PM, LeVan, Ralph wrote: We see Unicode data pasted into MARC8 records all the time. It happens enough that my MARC8-Unicode converter takes a second look at illegal MARC8 bytes and tries a UTF-8 encoding as well.

Right. I see it too. I'm arguing that this means cataloger entry tools -- the tools catalogers are using when they paste that stuff in -- are not giving the cataloger sufficient feedback on their entry. They should flag completely illegal byte sequences in the output encoding and not let them be saved, and should display cataloger input back _as interpreted in the current encoding_, so catalogers get immediate visual feedback if they're entering bytes that don't mean what they think in the operative output encoding. I think it's possible that _no_ cataloger interfaces actually do this (although if any do, I bet it's MarcEdit). If Connexion doesn't, for interactive cataloger entry, it'd be awfully nice if it did.
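The save-time check Jonathan is asking for amounts to: before writing the record, verify every character is representable in the output character set, and show the cataloger exactly which ones are not. A sketch of that validation; Latin-1 stands in for MARC8 here (Python has no MARC8 codec), and the function name is made up:

```python
def unrepresentable_chars(text: str, target_encoding: str = "latin-1"):
    """Return (position, char) pairs that cannot be saved in the target
    encoding -- the feedback a cataloging interface could surface before
    allowing a save. Latin-1 is a stand-in for MARC8 in this sketch; a
    real check would test against the MARC8 repertoire.
    """
    bad = []
    for i, ch in enumerate(text):
        try:
            ch.encode(target_encoding)
        except UnicodeEncodeError:
            bad.append((i, ch))
    return bad

# A smart quote pasted from a PDF gets flagged before the save:
# unrepresentable_chars("Bayes’ theorem") -> [(5, '’')]
```

An interface using this could highlight the flagged positions in the editor, which is exactly the immediate visual feedback being asked for.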
Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21
On 4/18/2012 12:08 PM, Jonathan Rochkind wrote: On 4/18/2012 11:09 AM, Doran, Michael D wrote: I don't believe that is the case. Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters. A character such as an n with a tilde would consist of two bytes. The Greek small letter alpha, if invoked in accordance with ANSI X3.41, would consist of five bytes (two bytes for the initial escape sequence, a byte for the character, and then two bytes for the escape sequence returning to the default character set).

ISO 2709 doesn't care how many bytes your characters are. The directory and offsets and other things count bytes, not characters (which was, in my opinion, the _right_ decision, for once with MARC!). How bytes translate into characters is not a concern of ISO 2709. The majority of non-7-bit-ASCII encodings will have chars that are more than one byte, either sometimes or always. This is true of MARC8 (some chars), UTF8 (some chars), and UTF16 (all chars). (It is not true of Latin-1, though, for instance, I don't think.) ISO 2709 doesn't care what char encodings you use, and there's no standard ISO 2709 way to determine what char encodings are used for _data_ in the MARC record. ISO 2709 does say that _structural_ elements like field names, subfield names, the directory itself, separator chars, etc., all need to be (essentially, over-simplifying) 7-bit ASCII. The actual data itself is application dependent; 2709 doesn't care, and 2709 doesn't give any standard cross-2709 way to determine it. That is my conclusion at the moment, helped by all of you in this thread, thanks!

The conclusion that I came to in the work I have done on marc4j (which is used heavily by SolrMarc) is that for any significant processing of MARC records, the only solution that makes sense is to translate the record data into Unicode characters as the record is being read in.
Of course, as you and others have stated, determining what the data actually is, in order to correctly translate it to Unicode, is no easy task. The leader byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong often enough in the real world that it is of little value when it indicates "is UTF-8", and of even less value when it indicates "is not UTF-8". Significant portions of the code I've added to marc4j deal with trying to determine what the encoding of the data actually is, and with trying to translate the data correctly into Unicode even when the data is incorrect.

You also argued in another message that cataloger entry tools should give feedback to help the cataloger not create errors. I agree. I think one possible step towards this would be that the editor must work in Unicode, irrespective of the data format the underlying system expects the data to be in. If the underlying system expects MARC8, then the save-as process should be able to translate the data into MARC8 on output. -Robert Haschart
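The sniffing Robert describes -- distrusting the leader byte and inspecting the bytes themselves -- is workable because valid multi-byte UTF-8 is very unlikely to arise by accident in legacy-encoded data. A heavily simplified sketch of that heuristic (marc4j's actual detection does considerably more; this function and its name are illustrative only):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: if the data decodes cleanly as UTF-8 *and* actually
    uses multi-byte sequences, trust that over the leader byte.

    Pure 7-bit ASCII is valid in both UTF-8 and MARC8, so it gives no
    evidence either way; this sketch returns False for it. Simplified:
    marc4j's real detection is considerably more involved.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False                    # legacy encoding (e.g. MARC8/Latin-1)
    return any(b > 0x7F for b in raw)   # multi-byte UTF-8 actually present

# b'L\xc3\xa9vy' (UTF-8 "Lévy")   -> True
# b'L\xe9vy'   (Latin-1 "Lévy")   -> False: 0xE9 0x76 is not valid UTF-8
```

The second example shows why the check is discriminating: a Latin-1 high byte is almost never followed by the continuation bytes UTF-8 requires, so misidentification is rare on realistic field data.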
[CODE4LIB] ruby-marc, better ruby 1.9 char encoding support, testers wanted
I have implemented fairly complete and robust support for character encodings in ruby-marc when reading 'binary' MARC under ruby 1.9. It's currently in a git branch, not yet released, and not yet in git master. https://github.com/ruby-marc/ruby-marc/tree/char_encodings

If anyone who uses this (or doesn't) has a chance to beta test it, it would be appreciated. One way to test: check out with git, switch to the 'char_encodings' branch, and `rake install` to install as a gem on your system.

These changes should _only_ affect use under ruby 1.9, and only affect reading 'binary' (ISO 2709) MARC. The new functionality is pretty extensively covered by automated tests, but there are some weird and complex interactions that can occur depending on exactly what you're doing, so bugs are possible. It was somewhat more complicated than one might expect to implement a complete solution here, in part because we _do_ have international users of ruby-marc, with encodings that are neither MARC8 nor UTF8, and in fact non-MARC21. If any of the other committers (or anyone else) wants to code review, you are welcome to.

POSSIBLE BACKWARDS INCOMPAT

Some previous 0.4.x versions, when running under ruby 1.9 only, would automatically _transcode_ non-Unicode encodings to UTF-8 for you under the hood. The new version no longer does so automatically (although you can ask it to). It was not tenable to support that backwards-compatibly. Everything else _ought_ to be backwards compatible with previous 0.4.x ruby-marc under ruby 1.9, fixing many problems.

NEW FEATURES

All applying to ruby 1.9 only, and to reading binary MARC only.

* Does a pretty good job of setting encodings properly for your ruby environment, especially under standard UTF-8 usage.
* You _can_ and _do have to_ provide an argument for reading non-UTF8 encodings (but sadly no support for MARC8).
* You can ask MARC::Reader to transcode to a different encoding when loading MARC.
* You can ask MARC::Reader to replace bytes that are illegal in the believed source encoding with a replacement character (or the empty string), to avoid ruby invalid-UTF-8-byte exceptions later and sanitize your input.

New features are documented in inline comments, see: http://rubydoc.info/github/ruby-marc/ruby-marc/MARC/Reader

I had trouble making the docs concise, sorry -- I've been pounding my head against this stuff so much, realizing how complicated it ends up being, that I wasn't sure what to leave out.
[CODE4LIB] Job: Head of Metadata Services at Georgetown University
Head of Metadata Services

Georgetown University Library is seeking a dynamic, forward-thinking, innovative, energetic and team-oriented person to serve as Head of the Metadata Services Unit within the Technical Services Department. The successful candidate will have overall responsibility for providing innovative leadership, vision, planning, and supervision for cataloging and metadata services. The incumbent will set priorities; allocate resources; develop plans, policies and practices within the unit/department; supervise operations for original and copy cataloging of print, multi-media resources, special collections/rare books, electronic monographs, serials, and databases using MARC or other metadata formats; oversee physical processing functions; provide leadership for knowledgeable staff in an environment of anticipated change; create a positive work environment; deliver digital initiatives support; monitor national and international trends in metadata creation and direct on-going review and revision of library-wide metadata/cataloging policies and procedures; serve as the resource person for all Library staff, answering inquiries and providing interpretations on existing and emerging metadata standards and rules; collaborate and work with other library units to create metadata for digital and special collections; oversee the Library's participation in cooperative metadata endeavors such as NACO; serve as a member of the Technical Services Department's Management Team; and serve on library and university-wide committees, task forces, and initiatives as required. This position reports directly to the Head of Technical Services. Directly reporting to this position are 2 catalogers and 1 Receiving/Copy Cataloging Supervisor, 4 indirect reports, and 1-3 student(s). Additional indirect reporting may also include staff that perform metadata creation work within other departments, copy cataloging, and physical processing. Work is performed according to priorities set by the Department Head and within guidelines and procedures established for the Department.

Qualifications: The candidate must have an ALA-accredited MLS degree and at least 2 years of progressively increasing supervisory/management/leadership experience, along with demonstrated knowledge of and experience with the provision of metadata/cataloging services, including those related to digital initiatives, within an academic or research library setting. The candidate must demonstrate excellent verbal and written skills. Experience working with metadata creation for institutional repositories is highly preferred. Working knowledge of MARC21 and non-MARC metadata schema, including but not limited to metadata formats such as Dublin Core, EAD, METS, MODS, OAI, and XML, is required. Familiarity with data interchange standards (e.g., OAI-PMH); knowledge of the semantic web and linked data; experience with digital content management systems such as DSpace and CONTENTdm; knowledge of current standards such as AACR2, LCSH, LC Classification, NACO and forthcoming changes with FRBR, RDA and MARC; and emerging technologies in cataloging services, including those related to digital libraries and special collections, are highly preferred.

Salary/Benefits/Rank: Salary commensurate with experience. Comprehensive benefits package including 21 days paid leave per year; medical; TIAA/CREF; tuition assistance. This is a 12-month, Academic/Administrative Professional (AAP) appointment. Apply online at www.library.georgetown.edu/employment. Review of applications begins immediately and continues until filled. Georgetown University is an Equal Opportunity, Affirmative Action Employer.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/898/