On behalf of Charles Riley:

---------- Forwarded message ----------
From: Riley, Charles <[email protected]>
Date: 23 February 2016 at 05:37
Subject: [camms-ccaam] Common encoding errors
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>, "[email protected]" <[email protected]>

Hi all,

This is something I’ve noticed happening fairly regularly, and probably with increasing frequency, lately: a class of problems with records containing either escaped entity references from HTML or XML (like ‘ ’), or accented characters that have become corrupted in a data migration (like ‘français <https://openlibrary.org/works/OL10004281W/Les_archets_français>‘).

Another librarian asked whether I could point them to any resources that deal with this class of issues, and I rounded up a few that I thought would be good to share. Here’s what I came across, in terms of examples and explanations for some of the more common cases:

http://markmcb.com/2011/11/07/replacing-ae%E2%80%9C-ae%E2%84%A2-aeoe-etc-with-utf-8-characters-in-ruby-on-rails/

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
(But use this list with caution as a source of search terms; a search for ‘amp;’, for example, will return false positives.)

http://www.i18nqa.com/debug/utf8-debug.html
(See also the associated links on this page.)

Hope this helps!

Charles Riley

*Charles Riley*
*Interim Librarian for African Studies and Catalog Librarian*
*Sterling Memorial Library*
*Yale University*
*[email protected]*
*(203)432-7566 or (203)432-9301*

--
Andrew Cunningham
[email protected]
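
As a rough illustration of the two problem classes described in the message, here is a minimal Python sketch. It is not taken from any of the linked resources. The function names are hypothetical, and the repair step assumes the corruption came from UTF-8 bytes being misread as Windows-1252, which is only one of several ways such damage can arise.

```python
import html


def unescape_entities(text: str) -> str:
    """Turn HTML/XML character references (e.g. '&ccedil;') into characters."""
    return html.unescape(text)


def repair_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: the damage is a single round of UTF-8 read as cp1252.
    If the text does not fit that pattern, it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# 'fran\u00c3\u00a7ais' is what 'français' looks like after the misreading.
corrupted = "fran\u00c3\u00a7ais"
print(repair_mojibake(corrupted))
print(unescape_entities("Les archets fran&ccedil;ais"))
```

Text that is already clean passes through `repair_mojibake` unchanged, because re-encoding it as Windows-1252 rarely yields valid UTF-8. Even so, a repair like this is best reviewed record by record before being applied to a whole catalog.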
