RE: How to convert from ANSEL/MARC-8 to UTF-8?
From: Galen Charlton [mailto:galen.charl...@liblime.com]
Sent: Wednesday, January 07, 2009 11:47 AM
To: Michael Lackhoff
Cc: perl4lib@perl.org
Subject: Re: How to convert from ANSEL/MARC-8 to UTF-8?

> On Wed, Jan 7, 2009 at 11:42 AM, Michael Lackhoff
> <lackh...@fh-muenster.de> wrote:
> > diacritics + base char to the combined character. So I still have two
> > characters for e.g. the German umlauts. This might be correct UTF-8
> > but is not usable to present in (X)HTML.

I just cannot let that go. UTF-8 *is* Unicode, encoded in a particular way. Whether the characters are combined or uncombined is not relevant to (X)HTML, so long as you specify that the document is encoded in a Unicode encoding (e.g., UTF-8, UTF-16BE, UTF-16LE) and the user agent (e.g., a browser) understands Unicode, which I believe is a requirement of the (X)HTML standards. Your browser should be able to deal with combined or uncombined characters; however, uncombined characters may not display appropriately due to font rendering issues, which is why you might be inclined to precompose any uncombined characters in your (X)HTML, i.e., convert them to Unicode Normal Form C (NFC).

Andy.
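The precomposition Andy describes can be sketched with the core Unicode::Normalize module (shipped with Perl since 5.8); a minimal example, using "u" plus a combining diaeresis as the uncombined input:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "u" followed by U+0308 COMBINING DIAERESIS: two code points (decomposed).
my $decomposed = "u\x{0308}";

# NFC precomposes them into the single code point U+00FC (u-umlaut).
my $composed = NFC($decomposed);

printf "before: %d code points, after: %d code points\n",
    length($decomposed), length($composed);
```

Feeding the NFC form to a browser sidesteps the font-rendering issues with combining marks that Andy mentions.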
RE: MARC Records, XML, and encoding
From: Joshua Ferraro [mailto:[EMAIL PROTECTED]]
Sent: 19 May, 2006 13:40
To: Edward Summers
Cc: perl4lib
Subject: Re: MARC Records, XML, and encoding

> Hi all,
> Here is an OCLC record: http://liblime.com/public/oclc1.dat
> So ... any suggestions for tracking down this problem? ... and what
> about ideas for handling these records 'in the wild' that have some
> encoding problems... what do other MARC libraries do?

I was curious whether this record was bad in WorldCat, and since I have access to WorldCat, I looked at the record. There appears to be one diacritic in it: a MARC-8 E2, combining acute, which has "e" as its base character. I exported the record from WorldCat and it does in fact have an E2 in it. However, the record above and the one I exported from OCLC Connexion differ in size: 1442 bytes above vs. 1387 from OCLC. The 005 fields also differ: 20060516100102.0 above vs. 20060519162028.0 from OCLC, so the size difference is not surprising. MarcView accepts both records without complaint, and looking at them side by side shows only very minor edits. I suspect the record was edited on OCLC and then exported, whereas I simply exported the record without making any edits. This doesn't solve your issue, but I don't think the issue is with the actual content of the record.

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
RE: yet another character encoding question
-----Original Message-----
From: Thomale, J [mailto:[EMAIL PROTECTED]]
Sent: 29 September, 2005 11:05
To: perl4lib@perl.org
Subject: RE: yet another character encoding question

> Right, that was my plan. Since latin-1 to UTF-8 isn't difficult to do
> (using utf8::encode()), I figured that would be the simplest solution.
> Or am I wrong?

Is there a requirement to deliver the MARC records in MARC-8 encoding? If not, then use utf8::encode() to convert the Latin-1 to UTF-8 and create the MARC records with Leader/09 = a.

Andy.
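The utf8::encode() approach works because a Perl string holding Latin-1 data has all code points at or below 0xFF, and utf8::encode() rewrites the string in place as UTF-8 octets. A small sketch with a made-up Latin-1 string:

```perl
use strict;
use warnings;

# A string of Latin-1 characters (code points <= 0xFF); the example
# value "Münster" is hypothetical.
my $latin1 = "M\x{FC}nster";

my $bytes = $latin1;
utf8::encode($bytes);   # in place: characters -> UTF-8 octets

# The u-umlaut (0xFC) becomes the two octets 0xC3 0xBC in UTF-8,
# so the byte string is one unit longer than the character string.
print length($latin1), " characters -> ", length($bytes), " bytes\n";
```

Remember to also set Leader/09 = a, as Andy notes, so consumers know the record is Unicode rather than MARC-8.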
RE: Use of encode([$encoding]) in MARC-XML
-----Original Message-----
From: Edward Summers [mailto:[EMAIL PROTECTED]]
Sent: 27 September, 2005 10:36
To: perl4lib@perl.org
Subject: Re: Use of encode([$encoding]) in MARC-XML

> On Sep 27, 2005, at 7:29 AM, Sperr, Edwin wrote:
> > I'm attempting to use XSL (on a Windows server) to transform XML that
> > I generated using MARC::File::XML. However, I keep running into
> > errors because of illegal characters.
>
> Well, part of the problem is that MARC::File::XML does not do character
> conversion from MARC-8 to UTF-8. If you aren't concerned about special
> characters, try changing the encoding in the XML declaration to
> ISO-8859-1. If that does the trick, let me know and I'll provide
> details on how to do this with the MARC::File::XML API.

While changing the XML declaration to ISO-8859-1 might allow an XML parser to deal with the file, I vaguely remember that the MARC-XML standard requires the encoding to be UTF-8. Can anyone verify that? A pointer would be helpful, or was I just dreaming...

Andy.
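For experimentation, the declaration change Ed suggests can be done with a one-line substitution over the serialized XML; the `<record/>` document below is a hypothetical stand-in for real MARC-XML output:

```perl
use strict;
use warnings;

# Hypothetical document; real MARC::File::XML output would be larger.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<record/>};

# Patch the encoding attribute in the XML declaration so a parser
# treats the byte stream as ISO-8859-1 instead of UTF-8.
$xml =~ s/(<\?xml[^?]*encoding=")UTF-8(")/${1}ISO-8859-1${2}/;

print $xml, "\n";
```

Note this only relabels the bytes; it does not convert them, so it is a diagnostic workaround rather than a fix, and (per the caveat above) the MARC-XML standard may require UTF-8.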
RE: Corrupt MARC records
Most MARC utilities, like MARC::Record, depend on the actual directory lengths and on well-formed structure. Isn't that what standards are for? But sometimes you really do get badly formed MARC records and need to recover the data.

The presented code does have two caveats, which I pointed out and Ed reiterates. The directory *must* be in the same order as the fields. However, even if the fields are not in the same order as the directory, code could be written to take that into account, so long as you can assume that the start positions in the directory entries give the nearest position to the data. If we sort the directory on the start-position field, we will have the directory in the order necessary for extraction by the presented code. Of course, you would probably want to keep track of both the original and the sorted directory order, so you can output the MARC record with the fields in the same order as the original. Things are never ideal when you have corrupt MARC records...

Andy.

-----Original Message-----
From: Ed Summers [mailto:[EMAIL PROTECTED]]
Sent: Saturday, May 07, 2005 3:11 PM
To: perl4lib@perl.org
Subject: Re: Corrupt MARC records

> I wondered if any of you had run into similar problems, or if you had
> any thoughts on how to tackle this particular issue.

It's ironic that MARC::Record *used* to do what Andrew suggests: using split() rather than substr() with the actual directory lengths. The reason for the switch was just as Andrew pointed out: the order of the tags in the directory is not necessarily the order of the field data. If you need to, you could try downloading MARC::Record v1.17 and using that. Or you could roll your own code and cut and paste it everywhere like Andrew ;-)

//Ed
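The recovery strategy discussed above (trust the directory's start positions, sort on them, and split the data area on field terminators rather than trusting the lengths) can be sketched as follows. This is not the code from the thread; it is a minimal reconstruction, and the `recover_fields` name and the record layout assumptions (leader/12-16 holds the base address, 3+4+5-byte directory entries) are standard MARC 21 structure:

```perl
use strict;
use warnings;

# Recover fields from a record whose directory lengths may be wrong but
# whose start positions are assumed to be (at least nearly) right.
sub recover_fields {
    my ($raw) = @_;
    my $base    = substr($raw, 12, 5);      # Leader/12-16: base address of data
    my $dir_end = index($raw, "\x1e");      # first field terminator ends the directory
    my $dir     = substr($raw, 24, $dir_end - 24);

    # Directory entry: 3-byte tag, 4-byte length, 5-byte start position.
    my @entries;
    while ($dir =~ /(.{3})(.{4})(.{5})/g) {
        push @entries, { tag => $1, start => $3 + 0 };
    }

    # Sort on start position so extraction order matches the data order;
    # in real use you would also keep the original order for re-output.
    @entries = sort { $a->{start} <=> $b->{start} } @entries;

    # Split the data area on field terminators instead of trusting lengths.
    my @fields = split /\x1e/, substr($raw, $base);
    my %result;
    for my $i (0 .. $#entries) {
        $result{ $entries[$i]{tag} } = $fields[$i] // '';
    }
    return %result;
}
```

The %result hash here loses repeated tags for brevity; handling repeats would need an array per tag, and as Ed notes, MARC::Record v1.17 used essentially this split() approach.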
RE: French translation for MARC tag descriptions
The French translation, when available, will be at:
http://www.loc.gov/marc/marcfre.html

The completed Spanish translation is at:
http://www.loc.gov/marc/marcspa.html

Andy.

-----Original Message-----
From: Christensen, David A. (CHT) [mailto:[EMAIL PROTECTED]]
Sent: 25 April, 2005 11:45
To: perl4lib@perl.org
Subject: French translation for MARC tag descriptions

Hi all,

Does anyone know of a good site giving the French translations for MARC tag descriptions? I'd like to enable my MARC::Descriptions module to nicely handle other languages... (I would, of course, seek permission from the site owner.)

Thanks,

--
David A. Christensen                Phone: (204) 726-6870
Technical Consultant                Toll-free MB: 1-888-226-8014
Manitoba Public Library Services    FAX: (204) 726-6868
http://maplin.gov.mb.ca             Email: [EMAIL PROTECTED]
RE: French translation for MARC tag descriptions
I should point out that, since Canada uses MARC 21, Library and Archives Canada might have the same information translated into French. So take a look at their site.

Andy.

-----Original Message-----
From: Houghton,Andrew [mailto:[EMAIL PROTECTED]]
Sent: 25 April, 2005 13:49
To: perl4lib@perl.org
Subject: RE: French translation for MARC tag descriptions

> The French translation, when available, will be at:
> http://www.loc.gov/marc/marcfre.html
>
> The completed Spanish translation is at:
> http://www.loc.gov/marc/marcspa.html
>
> Andy.
RE: MARC::Record and UTF-8
From: Ron Davies [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 07, 2005 2:54 AM
Subject: Re: MARC::Record and UTF-8

> At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
> > Does anyone know of any work underway to adapt MARC::Record for
> > UTF-8 encoding? Is the problem listed in
>
> I will have a similar project in a few months' time, converting a whole
> bunch of processing from MARC-8 to UTF-8. I would be very happy to
> assist in testing or development of a UTF-8 capability for MARC::Record.

This is not a Perl solution, but if you are just looking to convert MARC-8 records to UTF-8 records, you can use Terry Reese's MarcEdit program. Its MARC Tools section allows you to do batch conversions. You can download it from:
http://oregonstate.edu/~reeset/marcedit/html/downloads.html

Andy.
RE: MARC::Record and UTF-8
From: Ed Summers [mailto:[EMAIL PROTECTED]]
Sent: 07 January, 2005 09:56
To: perl4lib@perl.org
Subject: Re: MARC::Record and UTF-8

> On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
> > This is not a Perl solution, but if you are just looking to convert
> > MARC-8 records to UTF-8 records, you can use Terry Reese's MarcEdit
> > program.
>
> Does MarcEdit completely map MARC-8 to UTF-8?

Yes, it does. I think he uses the LC code table XML document for his conversions. The URL is:
http://www.loc.gov/marc/specifications/codetables.xml
which can be found off the Character Sets: Code Tables page at:
http://www.loc.gov/marc/specifications/specchartables.html

Andy.
RE: Warnings during decode() of raw MARC
From: Bryan Baldus [mailto:[EMAIL PROTECTED]]
Sent: 18 August, 2004 09:24
Subject: Warnings during decode() of raw MARC

> I'm probably missing something obvious, but I have been unsuccessful in
> trying to capture the warnings reported by MARC::Record that are set by
> MARC::File::USMARC->decode(). Is there a simple way to store the
> warnings reported during the decode() process (using a MARC::Record or
> MARC::Batch object)?

How about this technique:

#!perl
package main;

sub out {
    print STDERR "ERR: Error 1\n";
    print STDERR "ERR: Error 2\n";
    print STDERR "ERR: Error 3\n";
    return;
}

sub main {
    my @errs = ();
    open SAVERR, '>&', \*STDERR;      # save the current STDERR
    open STDERR, '>', 'errors.txt';   # redirect STDERR to a file
    open GETERR, '<', 'errors.txt';
    main::out();
    while (<GETERR>) { push(@errs, $_); }
    close(GETERR);
    open STDERR, '>&', \*SAVERR;      # restore STDERR
    print STDOUT scalar(@errs), "\n";
    return 0;
}

exit main::main();
RE: Filing-rules sort subroutine for authors' names?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 26 July, 2004 13:58
Subject: Re: Filing-rules sort subroutine for authors' names?

> Definitely possible--library automation systems and card printing
> systems do it. I'm not fully conversant with the rules myself, but can
> tell you that it's a lot more work than a regex plus a string compare.
> (I'm thinking of sorting titles here, actually.) Handling diacritics,
> sorting 10 after 9, etc. adds up to a decent amount of work. It's worth
> your while to determine whether you have to implement the rules 100% or
> have some wiggle room.
>
> From: Ben Ostrowsky [EMAIL PROTECTED]
> Subject: Filing-rules sort subroutine for authors' names?
>
> > Just a sanity check: is it really possible to create a Perl
> > subroutine that would compare two authors being sorted and enforce
> > the ALA filing rules?

Don't confuse ALA filing rules with NACO normalization rules. If you are trying to compare two author names, you should use the NACO normalization rules [1]. If you are trying to sort the headings into order, then use the filing rules [2]. You can probably write a simple Perl routine to implement the NACO normalization rules. The filing rules are much more complex, since you need to take into account numbers (e.g., 9 vs. 999), Roman numerals, dates in a variety of formats (including spans), and articles in various foreign languages. So it's a lot more work to do ALA filing rules correctly. IMHO, it's almost impossible to do ALA filing by computer, given the rules, and I have tried this for several concordances between LCSH and Dewey. My last attempt was for the publication People, Places Things [3], where the editors found only four headings misfiled out of 60,000, and that was a small subset of LCSH. With a lot of work you can get it mostly correct...

Andy.
[1] http://www.loc.gov/catdir/pcc/naco/normrule.html
[2] ALA Filing Rules, American Library Association (ALA), (c) 1980, ISBN 0-8389-3255-X
[3] People, Places Things, OCLC Online Computer Library Center, Inc., (c) 2001, ISBN 0-910608-69-5

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
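As a rough illustration of why NACO normalization is the tractable half, here is a sketch of a NACO-style normalizer. This is an approximation only: the actual rules [1] have many more cases (subfield handling, specific punctuation mappings, retained commas, and so on), and the `naco_normalize` name is made up for the example:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Rough NACO-style normalization: strip diacritics, uppercase, and
# collapse punctuation. NOT the full rule set.
sub naco_normalize {
    my ($s) = @_;
    $s = NFD($s);                 # split base characters from diacritics
    $s =~ s/\p{Mn}//g;            # drop the combining marks
    $s = uc $s;
    $s =~ s/[^\p{Alnum} ]/ /g;    # punctuation -> space (approximation)
    $s =~ s/\s+/ /g;              # collapse runs of whitespace
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

print naco_normalize("D\x{E9}bussy, Claude."), "\n";
```

Comparing two headings after normalization is then a plain string `eq`; the ALA *filing* rules, as the thread notes, need far more than this.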
RE: Displaying diacritics in a terminal vs. a browser
From: Christopher Morgan [mailto:[EMAIL PROTECTED]]
Sent: 01 July, 2004 10:50
Subject: Displaying diacritics in a terminal vs. a browser

> I use the $cs->to_utf8 conversion from MARC::Charset to display MARC
> Authority records in a browser, and the diacritics display properly
> there. But they don't display properly via STDOUT in my terminal window
> (I get two characters instead of one -- one with the letter and one
> with the accent mark). Am I doing something wrong? I'm using:
> binmode(STDOUT, ':utf8');
> Is there any way around this problem, or is it a limitation of terminal
> displays?

I'm not sure what MARC::Charset does internally, but MARC-8 defines the diacritic separately from the base character. So even using binmode(STDOUT, ':utf8') will produce two characters, the base character followed by the diacritic. If you want them combined, then you need to combine them.

It just so happens that I have recently been converting MARC-XML to RDF. The RDF specification mandates Unicode Normal Form C (NFC), in which the base character and the diacritic are combined. MARC-XML uses Unicode Normal Form D (NFD), in which the base character is separate from the diacritic. So I hacked together some Perl scripts to convert Unicode NFD to Unicode NFC. The scripts require Perl 5.8.0. I was talking with a colleague just yesterday about whether we should unleash these on the Net... They need to be cleaned up a little and need some basic documentation on how to run them.

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
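A file-level NFD-to-NFC filter in the spirit of the scripts described above might look like this. This is a reconstruction, not Andy's actual code; Unicode::Normalize ships with Perl 5.8.0, which explains the version requirement he mentions:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: read a UTF-8 file (possibly in NFD), write it
# back out in NFC so base characters and diacritics are precomposed.
sub nfd_to_nfc_file {
    my ($in, $out) = @_;
    open my $ifh, '<:utf8', $in  or die "open $in: $!";
    open my $ofh, '>:utf8', $out or die "open $out: $!";
    print {$ofh} NFC($_) while <$ifh>;
    close $ofh;
    close $ifh;
}
```

NFC() is idempotent, so running the filter over already-composed data is harmless.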
RE: Displaying diacritics in a terminal vs. a browser
From: Paul Hoffman [mailto:[EMAIL PROTECTED]]
Sent: 01 July, 2004 11:57
Subject: Re: Displaying diacritics in a terminal vs. a browser

> Unless I'm very much mistaken, Chris's code is outputting UTF-8 to the
> terminal, not MARC-8.
>
> > From: Christopher Morgan [mailto:[EMAIL PROTECTED]]
> > Sent: 01 July, 2004 10:50
> > Subject: Displaying diacritics in a terminal vs. a browser
> >
> > (I get two characters instead of one -- one with the letter and one
> > with the accent mark). Am I doing something wrong?

I realized that he was outputting UTF-8, but if he started with MARC-8 and used $cs->to_utf8 in MARC::Charset, MARC::Charset would most likely keep the data in Unicode Normal Form D, which is why he sees two characters. When he views them in a browser, the browser most likely receives the same two characters but, depending on the fonts in use, renders them to look as *if* they were one combined character.

> http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html

Nice reference...

Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm
RE: NACO Normalization and Text::Normalize
From: Brian Cassidy [mailto:[EMAIL PROTECTED]]
Subject: RE: NACO Normalization and Text::Normalize

> * normalize() inputs: either a MARC::Record object or a string.

This should probably accept an arbitrary number of inputs so you can do ...

> * compare() inputs: either two M::R objects or two strings. Given two
> M::R objects, both are normalize()'ed. It would return false (or
> should it be true?) if, based on the rules [1], some field in $a
> matches some field in $b.

You may need some additional parameters, such as which tags to normalize, since you may want to do NACO normalization on fields other than the 1XX. For example, I currently do NACO normalization on the 1XX, 4XX, 5XX and 7XX in my Authority records. By doing that I can quickly build a hash that lets me find the broader, narrower, related and use-for references for a record across the entire Authority file.

Andy.
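The lookup hash Andy describes can be sketched without any MARC modules by indexing normalized headings back to their source tags. The (tag, heading) pairs below are invented, and plain uppercasing stands in for real NACO normalization:

```perl
use strict;
use warnings;

# Hypothetical (tag, heading) pairs pulled from one authority record;
# uc() is a trivial stand-in for NACO normalization.
my @headings = (
    [ '100', 'Twain, Mark'        ],
    [ '400', 'Clemens, Samuel L.' ],
);

# Index normalized heading -> list of tags it appears under. Across a
# whole authority file, a hit on a 4XX key is a use-for reference to
# that record's 1XX heading; 5XX keys similarly locate related terms.
my %index;
for my $h (@headings) {
    my ($tag, $text) = @$h;
    push @{ $index{ uc $text } }, $tag;
}

print join(',', @{ $index{'CLEMENS, SAMUEL L.'} }), "\n";   # 400
```

In practice the hash value would also carry the record identifier, so a normalized query heading resolves directly to the authority record to display.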