RE: How to convert from ANSEL/MARC-8 to UTF-8?

2009-01-07 Thread Houghton,Andrew
 From: Galen Charlton [mailto:galen.charl...@liblime.com]
 Sent: Wednesday, January 07, 2009 11:47 AM
 To: Michael Lackhoff
 Cc: perl4lib@perl.org
 Subject: Re: How to convert from ANSEL/MARC-8 to UTF-8?
 
 On Wed, Jan 7, 2009 at 11:42 AM, Michael Lackhoff
 lackh...@fh-muenster.de wrote:
 diacritics + base char to the combined character. So I still have two
 characters for, e.g., the
 German umlauts. This might be correct UTF-8 but is not usable to
 present in (X)HTML.

I just cannot let that go.  UTF-8 *is* Unicode, encoded in a particular way.
Whether the characters are combined or uncombined is not relevant to
(X)HTML, so long as you specify that the document is in a Unicode
encoding, e.g., UTF-8, UTF-16BE, or UTF-16LE, and the user agent, e.g., the
browser, understands Unicode, which I believe is a requirement of the (X)HTML
standards.  Your browser should be able to deal with combined or uncombined
characters; however, uncombined characters may not display appropriately due
to font rendering issues, which is why you might be inclined to pre-compose
any uncombined characters in your (X)HTML, i.e., convert them to Unicode
Normalization Form C (NFC).
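For what it's worth, a minimal sketch of that pre-composition step in Perl, using the core Unicode::Normalize module (available since Perl 5.8):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# "e" followed by U+0301 COMBINING ACUTE ACCENT -- two code points (NFD)
my $decomposed = "e\x{0301}";

# Pre-compose to a single code point, U+00E9, before emitting (X)HTML
my $composed = NFC($decomposed);

printf "before: %d code points, after: %d\n",
    length($decomposed), length($composed);   # before: 2, after: 1
```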


Andy.



RE: MARC Records, XML, and encoding

2006-05-19 Thread Houghton,Andrew
 From: Joshua Ferraro [mailto:[EMAIL PROTECTED] 
 Sent: 19 May, 2006 13:40
 To: Edward Summers
 Cc: perl4lib
 Subject: Re: MARC Records, XML, and encoding
 
 Hi all,
 
 Here is an OCLC record:
 
 http://liblime.com/public/oclc1.dat
 
 So ... any suggestions for tracking down this problem? ... 
 and what about ideas for handling these records 'in the wild' 
 that have some encoding problems... what do other MARC libraries do?

I was curious about whether this record was bad in WorldCat and since
I have access to WorldCat, I looked at the record.  There appears to
be one diacritic in this record, a MARC-8 E2, combining acute, which
has e as its base character.  I exported the record from WorldCat
and it does in fact have an E2 in it.

However, the size of the record above and the one I exported from
OCLC Connexion are different: 1442 bytes above vs. 1387 from OCLC.  The
005 fields also differ: 20060516100102.0 above vs. 20060519162028.0 from
OCLC, so it's not surprising that the sizes are different.

When I run MarcView on both records it doesn't complain, and looking
at the two side by side there appear to be only very minor
edits.  I suspect that the record was edited on OCLC and then exported,
whereas I simply exported the record without making any edits.

This doesn't solve your issue, but I don't think the issue is with
the actual content of the record.


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm



RE: yet another character encoding question

2005-09-29 Thread Houghton,Andrew
 -Original Message-
 From: Thomale, J [mailto:[EMAIL PROTECTED] 
 Sent: 29 September, 2005 11:05
 To: perl4lib@perl.org
 Subject: RE: yet another character encoding question
 
 Right, that was my plan. Since latin-1 to UTF-8 isn't 
 difficult to do (using utf8::encode()), I figured that would 
 be the simplest solution.
 Or am I wrong?

Is there a requirement to deliver the MARC records in MARC-8
encoding?  If not, then use utf8::encode() to encode the Latin-1
to UTF-8 and create the MARC records with Leader/09 = a.
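As a rough sketch of those two steps (using the core Encode module rather than utf8::encode(), and a hypothetical leader string), the conversion plus the Leader/09 flag might look like:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Latin-1 bytes for "Muenchen" with u-umlaut
my $latin1 = "M\xFCnchen";

# Re-encode the field data as UTF-8 bytes
my $utf8 = encode('UTF-8', decode('ISO-8859-1', $latin1));

# Flag the record as Unicode: Leader/09 = 'a'
my $leader = '00000nam  2200000   4500';   # hypothetical 24-byte leader
substr($leader, 9, 1) = 'a';

print "$leader\n";
```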


Andy.


RE: Use of encode([$encoding]) in MARC-XML

2005-09-27 Thread Houghton,Andrew
 -Original Message-
 From: Edward Summers [mailto:[EMAIL PROTECTED] 
 Sent: 27 September, 2005 10:36
 To: perl4lib@perl.org
 Subject: Re: Use of encode([$encoding]) in MARC-XML
 
 On Sep 27, 2005, at 7:29 AM, Sperr, Edwin wrote:
  I'm attempting to use XSL (on a Windows server) to 
 transform XML that 
  I generated using MARC::File::XML.  However, I keep running into 
  errors because of illegal characters.
 
 Well, part of the problem is that MARC::File::XML does not do 
 character conversion from MARC-8 to UTF-8. If you aren't 
 concerned about special characters immediately, try changing 
 the encoding in the XML declaration to ISO-8859-1. If that 
 does the trick, let me know and I'll provide details on how to 
 do this with the MARC::File::XML API.

While changing the XML declaration to ISO-8859-1 might allow an XML
parser to deal with the file, I vaguely remember that the MARC-XML
standard requires the encoding to be UTF-8.  Can anyone verify that?
A pointer would be helpful, or was I just dreaming...


Andy.


RE: Corrupt MARC records

2005-05-07 Thread Houghton,Andrew
 
Most MARC utilities, like MARC::Record, depend upon the actual directory 
lengths and a well-formed structure.  Isn't that what standards are for?  But 
sometimes you really do get badly formed MARC records and need to recover the 
data.  The presented code does have two caveats, which I pointed out and Ed 
reiterates.  The directory *must* be in the same order as the fields.

However, even if the fields are not in the same order as the directory, code 
could be written to take that into account so long as you can make the 
assumption that the start positions for each directory entry give the nearest 
position to the data.  If we take the directory and sort on the start position 
field, we will have the directory in the order necessary for extraction by the 
presented code.
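A sketch of that sort, over raw 12-byte directory entries (tag: 3 bytes, field length: 4, start position: 5), with made-up data:

```perl
use strict;
use warnings;

# Two directory entries listed out of order relative to the field data:
# 650 (starts at 00120) listed before 245 (starts at 00000)
my $dir = '650002600120' . '245008900000';

# Split into 12-byte entries and parse each one
my @entries = $dir =~ /(.{12})/g;
my @parsed  = map { {
    tag   => substr($_, 0, 3),
    len   => 0 + substr($_, 3, 4),
    start => 0 + substr($_, 7, 5),
} } @entries;

# Sort by start position so the fields extract in the order
# they actually occur in the record body
my @sorted = sort { $a->{start} <=> $b->{start} } @parsed;

print join(',', map { $_->{tag} } @sorted), "\n";   # 245,650
```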

Of course, you would probably want to keep track of the original directory and 
the sorted directory order so you can output the MARC record with the fields in 
the same order as the original.  Things are never ideal when you have corrupt 
MARC records...


Andy.

-Original Message-
From: Ed Summers [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 07, 2005 3:11 PM
To: perl4lib@perl.org
Subject: Re: Corrupt MARC records

 I wondered if any of you had run into similar problems, or if you had 
 any thoughts on how to tackle this particular issue.

It's ironic that MARC::Record *used* to do what Andrew suggests: using
split() rather than substr() with the actual directory lengths.  The reason
for the switch was just as Andrew pointed out: the order of the tags in the
directory is not necessarily the order of the field data.

If you need to, you could download MARC::Record v1.17 and try using that. 
Or you could roll your own code and cut and paste it everywhere like Andrew ;-)

//Ed



RE: French translation for MARC tag descriptions

2005-04-25 Thread Houghton,Andrew

The French translation, not yet available, will be at:

http://www.loc.gov/marc/marcfre.html

The completed Spanish translation is at:

http://www.loc.gov/marc/marcspa.html


Andy.

 -Original Message-
 From: Christensen, David A. (CHT) [mailto:[EMAIL PROTECTED] 
 Sent: 25 April, 2005 11:45
 To: perl4lib@perl.org
 Subject: French translation for MARC tag descriptions
 
 Hi all,
 
 Does anyone know of a good site giving the French 
 translations for MARC tag descriptions?  I'd like to enable 
 my MARC::Descriptions module to nicely
 handle other languages...
 
 (I would, of course, seek permission from the site owner)
 
 Thanks,
 
 --
 David A. Christensen   Phone: (204) 726-6870
 Technical Consultant   Toll-free MB: 1-888-226-8014
 Manitoba Public Library Services   FAX: (204) 726-6868
 http://maplin.gov.mb.caEmail: [EMAIL PROTECTED]
 


RE: French translation for MARC tag descriptions

2005-04-25 Thread Houghton,Andrew
 
I should point out that since Canada uses MARC 21, it is possible that 
Library and Archives Canada has the same information translated into French, 
so take a look at their site.

Andy.

 -Original Message-
 From: Houghton,Andrew [mailto:[EMAIL PROTECTED] 
 Sent: 25 April, 2005 13:49
 To: perl4lib@perl.org
 Subject: RE: French translation for MARC tag descriptions
 
 
 The unavailable French translation will be at:
 
 http://www.loc.gov/marc/marcfre.html
 
 The completed Spanish translation is at:
 
 http://www.loc.gov/marc/marcspa.html
 
 
 Andy.
 
 


RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
From: Ron Davies [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 07, 2005 2:54 AM
Subject: Re: MARC::Record and UTF-8

At 07:50 7/01/2005, [EMAIL PROTECTED] wrote:
Does anyone know of any work underway to adapt MARC::Record for UTF-8
encoding?

I will have a similar project in a few months' time, converting a whole bunch 
of processing from MARC-8 to UTF-8. I would be very happy to assist in 
testing or development of a UTF-8 capability for MARC::Record. Is the problem 
listed in

This is not a Perl solution, but if you are just looking to convert 
MARC-8 records to UTF-8 records you can use Terry Reese's MarcEdit 
program.  Under its MARC Tools section it allows you to do batch 
conversions.  You can download it from:

http://oregonstate.edu/~reeset/marcedit/html/downloads.html


Andy.





RE: MARC::Record and UTF-8

2005-01-07 Thread Houghton,Andrew
 From: Ed Summers [mailto:[EMAIL PROTECTED] 
 Sent: 07 January, 2005 09:56
 To: perl4lib@perl.org
 Subject: Re: MARC::Record and UTF-8
 
 On Fri, Jan 07, 2005 at 08:13:08AM -0500, Houghton,Andrew wrote:
  This is not a Perl solution, but if you are just looking to convert
  MARC-8 records to UTF-8 record you can use Terry Reese's MarcEdit 
  program.
 
 Does MarcEdit completely map MARC-8 to UTF-8?

Yes it does.  I think he uses the LC code table XML document for his
conversions.  The URL is:

http://www.loc.gov/marc/specifications/codetables.xml

which can be found off the Character Sets: Code Tables page at:

http://www.loc.gov/marc/specifications/specchartables.html


Andy.


RE: Warnings during decode() of raw MARC

2004-08-18 Thread Houghton,Andrew
 From: Bryan Baldus [mailto:[EMAIL PROTECTED] 
 Sent: 18 August, 2004 09:24
 Subject: Warnings during decode() of raw MARC
 
 I'm probably missing something obvious, but I have been 
 unsuccessful in trying to capture the warnings reported by  
 MARC::Record that are set by MARC::File::USMARC->decode(). Is 
 there a simple way to store the warnings reported during the 
 decode() process (using a MARC::Record or MARC::Batch object)? 
 

How about this technique:

#!perl

package main;

sub out {

  # Emit a few sample messages on STDERR
  print STDERR "ERR: Error 1\n";
  print STDERR "ERR: Error 2\n";
  print STDERR "ERR: Error 3\n";
  return;
}

sub main {

  my @errs = ();

  open SAVERR, '>&', \*STDERR;      # save the current STDERR
  open STDERR, '>', 'errors.txt';   # redirect STDERR to a file
  open GETERR, '<', 'errors.txt';   # read handle on the same file

  main::out();

  # STDERR is unbuffered by default, so the messages are already on disk
  while (<GETERR>) {
    push(@errs, $_);
  }

  close(GETERR);

  open STDERR, '>&', \*SAVERR;      # restore STDERR

  print STDOUT scalar(@errs), "\n";
  return 0;
}

exit main::main();



RE: Filing-rules sort subroutine for authors' names?

2004-07-26 Thread Houghton,Andrew


 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: 26 July, 2004 13:58
 Subject: Re: Filing-rules sort subroutine for authors' names?
 
 Definitely possible--library automation systems and card 
 printing systems do it.
 
 I'm not fully conversant with the rules myself, but can tell 
 you that it's a lot more work than a regex plus a string 
 compare.  (I'm thinking of sorting titles here, actually.)  
 Handling diacritics, sorting 10 after 9, etc. adds up to 
 a decent amount of work.  It's worth your while to determine 
 whether you have to implement the rules 100% or have some wiggle room.
 
From: Ben Ostrowsky [EMAIL PROTECTED]
Subject: Filing-rules sort subroutine for authors' names?


 Just a sanity check: is it really possible to create a Perl 
 subroutine that would compare two authors being sorted and 
 enforce the ALA filing rules?

Don't confuse ALA filing rules with NACO normalization rules.
If you are trying to compare two author names you should use
the NACO normalization rules [1].  If you are trying to sort
the headings into order then use the filing rules [2].  You
can probably write a simple Perl routine to do the NACO 
normalization rules.  The filing rules are much more complex,
since you need to take into account numbers, e.g. 9 vs. 999,
as well as Roman numerals, dates in a variety of formats, 
including spans, and articles in various foreign languages.
So it's a lot more complex to do ALA filing rules correctly.
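As a starting point only, a grossly simplified sketch of NACO-style normalization in Perl; the real rules [1] handle many more cases (subfield delimiters, the first comma, special character mappings), so this is an illustration, not the rule set:

```perl
use strict;
use warnings;

# Simplified NACO-style normalization -- NOT the full NACO rule set
sub naco_normalize {
    my $s = uc shift;          # fold to upper case
    $s =~ s/[^\w\s]/ /g;       # map punctuation to blanks (simplified)
    $s =~ s/\s+/ /g;           # collapse runs of whitespace
    $s =~ s/^\s+|\s+$//g;      # trim leading/trailing blanks
    return $s;
}

print naco_normalize('Smith, John, 1942-'), "\n";   # SMITH JOHN 1942
```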

IMHO, it's almost impossible to do ALA filing by computer, given the
complexity of the rules; I say that having tried it for several concordances
between LCSH and Dewey.  My last attempt was for the publication
People, Places & Things [3], where the editors found only four
headings that were misfiled out of 60,000, and that was a
small subset of LCSH.  With a lot of work you can get it mostly
correct...


Andy.

[1] http://www.loc.gov/catdir/pcc/naco/normrule.html
[2] ALA Filing Rules, American Library Association (ALA),
(c) 1980, ISBN: 0-8389-3255-X
[3] People, Places & Things, OCLC Online Computer Library
Center, Inc., (c) 2001, ISBN: 0-910608-69-5


Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm



RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Houghton,Andrew
 From: Christopher Morgan [mailto:[EMAIL PROTECTED] 
 Sent: 01 July, 2004 10:50
 Subject: Displaying diacritics in a terminal vs. a browser
 
 I use the $cs->to_utf8 conversion from MARC::Charset to 
 display MARC Authority records in a browser, and the 
 diacritics display properly there.
 But they don't display properly via SDTOUT in my terminal 
 window (I get two characters instead of one -- one with the 
 letter and one with the accent mark). Am I doing something 
 wrong? I'm using:
  
   binmode(STDOUT, ':utf8');
 
 Is there any way around this problem, or is it a limitation 
 of terminal displays? 

I'm not sure what MARC::Charset does internally, but MARC-8 
defines the diacritic separately from the base character.  So 
even using binmode(STDOUT, ':utf8') will produce two characters,
the base character followed by the diacritic.  If you
want them combined then you need to combine them.

It just so happens that I have recently been converting MARC-XML
to RDF.  The RDF specification mandates Unicode Normal form C,
which means that the base character and the diacritic are 
combined.  MARC-XML uses Unicode Normal form D, which means that 
the base character is separate from the diacritic.  So I hacked 
together some Perl scripts to convert Unicode NFD to Unicode NFC.
The scripts require Perl 5.8.0.
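Such a filter can be quite small in Perl 5.8+; a sketch (not the scripts mentioned above) using the core Unicode::Normalize module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode STDIN,  ':utf8';
binmode STDOUT, ':utf8';

# Compose base character + combining diacritic pairs on each line
while (my $line = <STDIN>) {
    print NFC($line);
}
```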

I was talking with a colleague, just yesterday, about whether we 
should unleash these on the Net...  They need to be cleaned up a 
little and need some basic documentation on how to run the Perl 
scripts.


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm


RE: Displaying diacritics in a terminal vs. a browser

2004-07-01 Thread Houghton,Andrew
 From: Paul Hoffman [mailto:[EMAIL PROTECTED] 
 Sent: 01 July, 2004 11:57
 Subject: Re: Displaying diacritics in a terminal vs. a browser
 
 Unless I'm very much mistaken, Chris's code is outputting 
 UTF-8 to the terminal, not MARC-8.

  From: Christopher Morgan [mailto:[EMAIL PROTECTED]
  Sent: 01 July, 2004 10:50
  Subject: Displaying diacritics in a terminal vs. a browser
  
  (I get two characters instead of one -- one with the letter 
  and one with the accent mark). Am I doing something wrong? 

I realized that he was outputting UTF-8, but if he started with
MARC-8 and used $cs->to_utf8 in MARC::Charset, MARC::Charset 
would most likely keep the data in Unicode Normal Form D, which
is why he sees two characters.  When he views them in a browser,
the browser most likely receives the two characters but,
depending upon the fonts in use, will combine the
two characters to look as *if* they were one composed character.

 
 http://mail.nl.linux.org/linux-utf8/2003-07/msg00231.html
 

Nice reference...


Andy.

Andrew Houghton, OCLC Online Computer Library Center, Inc.
http://www.oclc.org/about/
http://www.oclc.org/research/staff/houghton.htm


RE: NACO Normalization and Text::Normalize

2003-08-27 Thread Houghton,Andrew
From: Brian Cassidy [mailto:[EMAIL PROTECTED]
Subject: RE: NACO Normalization and Text::Normalize

 * normalize()

 inputs: either a MARC::Record object or a string. This should
 probably accept an arbitrary number of inputs.

 * compare()
 
 inputs: either two M::R objects or two strings.
 
 Given two M::R objects, both are normalize()'ed. It would return false
 (or should it be true?) if, based on the rules [1], some field in $a
 matches some field in $b.

You may need some additional parameters, like what tags to normalize,
since you may want to do NACO normalization on fields other than the
1XX.  For example, I currently do NACO normalization on the 1XX, 4XX,
5XX and 7XX in my Authority records.  By doing that I can quickly
build a hash that allows me to find the broader, narrower, related 
and use-for references for a record in the entire Authority file.
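A sketch of that hash-building step, with a hypothetical naco_normalize() and inline sample data standing in for real Authority record fields:

```perl
use strict;
use warnings;

# Hypothetical normalizer; a real one would implement the NACO rules
sub naco_normalize {
    my $s = uc shift;
    $s =~ s/[^\w\s]/ /g;
    $s =~ s/\s+/ /g;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Index normalized headings by tag class:
# 1XX = established form, 4XX = see-from, 5XX = see-also
my %index;
my @fields = (
    [ '100', 'Twain, Mark, 1835-1910',               'n79021164' ],
    [ '400', 'Clemens, Samuel Langhorne, 1835-1910', 'n79021164' ],
);
for my $f (@fields) {
    my ($tag, $heading, $recid) = @$f;
    push @{ $index{ naco_normalize($heading) } }, [ $tag, $recid ];
}

print scalar(keys %index), " normalized headings indexed\n";
```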

Andy.