Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Eric Lease Morgan
On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:

 http://zoia.library.nd.edu/tmp/tor.marc
 
 When debugging any encoding issue it's always good to know:
 
   a) how the records were obtained
   b) how have they been manipulated before you touch them (basically,
      how many times may they have been converted by some bungling process)?
   c) what encoding they claim to be now? and
   d) what encoding they are, if any?



I'm making headway on my MARC records, but only through the use of brute force.

I used wget to retrieve the MARC records (as well as associated PDF and text 
files) from the Internet Archive. The process resulted in 538 records. I then 
used marcdump to look at the records individually. When it choked on some weird 
character I renamed the offending file and re-examined the lot again. Through 
this process my pile of records dwindled to 523. I then concatenated the 
non-offending records into a single file, and I made them available, again, at 
the URL above. Now, when I use marcdump it does not crash and burn on tor.marc, 
but it does say there are 121 errors. 

I did play a bit with yaz-marcdump to seemingly convert things from marc-8 to 
utf-8, but I'm not so sure it does what is expected. Does it actually convert 
characters, or does it simply change a value in the leader of each record? If 
the former, then how do I know it is not double-encoding things? If the latter, 
then my resulting data set is still broken.

Upon reflection, I think the validation of MARC records ought to be exactly the 
same as the validation of XML. First they should be well-formed. Leader. 
Directory. Bibliographic section. Complete with ASCII characters 29, 30, and 31 
in the proper locations. Second, they should validate. This means fields where 
integers are expected should include integers. It means there are characters in 
245. Etc. Third, the data should be meaningful. The characters in 245 should be 
titles. The characters in 020 should be ISBN numbers (not an ISBN number followed 
by "(pbk)"). Etc. Finally, the data should be accurate. The titles placed in 245 
are the real titles. The author names are the real author names. Etc. 
Validations #1-#3 can be done by computers. Validation #4 is the work of humans.
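
As a rough illustration of what those second-level checks might look like in 
code -- a minimal sketch only, assuming a MARC::Record object named $record is 
already in hand, and showing just a couple of example checks, not a complete 
validator:

  my @problems;
  for my $field ( $record->fields() ) {
      next if $field->is_control_field();
      for my $i ( 1, 2 ) {
          my $ind = $field->indicator($i);
          # indicators should be a digit or a blank
          push @problems, $field->tag() . ": odd indicator '$ind'"
              unless $ind =~ /^[0-9 ]$/;
      }
  }
  # there should be a title in 245 subfield a
  push @problems, 'no title in 245$a'
      unless $record->field('245') && $record->field('245')->subfield('a');
  print "$_\n" for @problems;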

If MARC records are not well-formed and do not validate according to the 
standard, then just like XML processors, they should be used. Garbage in. 
Garbage out. 

-- 
Eric Lease Morgan
University of Notre Dame

Great Books Survey -- http://bit.ly/auPD9Q


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Jonathan Rochkind
XML well-formedness and validity checks can't find badly encoded 
characters either -- char data that claims to be one encoding but is 
really another, or that has been double-encoded and now means something 
different than intended.


There's really no way to catch that but heuristics.  All of the 
marc-validating and well-formedness-checking in the world wouldn't 
save you from this problem if people/software don't properly keep 
track of their encodings and end up putting mis-encoded chars in the data.
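
About the only heuristic I know of for the double-encoding case is to look for 
its tell-tale byte pattern -- a Latin-range UTF-8 lead byte (0xC2/0xC3) that has 
itself been re-encoded, yielding 0xC3 0x82 or 0xC3 0x83 followed by 0xC2 and a 
continuation byte. A rough, hypothetical Perl smell test (it only flags 
suspects, it proves nothing):

  use strict;
  use warnings;

  # read raw records and flag ones containing the byte signature of
  # UTF-8 that has been UTF-8-encoded a second time
  open my $fh, '<:raw', 'tor.marc' or die "cannot open: $!";
  local $/ = "\x1d";                       # MARC record terminator
  my $i = 0;
  while ( my $raw = <$fh> ) {
      $i++;
      print "record $i looks double-encoded\n"
          if $raw =~ /\xC3[\x82\x83]\xC2[\x80-\xBF]/;
  }
  close $fh;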


On 4/11/2011 11:31 AM, Eric Lease Morgan wrote:

On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:


http://zoia.library.nd.edu/tmp/tor.marc

When debugging any encoding issue it's always good to know:

   a) how the records were obtained
   b) how have they been manipulated before you
  touch them (basically, how many times may
  they have been converted by some bungling
  process)?
   c) what encoding they claim to be now? and
   d) what encoding they are, if any?



I'm making headway on my MARC records, but only through the use of brute force.

I used wget to retrieve the MARC records (as well as associated PDF and text 
files) from the Internet Archive. The process resulted in 538 records. I then 
used marcdump to look at the records individually. When it choked on some weird 
character I renamed the offending file and re-examined the lot again. Through 
this process my pile of records dwindled to 523. I then concatenated the 
non-offending records into a single file, and I made them available, again, at 
the URL above. Now, when I use marcdump it does not crash and burn on tor.marc, 
but it does say there are 121 errors.

I did play a bit with yaz-marcdump to seemingly convert things from marc-8 to 
utf-8, but I'm not so sure it does what is expected. Does it actually convert 
characters, or does it simply change a value in the leader of each record? If 
the former, then how do I know it is not double-encoding things? If the later, 
then my resulting data set is still broken.

Upon reflection, I think the validation of MARC records ought to be exactly the same as 
the validation of XML. First they should be well-formed. Leader. Directory. Bibliographic 
section. Complete with ASCII characters 29, 30, and 31 in the proper locations. Second, 
they should validate. This means fields where integers are expected should include 
integers. It means there are characters in 245. Etc. Third, the data should be 
meaningful. The characters in 245 should be titles. The characters in 020 should be ISN 
numbers (not ISBN number and then (pbk)). Etc. Finally, the data should be 
accurate. The titles placed in 245 are the real titles. The author names are the real 
author names. Etc. Validations #1-#3 can be done by computers. Validation #4 is the work 
of humans.

If MARC records are not well-formed and do not validate according to the 
standard, then just like XML processors, they should be used. Garbage in. 
Garbage out.



Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Mike Taylor
On 11 April 2011 16:40, Jonathan Rochkind rochk...@jhu.edu wrote:
 XML well-formedness and validity checks can't find badly encoded characters
 either -- char data that claims to be one encoding but is really another, or
 that has been double-encoded and now means something different than
 intended.

 There's really no way to catch that but heuristics.  All of the
 marc-validating and well-formedness-checking in the world wouldn't prevent
 you from this problem, if people/software don't properly keep track of their
 encodings and not put mis-encoded chars in the data.

Right.  Double-encoding, or encoding one way while telling the record
you did it another way, is a data-level pilot error -- on a par with
the kind of error when someone means to type "you're" but types
"your".  The error is not with the MARC record, but with the data
that's been put INTO the MARC records.

-- Mike.




 On 4/11/2011 11:31 AM, Eric Lease Morgan wrote:

 On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:

 http://zoia.library.nd.edu/tmp/tor.marc

 When debugging any encoding issue it's always good to know:

   a) how the records were obtained
   b) how have they been manipulated before you
      touch them (basically, how many times may
      they have been converted by some bungling
      process)?
   c) what encoding they claim to be now? and
   d) what encoding they are, if any?


 I'm making headway on my MARC records, but only through the use of brute
 force.

 I used wget to retrieve the MARC records (as well as associated PDF and
 text files) from the Internet Archive. The process resulted in 538 records.
 I then used marcdump to look at the records individually. When it choked on
 some weird character I renamed the offending file and re-examined the lot
 again. Through this process my pile of records dwindled to 523. I then
 concatenated the non-offending records into a single file, and I made them
 available, again, at the URL above. Now, when I use marcdump it does not
 crash and burn on tor.marc, but it does say there are 121 errors.

 I did play a bit with yaz-marcdump to seemingly convert things from marc-8
 to utf-8, but I'm not so sure it does what is expected. Does it actually
 convert characters, or does it simply change a value in the leader of each
 record? If the former, then how do I know it is not double-encoding things?
 If the later, then my resulting data set is still broken.

 Upon reflection, I think the validation of MARC records ought to be
 exactly the same as the validation of XML. First they should be well-formed.
 Leader. Directory. Bibliographic section. Complete with ASCII characters 29,
 30, and 31 in the proper locations. Second, they should validate. This means
 fields where integers are expected should include integers. It means there
 are characters in 245. Etc. Third, the data should be meaningful. The
 characters in 245 should be titles. The characters in 020 should be ISN
 numbers (not ISBN number and then (pbk)). Etc. Finally, the data should be
 accurate. The titles placed in 245 are the real titles. The author names are
 the real author names. Etc. Validations #1-#3 can be done by computers.
 Validation #4 is the work of humans.

 If MARC records are not well-formed and do not validate according to the
 standard, then just like XML processors, they should be used. Garbage in.
 Garbage out.





Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Jon Gorman
 I'm making headway on my MARC records, but only through the use of brute 
 force.

 I used wget to retrieve the MARC records (as well as associated PDF and text 
 files) from the
 Internet Archive.

I know from past experience that IA has some bad MARC records (and also
records w/ bad encoding).  I'm also not sure what the web server / wget
might do to the files along the way.

 I did play a bit with yaz-marcdump to seemingly convert things from marc-8 to 
 utf-8, but I'm not so
 sure it does what is expected. Does it actually convert characters, or does 
 it simply change a
 value in the leader of each record? If the former, then how do I know it is 
 not double-encoding
things? If the later, then my resulting data set is still broken.

There was a bug I seem to remember with yaz-marcdump where it was just
toggling the leader.  (Or a design flaw where you had to specify a
character conversion as well.)  But I thought that was fixed a while
ago.  It's probably one of the better tools out there for this type
of stuff.
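
If memory serves, the explicit form is something like the following, converting 
the character data and flipping leader/09 to 'a' (decimal 97) in one pass -- but 
check the documentation for your version:

  yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 tor.marc > tor-utf8.marc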

 If MARC records are not well-formed and do not validate according to the 
 standard, then just like
 XML processors, they should be used. Garbage in. Garbage out.

I'm guessing you meant they shouldn't be used? ;).  XML processors
aren't really known for flexibility in this regard.

Unfortunately there are a lot of issues here.  Not least, some of the
worst problems I've seen are introduced by well-meaning folks who do
things like dump a file out into MARCXML or a marc-breaker format and
twiddle with bits, or start using tools to dump Unicode text into
what is really a marc-8 file.  Then, at some point in the pipeline,
enough character encoding conversions happen that the file ends up
being messed up.

And then there's always the legacy data that got bungled up in an
encoding transfer.  I know we've got some bad CJK characters due to
this.  At some point in converting our marc-8 records one or two
characters got mapped to something that's not in the unicode spec at
all.  At some point we'll clean up those records, you know, when we've
got some spare time :P.

The problem here has been that the records pass whatever internal
validations the tools enforce.  Probably more stages need to check for
validity, but there are a lot of records that would fail if they did.
(I don't even want to think about how many people disable validation,
or use the same software stack that generated the MARC in the first
place, or about the changes within the MARC spec itself over time that
make validation even more difficult.)

Jon Gorman


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-07 Thread Tod Olson
yaz-marcdump does a really good job of charset and format conversion for MARC 
records, and is blindingly fast.

But yaz-marcdump seems to think there are a lot of separators in the wrong 
place and bad indicator data, whether treating the records as UTF-8 or MARC-8.  
The leaders in the records say they are UTF-8, but looking at the data, the 
byte sequences that Jon G. noticed remind me of UTF-8 data that was 
UTF-8-encoded a second time.  I wonder if they got re-encoded in transmission 
somewhere along the way.  Maybe just in the download from zoia.

-Tod

On Apr 6, 2011, at 4:11 PM, Jonathan Rochkind wrote:

 That's hilarious, that Terry has had to do enough ugliness with Marc 
 encodings that he indeed can recognize 0xC2 off the bat as the Marc8 
 encoding it represents!  I am in awe, as well as sympathy.
 
 If the record is in Marc8, then you need to know if Perl Batch::Marc can 
 handle Marc8.  If it's supposed to be able to handle it, you need to 
 figure out why it's not. (leader byte says UTF-8 even though it's really 
 Marc8?).
 
 If Batch::Marc can't handle Marc8, you need to convert to UTF-8 first. 
 The only software package I know of that can convert from and to Marc8 
 encoding is Java Marc4J, but I wouldn't be shocked if there was 
 something in Perl to do it. (But yes, as you can tell by the name, 
 Marc8 is a character encoding ONLY used in Marc, nobody but library 
 people write software for dealing with it).
 
 On 4/6/2011 5:01 PM, Reese, Terry wrote:
 I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker 
 in MARC-8.  I'd guess the file isn't in UTF8.
 
 --TR
 
 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 1:28 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
 
 I am not familar with that Perl module. But I'm more familiar then I'd want
 with char encoding in Marc.
 
 I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
 familiar with in past debugging, but I've forgotten em), but the first 
 things to
 look at:
 
 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
 Theoretically there is a Marc leader byte that tells you whether it's
 Marc8 or UTF-8, but the leader byte is often wrong in real world records.  
 Is it
 wrong?
 
 2. Does Perl MARC::Batch  have a function to convert from Marc8 to
 UTF-8?   If so, how does it decide whether to convert? Is it trying to
 do that?  Is it assuming that the leader byte the record accurately
 identifies the encoding, and if so, is the leader byte wrong?   Is it
 trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
 first place?  Or is it assuming the source was UTF-8 in the first place, 
 when in
 fact it was Marc8?
 
 Not the answer you wanted, maybe someone else will have that. Debugging
 char encoding is hands down the most annoying kind of debugging I ever do.
 
 On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
 Ack! While using the venerable Perl MARC::Batch module I get the
 following error while trying to read a MARC record:
utf8 \xC2 does not map to Unicode
 
 This is a real pain, and I'm hoping someone here can help me either: 1) 
 trap
 this error allowing me to move on, or 2) figure out how to open the file
 correctly.

Tod Olson t...@uchicago.edu
Systems Librarian
University of Chicago Library


[CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
Ack! While using the venerable Perl MARC::Batch module I get the following 
error while trying to read a MARC record:

  utf8 \xC2 does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap 
this error allowing me to move on, or 2) figure out how to open the file 
correctly.
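
A minimal sketch of what option #1 might look like -- wrapping each call to 
next() in an eval, with MARC::Batch's strict_off() and warnings_off() turned on; 
whether the batch recovers cleanly after a fatal decoding error may depend on 
the module version, so treat this as a sketch rather than a fix:

  use strict;
  use warnings;
  use MARC::Batch;

  my $batch = MARC::Batch->new( 'USMARC', 'tor.marc' );
  $batch->strict_off();      # don't die on structural errors
  $batch->warnings_off();    # keep per-record warnings quiet

  my ( $good, $bad ) = ( 0, 0 );
  while (1) {
      my $record = eval { $batch->next() };
      if ($@) { $bad++; next; }      # trap fatal errors (e.g. bad utf8), move on
      last unless defined $record;   # end of file
      $good++;
  }
  print "read $good records, skipped $bad\n";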

-- 
Eric Morgan


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
I am not familiar with that Perl module. But I'm more familiar than I'd 
want with char encoding in Marc.


I don't recognize the byte 0xC2 (there are some bytes I became 
pathetically familiar with in past debugging, but I've forgotten 'em), 
but the first things to look at:


1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8. 
Theoretically there is a Marc leader byte that tells you whether it's 
Marc8 or UTF-8, but the leader byte is often wrong in real world 
records.  Is it wrong?


2. Does Perl MARC::Batch  have a function to convert from Marc8 to 
UTF-8?   If so, how does it decide whether to convert? Is it trying to 
do that?  Is it assuming that the leader byte of the record accurately 
identifies the encoding, and if so, is the leader byte wrong?   Is it 
trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the 
first place?  Or is it assuming the source was UTF-8 in the first place, 
when in fact it was Marc8?


Not the answer you wanted, maybe someone else will have that. Debugging 
char encoding is hands down the most annoying kind of debugging I ever do.


On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:

Ack! While using the venerable Perl MARC::Batch module I get the following 
error while trying to read a MARC record:

   utf8 \xC2 does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap this error 
allowing me to move on, or 2) figure out how to open the file correctly.



Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Can you share the record somewhere?  I suspect many of us have tools we
can turn loose on it.

Ralph

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 4:28 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
 
 I am not familar with that Perl module. But I'm more familiar then I'd
 want with char encoding in Marc.
 
 I don't recognize the bytes 0xC2 (there are some bytes I became
 pathetically familiar with in past debugging, but I've forgotten em),
 but the first things to look at:
 
 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
 Theoretically there is a Marc leader byte that tells you whether it's
 Marc8 or UTF-8, but the leader byte is often wrong in real world
 records.  Is it wrong?
 
 2. Does Perl MARC::Batch  have a function to convert from Marc8 to
 UTF-8?   If so, how does it decide whether to convert? Is it trying to
 do that?  Is it assuming that the leader byte the record accurately
 identifies the encoding, and if so, is the leader byte wrong?   Is it
 trying to convert from Marc8 to UTF-8, when the source was UTF-8 in
the
 first place?  Or is it assuming the source was UTF-8 in the first
place,
 when in fact it was Marc8?
 
 Not the answer you wanted, maybe someone else will have that.
Debugging
 char encoding is hands down the most annoying kind of debugging I ever
do.
 
 On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
  Ack! While using the venerable Perl MARC::Batch module I get the
following
 error while trying to read a MARC record:
 
 utf8 \xC2 does not map to Unicode
 
  This is a real pain, and I'm hoping someone here can help me either:
1) trap
 this error allowing me to move on, or 2) figure out how to open the
file correctly.
 


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote:

 Ack! While using the venerable Perl MARC::Batch module I get the
 following error while trying to read a MARC record:
 
   utf8 \xC2 does not map to Unicode
 
 Can you share the record somewhere?  I suspect many of us have tools we
 can turn loose on it.

Sure, thanks. Try:

  http://zoia.library.nd.edu/tmp/tor.marc

-- 
Eric Lease Morgan


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Reese, Terry
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in 
MARC-8.  I'd guess the file isn't in UTF8.

--TR

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Jonathan Rochkind
 Sent: Wednesday, April 06, 2011 1:28 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
 
 I am not familar with that Perl module. But I'm more familiar then I'd want
 with char encoding in Marc.
 
 I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
 familiar with in past debugging, but I've forgotten em), but the first things 
 to
 look at:
 
 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
 Theoretically there is a Marc leader byte that tells you whether it's
 Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is 
 it
 wrong?
 
 2. Does Perl MARC::Batch  have a function to convert from Marc8 to
 UTF-8?   If so, how does it decide whether to convert? Is it trying to
 do that?  Is it assuming that the leader byte the record accurately
 identifies the encoding, and if so, is the leader byte wrong?   Is it
 trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
 first place?  Or is it assuming the source was UTF-8 in the first place, when 
 in
 fact it was Marc8?
 
 Not the answer you wanted, maybe someone else will have that. Debugging
 char encoding is hands down the most annoying kind of debugging I ever do.
 
 On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
  Ack! While using the venerable Perl MARC::Batch module I get the
 following error while trying to read a MARC record:
 
 utf8 \xC2 does not map to Unicode
 
  This is a real pain, and I'm hoping someone here can help me either: 1) trap
 this error allowing me to move on, or 2) figure out how to open the file
 correctly.
 


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Lol!

So right off the bat I see that the leader says the record is 1091 bytes
long, but it is actually 1089 bytes long and I end up missing the leader
for the next record.  Maybe a CR/LF problem?  I see that frequently as a
way to mangle MARC records when moving them around.

Is your problem in the very first record?
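
A quick way to find records like that is to compare each record's declared 
length -- leader positions 00-04 -- with the number of bytes actually read. 
A rough sketch, assuming the file is local:

  use strict;
  use warnings;

  open my $fh, '<:raw', 'tor.marc' or die "cannot open: $!";
  local $/ = "\x1d";                           # MARC record terminator
  my $i = 0;
  while ( my $raw = <$fh> ) {
      $i++;
      my $declared = substr( $raw, 0, 5 ) + 0;   # leader/00-04: record length
      my $actual   = length $raw;
      print "record $i: leader says $declared bytes, got $actual\n"
          if $declared != $actual;
  }
  close $fh;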

Ralph

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
Of
 Eric Lease Morgan
 Sent: Wednesday, April 06, 2011 4:55 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
 
 On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote:
 
  Ack! While using the venerable Perl MARC::Batch module I get the
  following error while trying to read a MARC record:
 
utf8 \xC2 does not map to Unicode
 
  Can you share the record somewhere?  I suspect many of us have tools
we
  can turn loose on it.
 
 Sure, thanks. Try:
 
   http://zoia.library.nd.edu/tmp/tor.marc
 
 --
 Eric Lease Morgan


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
That's hilarious, that Terry has had to do enough ugliness with Marc 
encodings that he indeed can recognize 0xC2 off the bat as the Marc8 
encoding it represents!  I am in awe, as well as sympathy.


If the record is in Marc8, then you need to know if Perl MARC::Batch can 
handle Marc8.  If it's supposed to be able to handle it, you need to 
figure out why it's not. (leader byte says UTF-8 even though it's really 
Marc8?).


If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. 
The only software package I know of that can convert from and to Marc8 
encoding is Java Marc4J, but I wouldn't be shocked if there was 
something in Perl to do it. (But yes, as you can tell by the name, 
Marc8 is a character encoding ONLY used in Marc, nobody but library 
people write software for dealing with it).
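
For what it's worth, the MARC::Charset module on CPAN provides marc8_to_utf8() 
and utf8_to_marc8() functions, so there may be no need to reach for Marc4J. A 
tiny, hypothetical sketch, assuming $record is a MARC::Record whose field data 
really is Marc8:

  use MARC::Charset qw( marc8_to_utf8 );

  # convert one subfield's MARC-8 value to UTF-8
  my $title_marc8 = $record->field('245')->subfield('a');
  my $title_utf8  = marc8_to_utf8($title_marc8);

Converting every subfield of every field, and then fixing the leader, is left 
as an exercise.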


On 4/6/2011 5:01 PM, Reese, Terry wrote:

I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in 
MARC-8.  I'd guess the file isn't in UTF8.

--TR


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Wednesday, April 06, 2011 1:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

I am not familar with that Perl module. But I'm more familiar then I'd want
with char encoding in Marc.

I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
familiar with in past debugging, but I've forgotten em), but the first things to
look at:

1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
Theoretically there is a Marc leader byte that tells you whether it's
Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is it
wrong?

2. Does Perl MARC::Batch  have a function to convert from Marc8 to
UTF-8?   If so, how does it decide whether to convert? Is it trying to
do that?  Is it assuming that the leader byte the record accurately
identifies the encoding, and if so, is the leader byte wrong?   Is it
trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
first place?  Or is it assuming the source was UTF-8 in the first place, when in
fact it was Marc8?

Not the answer you wanted, maybe someone else will have that. Debugging
char encoding is hands down the most annoying kind of debugging I ever do.

On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:

Ack! While using the venerable Perl MARC::Batch module I get the

following error while trying to read a MARC record:

utf8 \xC2 does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap

this error allowing me to move on, or 2) figure out how to open the file
correctly.


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jon Gorman
I'm not quite convinced that it's marc-8 just because there's \xC2 ;).
If you look at a hex dump you'll see a lot of what might be combining
characters.  The leader appears to have 'a' in the field to indicate
unicode.  In the raw hex I'm seeing a lot of two-byte sequences
like: 756c 69c3 83c2 a872 (which displays as "culiÃ¨r").  If I knew my
utf-8 better, I could guess what combining diacritics these are.  Doing
a look up on http://www.fileformat.info seems to indicate that this
might be utf-8 -- a 'DIAERESIS', for instance.
When debugging any encoding issue it's always good to know:

a) how the records were obtained
b) how have they been manipulated before you touch them (basically,
how many times may they have been converted by some bungling process)?
c) what encoding they claim to be now? and
d) what encoding they are, if any?


It's been a while since I used MARC::Batch.  Is there any reason
you're using that instead of just using MARC::Record?  I'd try just
creating a MARC::Record object.
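
Something along these lines, for instance -- a sketch, untested:

  use strict;
  use warnings;
  use MARC::File::USMARC;

  # read the file one record at a time, without the MARC::Batch wrapper
  my $file = MARC::File::USMARC->in('tor.marc') or die "cannot open tor.marc";
  while ( my $record = $file->next() ) {
      print $record->title(), "\n";
  }
  $file->close();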

I've seen people do really bizarre things to break MARC files, such as
editing the raw binary (thus invalidating the leader and the directory,
as the byte counts were no longer right).

I hate to say it, but we still come across files that are no longer in
any encoding due to too many bad conversions.  It's possible these are
as well.

The enca tool (haven't used it much) guesses this as utf-8 mixed w/
non-text data.

Jon


Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread William Denton

On 6 April 2011, Eric Lease Morgan wrote:


 http://zoia.library.nd.edu/tmp/tor.marc


Happily, Kevin's magic formula recognizes this as MARC!

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org