Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:

> > http://zoia.library.nd.edu/tmp/tor.marc
>
> When debugging any encoding issue it's always good to know: a) how the
> records were obtained, b) how they have been manipulated before you
> touched them (basically, how many times they may have been converted by
> some bungling process), c) what encoding they claim to be now, and
> d) what encoding they actually are, if any.

I'm making headway on my MARC records, but only through the use of brute force.

I used wget to retrieve the MARC records (as well as the associated PDF and text files) from the Internet Archive. The process resulted in 538 records. I then used marcdump to look at the records individually. When it choked on some weird character, I renamed the offending file and re-examined the lot. Through this process my pile of records dwindled to 523. I then concatenated the non-offending records into a single file and made it available, again, at the URL above. Now, when I run marcdump against tor.marc it does not crash and burn, but it does report 121 errors.

I did play a bit with yaz-marcdump to convert things from MARC-8 to UTF-8, but I'm not sure it does what is expected. Does it actually convert the characters, or does it simply change a value in the leader of each record? If the former, then how do I know it is not double-encoding things? If the latter, then my resulting data set is still broken.

Upon reflection, I think the validation of MARC records ought to work exactly like the validation of XML. First, records should be well-formed: leader, directory, bibliographic section, complete with ASCII characters 29, 30, and 31 in the proper locations. Second, they should validate: fields where integers are expected should contain integers, there should be characters in 245, etc. Third, the data should be meaningful: the characters in 245 should be a title, and the characters in 020 should be an ISBN number (not an ISBN number followed by "(pbk)"), etc. Finally, the data should be accurate: the titles placed in 245 are the real titles, the author names are the real author names, etc.

Validations #1-#3 can be done by computers. Validation #4 is the work of humans. If MARC records are not well-formed and do not validate according to the standard, then just like XML processors, they should be used. Garbage in. Garbage out.

--
Eric Lease Morgan
University of Notre Dame

Great Books Survey -- http://bit.ly/auPD9Q
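P.S. For what it's worth, my reading of the yaz-marcdump documentation (treat this as an untested sketch, not gospel) is that you have to ask for both the byte conversion and the leader fix explicitly:

  yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 records.mrc > records-utf8.mrc

where records.mrc stands in for the file being converted, -f and -t name the source and target encodings, -o marc keeps the output in ISO 2709, and -l 9=97 sets leader position 9 to 'a' (ASCII 97) so the leader agrees with the new encoding. Which still leaves my question: if the leader already -- wrongly -- says UTF-8, how would I know a conversion like this isn't double-encoding?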
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
XML well-formedness and validity checks can't find badly encoded characters either -- char data that claims to be in one encoding but is really in another, or that has been double-encoded and now means something different than intended. There's really no way to catch that but heuristics. All of the MARC-validating and well-formedness-checking in the world won't prevent this problem if people and software don't properly keep track of their encodings and keep mis-encoded chars out of the data.

On 4/11/2011 11:31 AM, Eric Lease Morgan wrote:

> Upon reflection, I think the validation of MARC records ought to work
> exactly like the validation of XML. [...] Validations #1-#3 can be done
> by computers. Validation #4 is the work of humans. [...] Garbage in.
> Garbage out.
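P.S. One such heuristic, sketched in Perl (untested; note that Encode's FB_CROAK fallback consumes its input, hence the throwaway copies): bytes that decode as UTF-8, whose code points all fit in Latin-1, and that then decode as UTF-8 a second time were very likely encoded twice. Genuine Latin-1-range text can still false-positive, so treat a hit as a flag for human review, not proof.

    use Encode qw(decode encode FB_CROAK);

    sub maybe_double_encoded {
        my ($bytes) = @_;

        # Pass 1: must be valid UTF-8 at all.
        my $once = eval { decode('UTF-8', my $b = $bytes, FB_CROAK) };
        return 0 unless defined $once;

        # Every decoded code point must fit back into one Latin-1 byte.
        my $latin1 = eval { encode('ISO-8859-1', my $c = $once, FB_CROAK) };
        return 0 unless defined $latin1;

        # Pass 2: the round-tripped bytes decode as UTF-8 again, and the
        # result is not plain ASCII. That is the double-encoding signature.
        my $twice = eval { decode('UTF-8', my $d = $latin1, FB_CROAK) };
        return defined $twice && $twice =~ /[^\x00-\x7F]/ ? 1 : 0;
    }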
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On 11 April 2011 16:40, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> XML well-formedness and validity checks can't find badly encoded
> characters either ... There's really no way to catch that but
> heuristics. [...]

Right. Double-encoding, or encoding one way while telling the record you did it another way, is a data-level pilot error -- on a par with the kind of error where someone means to type "you're" but types "your". The error is not with the MARC record, but with the data that's been put INTO the MARC record.

-- Mike.
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
> I'm making headway on my MARC records, but only through the use of
> brute force. I used wget to retrieve the MARC records (as well as the
> associated PDF and text files) from the Internet Archive.

I know IA has some bad MARC records (and also records with bad encoding) from my experience with them in the past. I'm also not sure what the web server / wget may have done to the files.

> I did play a bit with yaz-marcdump to convert things from MARC-8 to
> UTF-8, but I'm not sure it does what is expected. Does it actually
> convert the characters, or does it simply change a value in the leader
> of each record? If the former, then how do I know it is not
> double-encoding things? If the latter, then my resulting data set is
> still broken.

There was a bug I seem to remember with yaz-marcdump where it was just toggling the leader (or a design flaw where you had to specify a character conversion as well), but I thought that was fixed a while ago. It's probably one of the better tools out there for this type of stuff.

> If MARC records are not well-formed and do not validate according to
> the standard, then just like XML processors, they should be used.
> Garbage in. Garbage out.

I'm guessing you meant they shouldn't be used? ;) XML processors aren't really known for flexibility in this regard.

Unfortunately there are a lot of issues here. Not least, some of the worst issues I've seen are introduced by well-meaning folks who do things like dump a file out into MARCXML or a MARC-breaker format, twiddle with the bits, and use tools that dump Unicode text into what is really a MARC-8 file. Then, at some point in the pipeline, enough character-encoding conversions happen that the file ends up thoroughly messed up.

And then there's always the legacy data that got bungled in an earlier encoding transfer. I know we've got some bad CJK characters due to this: at some point in converting our MARC-8 records, one or two characters got mapped to something that's not in the Unicode spec at all. At some point we'll clean up those records -- you know, when we've got some spare time. :P

The problem here has been the tools: they pass whatever internal validations are enforced. Probably more stages need to check for validity, but there are a lot of records that would fail if they did. (I don't even want to think about how many people disable validation, or use the same software stack that generated the MARC in the first place, or about the changes within the MARC spec itself over time that make validation even more difficult.)

Jon Gorman
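P.S. A crude first validity gate along those lines: check whether each record's bytes really are legal UTF-8 whenever the leader claims Unicode. An untested Perl sketch, pointed at Eric's tor.marc:

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    local $/ = "\x1D";                 # MARC record terminator
    open my $fh, '<:raw', 'tor.marc' or die "tor.marc: $!";
    while (my $raw = <$fh>) {
        next if length($raw) < 24;     # ignore a trailing fragment
        my $claims_utf8 = substr($raw, 9, 1) eq 'a';            # leader/09
        my $decodes = eval { decode('UTF-8', my $copy = $raw, FB_CROAK); 1 };
        printf "record %d: leader claims %s, bytes %s valid UTF-8\n",
            $., $claims_utf8 ? 'UTF-8' : 'MARC-8',
            $decodes ? 'are' : 'are NOT';
    }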
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
yaz-marcdump does a really good job of charset and format conversion for MARC records, and it is blindingly fast. But yaz-marcdump seems to think there are a lot of separators in the wrong place and a lot of bad indicator data, whether it treats the records as UTF-8 or MARC-8.

The leaders in the records say they are UTF-8, but looking at the data, the byte sequences that Jon G. noticed remind me of UTF-8 data that was UTF-8-encoded a second time. I wonder if they got re-encoded in transmission somewhere along the way. Maybe just in the download from zoia.

-Tod

On Apr 6, 2011, at 4:11 PM, Jonathan Rochkind wrote:

> That's hilarious -- Terry has had to do enough ugliness with MARC
> encodings that he can recognize 0xC2 off the bat as the MARC-8
> character it represents! I am in awe, as well as in sympathy. [...]

Tod Olson <t...@uchicago.edu>
Systems Librarian
University of Chicago Library
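P.S. One quick way to test that hunch: when Latin-1-range characters get UTF-8-encoded twice, the lead byte of each pair (0xC2 or 0xC3) is itself re-encoded, so the sequences 0xC3 0x82 ("Â") and 0xC3 0x83 ("Ã") show up all over the file. A rough one-liner (it can false-positive on records that legitimately contain Â or Ã):

    perl -ne 'BEGIN { $/ = "\x1D" }
              print "record $.: possible double encoding\n"
                  if /\xC3[\x82\x83]/' tor.marc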
[CODE4LIB] utf8 \xC2 does not map to Unicode
Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record:

  utf8 \xC2 does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap this error, allowing me to move on, or 2) figure out how to open the file correctly.

--
Eric Morgan
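P.S. For option #1, the pattern I have been playing with is to turn off strict mode and wrap each read in an eval (a sketch only; I have not convinced myself it survives every kind of bad record):

    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'tor.marc');
    $batch->strict_off();    # warn and skip bad records instead of dying

    my $count = 0;
    while (1) {
        my $record = eval { $batch->next() };
        if ($@) {            # trap any die() the parser still throws
            warn "skipping unreadable record: $@";
            next;
        }
        last unless defined $record;    # undef means end of file
        $count++;
    }
    print "read $count records\n";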
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I am not familiar with that Perl module, but I'm more familiar than I'd want to be with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten 'em), but the first things to look at:

1. Is your MARC file encoded in MARC-8 or UTF-8? I'm betting MARC-8. Theoretically there is a MARC leader byte that tells you whether it's MARC-8 or UTF-8, but the leader byte is often wrong in real-world records. Is it wrong?

2. Does Perl MARC::Batch have a function to convert from MARC-8 to UTF-8? If so, how does it decide whether to convert? Is it trying to do that? Is it assuming that the leader byte of the record accurately identifies the encoding, and if so, is the leader byte wrong? Is it trying to convert from MARC-8 to UTF-8 when the source was UTF-8 in the first place? Or is it assuming the source was UTF-8 in the first place when in fact it was MARC-8?

Not the answer you wanted; maybe someone else will have that. Debugging char encoding is hands down the most annoying kind of debugging I ever do.

On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:

> Ack! While using the venerable Perl MARC::Batch module I get the
> following error while trying to read a MARC record:
>
>   utf8 \xC2 does not map to Unicode
>
> This is a real pain, and I'm hoping someone here can help me either:
> 1) trap this error, allowing me to move on, or 2) figure out how to
> open the file correctly.
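P.S. For question 1, something like this should show what each record claims (untested sketch; note it bails out at the first record the parser cannot read at all):

    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'tor.marc');
    $batch->strict_off();    # complain but keep going where it can

    # Leader position 9 is 'a' for Unicode and blank for MARC-8.
    while (my $record = eval { $batch->next() }) {
        my $ldr09 = substr($record->leader(), 9, 1);
        printf "%s => leader/09 '%s' (%s)\n",
            $record->title() || '(no 245)',
            $ldr09,
            $ldr09 eq 'a' ? 'claims UTF-8' : 'claims MARC-8';
    }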
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
Can you share the record somewhere? I suspect many of us have tools we can turn loose on it.

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 4:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

I am not familiar with that Perl module, but I'm more familiar than I'd want to be with char encoding in MARC. [...]
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote:

> > Ack! While using the venerable Perl MARC::Batch module I get the
> > following error while trying to read a MARC record:
> >
> >   utf8 \xC2 does not map to Unicode
>
> Can you share the record somewhere? I suspect many of us have tools we
> can turn loose on it.

Sure, thanks. Try:

  http://zoia.library.nd.edu/tmp/tor.marc

--
Eric Lease Morgan
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8. I'd guess the file isn't in UTF-8.

--TR

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 1:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

I am not familiar with that Perl module, but I'm more familiar than I'd want to be with char encoding in MARC. [...]
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
Lol! So, right off the bat I see that the leader says the record is 1091 bytes long, but it is actually 1089 bytes long, and I end up missing the leader for the next record. Maybe a CR/LF problem? I see that frequently as a way to mangle MARC records when moving them around. Is your problem in the very first record?

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: Wednesday, April 06, 2011 4:55 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

Sure, thanks. Try: http://zoia.library.nd.edu/tmp/tor.marc [...]
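If it helps, here's a quick scan for length mismatches across the whole file (untested sketch; leader positions 0-4 carry the logical record length, terminator included, as five zero-padded digits):

    use strict;
    use warnings;

    local $/ = "\x1D";                     # MARC record terminator
    open my $fh, '<:raw', 'tor.marc' or die "tor.marc: $!";
    while (my $raw = <$fh>) {
        $raw =~ s/\A[\r\n]+//;             # shrug off stray CR/LF between records
        next if length($raw) < 24;         # ignore a trailing fragment
        my $declared = substr($raw, 0, 5); # leader/00-04
        if ($declared !~ /\A[0-9]{5}\z/) {
            print "record $.: leader length is not numeric\n";
            next;
        }
        printf "record %d: leader says %d bytes, found %d\n",
            $., $declared, length($raw)
            if $declared != length($raw);
    }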
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
That's hilarious -- Terry has had to do enough ugliness with MARC encodings that he can recognize 0xC2 off the bat as the MARC-8 character it represents! I am in awe, as well as in sympathy.

If the record is in MARC-8, then you need to know whether Perl MARC::Batch can handle MARC-8. If it's supposed to be able to handle it, you need to figure out why it's not (leader byte says UTF-8 even though it's really MARC-8?). If MARC::Batch can't handle MARC-8, you need to convert to UTF-8 first. The only software package I know of that can convert from and to the MARC-8 encoding is Java's Marc4J, but I wouldn't be shocked if there were something in Perl to do it. (But yes, as you can tell by the name, MARC-8 is a character encoding used ONLY in MARC; nobody but library people write software for dealing with it.)

On 4/6/2011 5:01 PM, Reese, Terry wrote:

> I'd echo Jonathan's question -- the 0xC2 code is the sound recording
> marker in MARC-8. I'd guess the file isn't in UTF-8.
>
> --TR [...]
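P.S. There may indeed be something in Perl: if memory serves, the MARC::Charset module on CPAN exports a marc8_to_utf8() function. A sketch, with the caveat that you should check the module's documentation rather than trust my memory:

    use MARC::Charset qw(marc8_to_utf8);

    # Convert a MARC-8 byte string into a Perl Unicode string. In a real
    # pipeline you would run this over each field's data, then set
    # leader/09 to 'a' before writing the record back out.
    my $marc8 = "\xC2";        # the byte that started this thread
    my $utf8  = marc8_to_utf8($marc8);
    print "$utf8\n";           # should print the sound recording mark Terry named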
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I'm not quite convinced that it's MARC-8 just because there's a \xC2 in it ;). If you look at a hex dump, I'm seeing a lot of what might be combining characters, and the leader has an 'a' in the position that indicates Unicode. In the raw hex I'm seeing a lot of multi-byte sequences like:

  756c 69c3 83c2 a872

If I knew my UTF-8 better, I could guess what diacritics these are. A lookup on http://www.fileformat.info seems to indicate that this might be UTF-8: a DIAERESIS.

When debugging any encoding issue it's always good to know: a) how the records were obtained, b) how they have been manipulated before you touched them (basically, how many times they may have been converted by some bungling process), c) what encoding they claim to be now, and d) what encoding they actually are, if any.

It's been a while since I used MARC::Batch. Is there any reason you're using that instead of just using MARC::Record? I'd try just creating a MARC::Record object. I've seen people do really bizarre things to break MARC files, such as editing the raw binary (thus invalidating the leader and the directory, since the byte counts were no longer right).

I hate to say it, but we still come across files that are no longer in any encoding at all due to too many bad conversions. It's possible these are as well. The enca tool (I haven't used it much) guesses this file as UTF-8 mixed with non-text data.

Jon
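P.S. That byte run is also consistent with UTF-8 that was UTF-8-encoded a second time. A small (untested) demonstration of decoding it twice:

    use Encode qw(decode encode);

    # From the dump above: 'i' (0x69), then C3 83, then C2 A8, then 'r' (0x72).
    my $bytes = "\x69\xC3\x83\xC2\xA8\x72";
    my $once  = decode('UTF-8', $bytes);   # "i", U+00C3, U+00A8, "r" -- "iÃ¨r"
    my $twice = decode('UTF-8', encode('ISO-8859-1', $once));
    print "$twice\n";                      # "ièr", a plausible fragment of a title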
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On 6 April 2011, Eric Lease Morgan wrote:

> http://zoia.library.nd.edu/tmp/tor.marc

Happily, Kevin's magic formula recognizes this as MARC!

Bill

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org