Re: Character sets

2004-11-23 Thread Ed Summers
On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> I have a character problem that I hope someone can help me with. In a MARC 
> record I am modifying using MARC::Record, one of the names contains letters 
> with diacritics. Looking at the name with a hex editor, it gives, with hex 
> values in curly brackets,"Bis{e5}a{f2}t{e5}i, Mu{f2}hammad." After running 
> through MARC::Record, the name now appears as "Bis{ef bf bd}a{ef bf bd}t{ef 
> bf bd}i, Mu{ef bf bd}hammad."

That's pretty odd. Any chance you could send me the MARC record? At this
time MARC::Record does not play nicely with Unicode (UTF8). 

http://rt.cpan.org/NoAuth/Bug.html?id=3707

//Ed


Re: Character sets

2004-11-24 Thread Ashley Sanders
Ed Summers wrote:
On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
I have a character problem that I hope someone can help me with. In
a MARC record I am modifying using MARC::Record, one of the names
contains letters with diacritics. Looking at the name with a hex editor,
it gives, with hex values in curly brackets,"Bis{e5}a{f2}t{e5}i,
Mu{f2}hammad." After running through MARC::Record, the name now appears
as "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."

That's pretty odd. Any chance you could send me the MARC record? At this
time MARC::Record does not play nicely with Unicode (UTF8). 

http://rt.cpan.org/NoAuth/Bug.html?id=3707
It is possible they are MARC-8 characters rather than UTF-8. In MARC-8,
E5 is "macron" and F2 is "dot below." Is MARC::Record trying to treat
them as Unicode when in fact they are MARC-8?
Ashley.
--
Ashley Sanders [EMAIL PROTECTED]
Copac http://copac.ac.uk -- A MIMAS service funded by JISC


RE: Character sets

2004-11-24 Thread Jacobs, Jane W
Hi John,

You might consider whether your problem involves the fact that most diacritics 
involve a shift from Basic Latin (ASCII) to Extended Latin and then back.  This 
is accomplished with an escape sequence.  I'm a little fuzzy on the details of 
how this works, but the raw MARC looks something like this:
Bisåaòtåi, Muòhammad 

Check out:
http://www.loc.gov/marc/specifications/specchartables.html 
And more specifically:
http://lcweb2.loc.gov/cocoon/codetables/42.html
http://lcweb2.loc.gov/cocoon/codetables/45.html


Another possibility is that you are working with G1 and the program is looking 
for G0 character sets or vice versa. 

It looks like your bracketed characters ({e5} etc.) are G1, not G0.  I'm no 
great programmer (all right, no programmer at all, really), but my experience 
is that G0 seems to be preferred.

I hope some of that might be helpful.

JJ

**Views expressed by the author do not necessarily represent those of the 
Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432

tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566



-Original Message-
From: John Hammer [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 23, 2004 5:10 PM
To: [EMAIL PROTECTED]
Subject: Character sets


I apologize if this is not the correct list to ask this question. I'm at a 
loss, however, to know where to ask.

I have a character problem that I hope someone can help me with. In a MARC 
record I am modifying using MARC::Record, one of the names contains letters 
with diacritics. Looking at the name with a hex editor, it gives, with hex 
values in curly brackets,"Bis{e5}a{f2}t{e5}i, Mu{f2}hammad." After running 
through MARC::Record, the name now appears as "Bis{ef bf bd}a{ef bf bd}t{ef bf 
bd}i, Mu{ef bf bd}hammad."

Does this have anything to do with Perl? Or is it more properly the way Linux 
(I'm using RedHat) is set up on my machine?

Any help would be appreciated.

-- 

John C. Hammer, MMus, MLIS
Automation Librarian
Library and Media Services
San Antonio College
1001 Howard St.
San Antonio, TX  78212
(210)733-2669 (v)  (210)733-2597 (f)
  [EMAIL PROTECTED]



Re: Character sets

2004-11-24 Thread Ed Summers
On Wed, Nov 24, 2004 at 08:22:47AM +, Ashley Sanders wrote:
> Is MARC::Record trying to treat than as Unicode when in fact they 
> are MARC-8?

MARC::Record currently does no transformation of character sets that 
I'm aware of. There is a separate module, MARC::Charset, which provides
some MARC8/UTF8 transformation support, but it is functionally
independent of MARC::Record.
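
For the curious, using it is a separate step, roughly along these lines
(the constructor and method names are my recollection of the MARC::Charset
docs, so double-check them; $marc8_string is just a placeholder for raw
MARC-8 data):

  use MARC::Charset;

  # Rough sketch only: convert a MARC-8 string to UTF-8 with
  # MARC::Charset, independent of any MARC::Record processing.
  my $charset = MARC::Charset->new();
  my $utf8    = $charset->to_utf8($marc8_string);   # $marc8_string: raw MARC-8 bytes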

//Ed


RE: Character sets - kind of solved?

2004-12-03 Thread Doran, Michael D
First off, Ashley's suggestion that the original encoding was likely
MARC-8 is correct.  The author's Arabic name, transliterated into the
Latin alphabet, should be "Bis{latin small letter a with macron}{latin
small letter t with dot below}{latin small letter i with macron},
Mu{latin small letter h with dot below}ammad."  I am basing this on
MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
the above name is encoded in MARC-8 then the underlying code would match
John's original code points [2]:
 > >> Looking at the name with a hex editor, it gives, with hex values
in curly brackets,
 > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."

Then the question becomes: "What happened?" 

 > >> the name now appears as
 > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."

The fact that one byte turned into three bytes suggests UTF-8 encoding.
And the fact that *both* MARC-8 combining characters (i.e. "e5" and
"f2") now appear as the *same* combination of characters (i.e. "ef bf
bd") suggests that it was not an encoding translation from one coded
character set to the equivalent codepoint in another character set.  If
we assume UTF-8 and convert UTF-8 "ef bf bd" to its Unicode code point,
we get U+FFFD [3].  If we look up U+FFFD we see that it is the
"REPLACEMENT CHARACTER" [4].  

Since MARC::Record (obviously) wouldn't object to the original MARC-8
character encoding, I'm guessing that sometime *after* processing the
record with MARC::Record it was either moved to, or viewed in, a
client/platform/environment that was not MARC-8 savvy (which is pretty
much everything), and that the client/platform/environment, not
recognizing hex e5 and f2 as valid character encodings, replaced
them with its generic replacement character.

So I'm thinking that we can rule out MARC::Record and look closer at
what happened to the data subsequent to MARC::Record processing.  That's
my guess anyway, and I'm sticking with it until I hear a better story.
;-)

[1] UCLA's Voyager ILMS has been upgraded to a Unicode version, and is
able to display the characters accurately.  My assumption is that the
author in the links below is the one in question.
See for example (looking at the title field, rather than the underlined
author/name field):
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603048
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603049
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=5053287
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=4490052

[2] In MARC-8, combining diacritic characters precede the base
character, and as Ashley pointed out, E5 is "macron" and F2 is "dot
below."

[3] hex "ef bf bd" = binary "11101111 10111111 10111101"
A three-octet UTF-8 character has the format 1110xxxx 10xxxxxx
10xxxxxx, with the "x" positions being the significant values in
determining the Unicode code point.  When we concatenate those x
position values from the above binary code, we get 1111111111111101,
which, converted to hex, is FFFD.
 
[4] See:
http://rocky.uta.edu/doran/urdu/search.cgi?char_set=unicode&char_type=hex&char_value=fffd
(or just go to http://rocky.uta.edu/doran/urdu/search.cgi and plug
in fffd)
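
For anyone who wants to reproduce the arithmetic in [3], here is a quick
Perl check (the sample string is just the bytes from John's hex dump;
Encode's default behavior on malformed input is to substitute U+FFFD):

  use Encode qw(decode encode);

  # The bytes as John reported them: MARC-8 e5/f2 preceding the base letters.
  my $bytes = "Bis\xe5a\xf2t\xe5i, Mu\xf2hammad.";

  # Decoding them *as if* they were UTF-8 turns each invalid byte into U+FFFD...
  my $chars = decode('UTF-8', $bytes);
  printf "U+%04X\n", ord(substr($chars, 3, 1));               # prints U+FFFD

  # ...and re-encoding as UTF-8 turns each replacement into "ef bf bd".
  my @octets = unpack 'C3', substr(encode('UTF-8', $chars), 3, 3);
  printf "%02x %02x %02x\n", @octets;                         # prints ef bf bd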

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, November 24, 2004 2:23 AM
> Cc: [EMAIL PROTECTED]
> Subject: Re: Character sets
> 
> Ed Summers wrote:
> > On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> > 
> >>I have a character problem that I hope someone can help me with. In
> >>a MARC record I am modifying using MARC::Record, one of the names
> >>contains letters with diacritics. Looking at the name with a hex editor,
> >>it gives, with hex values in curly brackets,"Bis{e5}a{f2}t{e5}i,
> >>Mu{f2}hammad." After running through MARC::Record, the name now appears
> >>as "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > 
> > That's pretty odd. Any chance you could send me the MARC record? At this
> > time MARC::Record does not play nicely with Unicode (UTF8). 
> > 
> > http://rt.cpan.org/NoAuth/Bug.html?id=3707
> 
> It is possible they are MARC-8 characters rather than utf-8. In MARC-8
> E5 is "macron" and F2 is "dot below." Is MARC::Record trying to treat
> than as Unicode when in fact they are MARC-8?
> 
> Ashley.
> 
> -- 
> Ashley Sanders [EMAIL PROTECTED]
> Copac http://copac.ac.uk -- A MIMAS service funded by JISC
> 


Re: Character sets - kind of solved?

2004-12-04 Thread Mike Rylander
I've run into some record encoding issues myself, though not the
problem from below.  In any case, this got me thinking about the
current state of MARC::File::XML, specifically that it could not
handle MARC8 encoded records.

I submitted a patch a while back to hack around this, but that just
lets us get the MARC records into well formed XML.  Basically, it just
lets you set the encoding on the XML to something that has embedded
8-bit characters, like ISO-8859-1, aka LATIN1.

But that is far from optimal, since the data is being misinterpreted.
So I took a look at using MARC::Charset inside MARC::File::XML, and
I've got a working patch that correctly transcodes records from
USMARC(MARC-8) to MARC21slim(UTF8) and back again.

It's attached below, if anyone would be so kind as to test it.  If all
goes well we should be able to actually use MARC::File::XML in
production.  If you do decide to test it, it requires MARC::Charset.

One (perhaps large) caveat: as of now all USMARC records are assumed
to be MARC-8 encoded, and the data within is always run through
to_utf8/to_marc8 during XML export/import.  What that means is that
the records from the problem below (containing UTF8 directly in the
data, without an encoding marker) would probably break during export
to XML.

The attached tarball contains a patched XML.pm and SAX.pm.  Replace
your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
should be good to go.  I've also included the scripts I used to test
and one of my old MARC8 encoded records.  http://redlightgreen.com
confirms that the illustrator's name is properly transcoded.


-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


marc-xml-fixup.tgz
Description: Binary data


Re: Character sets - kind of solved?

2004-12-05 Thread Ed Summers
On Sat, Dec 04, 2004 at 02:30:53PM -0500, Mike Rylander wrote:
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.

Mike, would you like CVS access privileges on the SourceForge site
so you can commit this stuff? I'm not actively using MARC::File::XML
or MARC::Charset, so it would help to have a developer who is using 
them routinely.

If you are interested I can make you a co-maintainer of the CPAN modules
as well. 

//Ed


RE: Character sets - kind of solved?

2004-12-06 Thread Doran, Michael D
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.

The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
09 in the record leader indicates the character encoding: a "blank" for
MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
this and then only apply the transformation when required.  Note: I
believe the leader itself is limited to characters in the ASCII range,
so you wouldn't have to know the encoding of the record prior to parsing
the leader.
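
Something along these lines, in other words (just a sketch, not your
patch; it assumes a MARC::Record object and whatever form the to_utf8
converter ends up taking):

  # Sketch only.  Leader/09 is blank for MARC-8 and 'a' for UCS/Unicode,
  # so only transcode when the record actually says MARC-8.
  sub data_as_utf8 {
      my ($record, $charset, $data) = @_;    # $charset: assumed MARC::Charset object
      my $enc = substr($record->leader(), 9, 1);
      return $data if $enc eq 'a';           # already UCS/Unicode
      return $charset->to_utf8($data);       # blank => MARC-8, so transcode
  }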

> What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.

The original record from John Hammer did not contain UTF-8, it contained
MARC-8.  I believe that the fact that the combining MARC-8 characters
were replaced by a generic replacement character only indicates that the
app he was using to view the data (post processing by MARC::Record) was
using a character set in which hex E5 and F2, encoded as single octets,
were not valid characters in that app's character set.  That app's
character set was apparently Unicode (UTF-8) and so E5 and F2 were
replaced by U+FFFD.  That's the long way of saying that the patch should
work fine in his case.  :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Mike Rylander [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, December 04, 2004 1:31 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Character sets - kind of solved?
> 
> I've run into some record encoding issues myself, though not the
> problem from below.  In any case, this got me thinking about the
> current state of MARC::File::XML, specifically that it could not
> handle MARC8 encoded records.
> 
> I submitted a patch a while back to hack around this, but that just
> lets us get the MARC records into well formed XML.  Basically, it just
> lets you set the encoding on the XML to something that has embedded
> 8-bit characters, like ISO-8859-1, aka LATIN1.
> 
> But that is far from optimal, since the data is being misinterpreted. 
> So I took a look at using MARC::Charset inside MARC::File::XML, and
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.
> 
> It's attached below, if anyone would be so kind as to test it.  If all
> goes well we should be able to actually use MARC::File::XML in
> production.  If you do decide to test it, it requires MARC::Charset.
> 
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.  What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.
> 
> The attached tarball contains a patched XML.pm and SAX.pm.  Replace
> your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
> should be good to go.  I've also included the scripts I used to test
> and one of my old MARC8 encoded records.  http://redlightgreen.com
> confirms that the illustrators name is properly transcoded.
> 
> On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> > First off, Ashley's suggestion that the original encoding was likely
> > MARC-8 is correct.  The author's Arabic name, transliterated into the
> > Latin alphabet, should be "Bis{latin small letter a with macron}{latin
> > small letter t with dot below}{latin small letter i with macron},
> > Mu{latin small letter h with dot below}ammad."  I am basing this on
> > MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
> > the above name is encoded in MARC-8 then the underlying code would match
> > John's original code points [2]:
> >  > >> Looking at the name with a hex editor, it gives, with hex values
> > in curly brackets,
> >  > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> > 
> > Then the question becomes: "What happened?"
> > 
> >  > >> the name now appears as
> >  > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > The fact that one byte turned into three bytes, suggests UTF-8 encoding.
> > And the fact that *both* MARC-8 combining characters (i.e. &

Re: Character sets - kind of solved?

2004-12-06 Thread John Hammer
On Mon, 6 Dec 2004 08:54:21 -0600
"Doran, Michael D" <[EMAIL PROTECTED]> wrote:

> The original record from John Hammer did not contain UTF-8, it contained
> MARC-8.  I believe that the fact that the combining MARC-8 characters
> were replaced by a generic replacement character only indicates that the
> app he was using to view the data (post processing by MARC::Record) was
> using a character set in which hex E5 and F2, encoded as single octets,
> were not valid characters in that app's character set.  That app's
> character set was apparently Unicode (UTF-8) and so E5 and F2 were
> replaced by U+FFFD.  That's the long way of saying that the patch should
> work fine in his case.  :-)
> 
You are correct in assuming the locale environment is set up for UTF-8 on my 
computer. However, that wouldn't explain why the record is different 
pre-processing vs. post-processing with MARC::Record. Viewing the two records 
with the same app (in this case vi) gives different results, both incorrect.

I tried changing the locale to ISO-8859-1 but that was no help. Does this mean 
I am unable to programmatically modify records that come to me in MARC-8?

An interesting discussion. Thanks to all for your input.


-- 

John C. Hammer, MMus, MLIS
Automation Librarian
Library and Media Services
San Antonio College
1001 Howard St.
San Antonio, TX  78212
(210)733-2669 (v)  (210)733-2597 (f)
  [EMAIL PROTECTED]



Re: Character sets - kind of solved

2004-12-07 Thread Ed Summers
John Hammer wrote:
> You are correct in assuming the locale environment is set up for UTF-8
> on my computer. However, that wouldn't explain why the record is
> different pre-processing vs. post-processing with MARC::Record. Viewing
> the two records with the same app (in this case vi) gives different
> results, both incorrect.

John, can you please send me the program you are using to read/write the
MARC record, and the actual record. Also, what version of MARC::Record
are you using, and with which version of Perl?

//Ed


Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Tue, Dec 07, 2004 at 12:53:44PM -0600, John Hammer wrote:
> Attached are the two files. The Marc file seems to be using a Windows font 
> (1251?). As for the program, the same changes occur if I just read the Marc 
> file and write it back out with no changes. The Perl I am using is 5.8.3

Ok, I've confirmed that simply reading this record in and writing it out
will yield a different file. The unix diff program confirms this, but
does not isolate the difference, since MARC records are not multiline
documents. 
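
A round trip along these lines is enough to reproduce it: just MARC::Batch
in and as_usmarc() out, with no changes (the file names below are
placeholders):

  use MARC::Batch;

  # Read the record and write it straight back out, unchanged.
  my $batch = MARC::Batch->new('USMARC', 'original.dat');
  open my $out, '>', 'processed.dat' or die "processed.dat: $!";
  binmode $out;
  while (my $record = $batch->next()) {
      print {$out} $record->as_usmarc();
  }
  close $out;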

Using diff with hexdump provides some more concrete data. First hexdump the
original file and the processed file like so:

% hexdump -C original.dat > original.dump
% hexdump -C processed.dat > processed.dump

Then compare these two files with diff:

% diff original.dump processed.dump

You should see this:

148,149c148,149
< 0930  73 20 1e 1d 0a 0a |s ....|
< 0936
---
> 0930  73 20 1e 1d   |s ..|
> 0934

What this shows is that the original file has two trailing 0a bytes at
the end of the record, and that the processed file does not. This makes
sense because MARC::Record was adjusted back in v1.24 (Apr 2003) to
remove certain illegal characters between records that some library
systems place there. See line 58 in MARC::File::USMARC in the latest
version of the MARC-Record distribution if you are curious :-)
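
Roughly, that cleanup amounts to the following (a paraphrase of the
effect, not the module's actual code):

  # Some systems leave stray CR/LF bytes between the end-of-record
  # terminator (hex 1d) and the next record; MARC::Record drops them
  # when reading, so they never come back out on write.
  $raw =~ s/[\x0a\x0d]+\z//;     # e.g. the two trailing 0a bytes above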

So unless you are unable to reproduce this I think this mystery is solved.

//Ed


Re: Character sets - kind of solved

2004-12-08 Thread John Hammer
That's different from what I get. What I get is:

1c1
<   30 32 33 35 36 63 61 6d  20 20 32 32 30 30 34 38  |02356cam  220048|
---
>   30 32 33 36 34 63 61 6d  20 20 32 32 30 30 34 38  |02364cam  220048|
21,30c21,30

105,149c105,149
< 0680  20 1f 61 42 69 73 e5 61  f2 74 e5 69 2c 20 4d 75  | .aBis.a.t.i, Mu|
< 0690  f2 68 61 6d 6d 61 64 2e  1f 74 43 6f 6e 76 65 72  |.hammad..tConver|
< ... not shown>
< 0930  73 20 1e 1d 0a 0a |s ....|
< 0936
---
> 0680  20 1f 61 42 69 73 ef bf  bd 61 ef bf bd 74 ef bf  | .aBis...a...t..|
> 0690  bd 69 2c 20 4d 75 ef bf  bd 68 61 6d 6d 61 64 2e  |.i, Mu...hammad.|
< ... not shown>
> 0930  69 61 20 47 61 6c 65 27  73 20 1e 1d  |ia Gale's ..|
> 093c

How would deleting the illegal characters cause changes to the characters in 
lines 680 and 690 above?

John

On Wed, 8 Dec 2004 10:23:38 -0600
Ed Summers <[EMAIL PROTECTED]> wrote:

> On Tue, Dec 07, 2004 at 12:53:44PM -0600, John Hammer wrote:
> > Attached are the two files. The Marc file seems to be using a Windows font 
> > (1251?). As for the program, the same changes occur if I just read the Marc 
> > file and write it back out with no changes. The Perl I am using is 5.8.3
> 
> Ok, I've confirmed that simply reading this record in and writing it out
> will yield a different file. The unix diff program confirms this, but
> does not isolate the difference, since MARC records are not multiline
> documents. 
> 
> Using diff with hexdump provides some more concrete data. First hexdump the
> original file and the processed file like so:
> 
> % hexdump -C original.dat > original.dump
> % hexdump -C processed.dat > processed.dump
> 
> Then compare these two files with diff:
> 
> % diff original.dump processed.dump
> 
> You should see this:
> 
> 148,149c148,149
> < 0930  73 20 1e 1d 0a 0a |s ....|
> < 0936
> ---
> > 0930  73 20 1e 1d   |s ..|
> > 0934
> 
> What this shows is that the original file has two trailing 0a bytes at
> the end of the record, and that the processed file does not. This makes
> sense because MARC::Record was adjusted back in v1.24 (Apr 2003) to
> remove certain illegal characters between records that some library
> systems place there. See line 58 in MARC::File::USMARC in the latest
> version of the MARC-Record distribution if you are curious :-)
> 
> So unless you are unable to reproduce this I think this mystery is solved.
> 
> //Ed


Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Wed, Dec 08, 2004 at 03:31:18PM -0600, John Hammer wrote:
> How would deleting the illegal characters cause changes to the characters in 
> lines 680 and 690 above?

It doesn't explain it :) What version of MARC::Record are you using? What
happens when you use perl to read in the data and write it out, without
MARC::Record in the mix?

//Ed


Re: Character sets - kind of solved

2004-12-08 Thread John Hammer
MARC::Record version 1.39_01. Using diff there is no difference in the files 
when using Perl to read in and write out the data.

John

On Wed, 8 Dec 2004 15:43:29 -0600
Ed Summers <[EMAIL PROTECTED]> wrote:

> On Wed, Dec 08, 2004 at 03:31:18PM -0600, John Hammer wrote:
> > How would deleting the illegal characters cause changes to the characters 
> > in 
> > lines 680 and 690 above?
> 
> It doesn't explain it :) What version of MARC::Record are you using? What
> happens when you use perl to read in the data and write it out, without
> MARC::Record in the mix?
> 
> //Ed


Re: Character sets - kind of solved

2004-12-08 Thread Ed Summers
On Wed, Dec 08, 2004 at 05:47:23PM -0600, John Hammer wrote:
> MARC::Record version 1.39_01. Using diff there is no difference in the 
> files when using Perl to read in and write out the data.

Can you try downgrading to v1.38? v1.39_01 has some experimental utf8
handling code in it which was released as a beta to CPAN.

//Ed


Re: Character sets - kind of solved

2004-12-09 Thread John Hammer
That fixed the problem, going back a version. That will teach me not to use a 
beta version for production.

Thanks to all who took the time to help me with this, especially Ed.

John

On Wed, 8 Dec 2004 19:57:26 -0600
Ed Summers <[EMAIL PROTECTED]> wrote:

> On Wed, Dec 08, 2004 at 05:47:23PM -0600, John Hammer wrote:
> > MARC::Record version 1.39_01. Using diff there is no difference in the 
> > files when using Perl to read in and write out the data.
> 
> Can you try downgrading to v1.38? v1.39_01 has some experimental utf8
> handling code in it which was released as a beta to CPAN.
> 
> //Ed


Re: Character sets - kind of solved

2004-12-09 Thread Ed Summers
On Thu, Dec 09, 2004 at 10:32:25AM -0600, John Hammer wrote:
> That fixed the problem, going back a version. That will teach me not to 
> use a beta version for production.

Perhaps v1.39_01 should be removed from CPAN to avoid any further
confusion. For that matter I think it's time to remove MARC.pm as well
:) I can do the latter unless there are objections.

//Ed


RE: Character sets - kind of solved

2004-12-09 Thread Bryan Baldus
>For that matter I think it's time to remove MARC.pm as well :) I can do the
>latter unless there are objections.

I'm taking inspiration from MARC.pm for the MARC::File::MARCMaker module I'm
working on, but I've got a local copy of v. 1.13 from the SourceForge files
page. Removing the older v. 1.07 from CPAN seems to make sense.

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija


Updating MARC::File::XML (was Re: Character sets - kind of solved?)

2004-12-06 Thread Mike Rylander
On Mon, 6 Dec 2004 08:54:21 -0600, Doran, Michael D <[EMAIL PROTECTED]> wrote:
> > One (perhaps large) caveat: as of now all USMARC records are assumed
> > to be MARC-8 encoded, and the data within is always run through
> > to_utf8/to_marc8 during XML export/import.
> 
> The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
> 09 in the record leader indicates the character encoding: a "blank" for
> MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
> this and then only apply the transformation when required.  Note: I
> believe the leader itself is limited to characters in the ASCII range,
> so you wouldn't have to know the encoding of the record prior to parsing
> the leader.

Yeah.  I've got a new version that takes this into account.  The
problem is that MARC::Record on modern Perls (post 5.6) doesn't seem
to work properly with Unicode encoded records, at least not without
some Encode.pm work.  It seems to truncate fields containing combining
octets in cases where there is a valid LATIN1 (well, current system
encoding/locale, actually) version of the character, such as LATIN1
char 0xF8.  This is due to modern Perls "helping" you with string
encoding.  Because of that, I am now "downgrading" all XML Unicode
records to MARC8, though there shouldn't be any loss of data.  I am
now using the Encode module inside ...::XML.pm and ...::SAX.pm to
handle this, but until I get everything fully tested I'll continue
to re-encode records to MARC8.  Older Perls (pre 5.6) should not
actually need Encode's help, but it should not hurt in those cases.
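
To make that concrete (illustrative only, not the actual XML.pm/SAX.pm
change; the variable names are made up): the trick is keeping the
byte/character boundary explicit so Perl's implicit upgrades can't
surprise you.

  use Encode qw(decode_utf8 encode_utf8);

  # Treat what comes off the file as octets, decode explicitly,
  # and encode explicitly again before writing anything out.
  my $chars  = decode_utf8($utf8_octets);   # raw UTF-8 bytes -> Perl character string
  my $octets = encode_utf8($chars);         # Perl character string -> raw UTF-8 bytes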

> 
> > What that means is that
> > the records from the problem below (containing UTF8 directly in the
> > data, without an encoding marker) would probably break during export
> > to XML.
> 
> The original record from John Hammer did not contain UTF-8, it contained
> MARC-8.  I believe that the fact that the combining MARC-8 characters
> were replaced by a generic replacement character only indicates that the
> app he was using to view the data (post processing by MARC::Record) was
> using a character set in which hex E5 and F2, encoded as single octets,
> were not valid characters in that app's character set.  That app's
> character set was apparently Unicode (UTF-8) and so E5 and F2 were
> replaced by U+FFFD.  That's the long way of saying that the patch should
> work fine in his case.  :-)
> 

I understand.  It wasn't that I was trying to solve that particular
problem, it just got me thinking about MARC::File::XML.  Sorry for any
confusion there.

I'm using File::XML regularly now, and I'm trying to fix it up.  I am
glad that the patch should work with those records, though!

One last note.  I'm rather new to encoding issues as they pertain to
MARC8, since it cannot be implicitly handled by Perl the way some
other encodings can.  This will be evolving, and I will do
my best not to break anything and to follow the MARC standard, but
IANAL(ibrarian), so be gentle. ;)

Thanks for the pointers, and I'll send more updates here unless
everyone would rather I not. :)

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org

> 
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 
> > -Original Message-
> > From: Mike Rylander [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, December 04, 2004 1:31 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Character sets - kind of solved?
> >
> > I've run into some record encoding issues myself, though not the
> > problem from below.  In any case, this got me thinking about the
> > current state of MARC::File::XML, specifically that it could not
> > handle MARC8 encoded records.
> >
> > I submitted a patch a while back to hack around this, but that just
> > lets us get the MARC records into well formed XML.  Basically, it just
> > lets you set the encoding on the XML to something that has embedded
> > 8-bit characters, like ISO-8859-1, aka LATIN1.
> >
> > But that is far from optimal, since the data is being misinterpreted.
> > So I took a look at using MARC::Charset inside MARC::File::XML, and
> > I've got a working patch that correctly transcodes records from
> > USMARC(MARC-8) to MARC21slim(UTF8) and back again.
> >
> > It's attached below, if anyone would be so kind as to test it.  If all
> > goes well we should be able to actually use MARC::File::XML in
> > production.  If you do decide t