RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-15 Thread Doran, Michael D
Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.

While Nick Patch's presentation is excellent, I'm not sure that it "lays out 
pretty much all the issues with Unicode in perl".  ;-)

To fit that bill, I highly recommend this series of talks given by Tom 
Christiansen at OSCON 2011:

 1. Perl Unicode Essentials
 2. Unicode in Perl Regexes
 3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a 
beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

> -Original Message-
> From: Smith,Devon [mailto:smit...@oclc.org]
> Sent: Tuesday, July 31, 2012 8:26 AM
> To: William Dueber; Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: RE: printing UTF-8 encoded MARC records with as_usmarc
> 
> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices. You may find some general insight into the whole
> situation by going over it.
> 
> http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012
> 
> /dev
> --
> Devon Smith
> Consulting Software Engineer
> OCLC Research
> http://www.oclc.org/research/people/smith.htm
> 
> 
> -Original Message-
> From: William Dueber [mailto:dueb...@umich.edu]
> Sent: Monday, July 30, 2012 8:14 PM
> To: Shelley Doljack
> Cc: perl4lib@perl.org
> Subject: Re: printing UTF-8 encoded MARC records with as_usmarc
> 
> First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
> MARC-8, perhaps just lousy characters) in your MARC. I know we have
> plenty
> of that crap.
> 
> You need to tell perl that you'll be outputting UTF-8 using 'binmode':
> 
>   binmode(FILE, ':utf8');
> 
> In general, you'll want to do this to basically every file you open for
> reading or writing.
> 
> A great overview of Perl and UTF-8 can be found at:
> 
> http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
> 
> 
> 
> 
> 
> On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack
> wrote:
> 
> > Hi,
> >
> > I wrote a script that extracts marc records from a file given certain
> > conditions and puts them in a new file. When my input record is correctly
> > encoded in UTF-8 and I run my script from windows command prompt, this
> > warning message appears: "Wide character in print at record_extraction.pl
> > line 99" (the line in my script where I print to a new file using
> > as_usmarc). I compared the extracted record before and after in MarcEdit
> > and the diacritic was changed. I tried marcdump newfile.mrc to see what
> > happens and I get this error: "utf8 \xF4 does not map to Unicode at
> > C:/Perl64/lib/Encode.pm line 176." When I run my extraction script again
> > with MARC-8 encoded data then I don't have the same problem.
> >
> > The basic outline of my script is:
> >
> > my $batch = MARC::Batch->new('USMARC', $input_file);
> >
> > while (my $record = $batch->next()) {
> >  #do some checks
> >  #if checks ok then
> >  print FILE $record->as_usmarc();
> > }
> >
> > Do I need to add something that specifies to interpret the data as UTF-8?
> > Does MARC::Record not handle UTF-8 at all?
> >
> > Thanks,
> > Shelley
> >
> > 
> > Shelley Doljack
> > E-Resources Metadata Librarian
> > Metadata and Library Systems
> > Stanford University Libraries
> > sdolj...@stanford.edu
> > 650-725-0167
> >
> 
> 
> 
> --
> 
> Bill Dueber
> Programmer -- Library Systems
> University of Michigan


Re: File open head scratcher UPDATE

2012-03-17 Thread Doran, Michael D
Hi Brad,

This is interesting.  Thanks for delving deeper into this.

I'd prefer that my high level scripting language didn't make me think about 
this type of thing... but I guess that's not being realistic.  ;-)

-- Michael

Sent from my iPad

On Mar 17, 2012, at 6:17 PM, "Brad Baxter"  wrote:

> On Sat, Mar 17, 2012 at 5:25 PM, Doran, Michael D  wrote:
>> It looks like the read pointer was going to the beginning of the file on 
>> Solaris, but the end of the file on Linux.  I've edited the script to do 
>> separate opens for when I need to read the file and when I need to append to 
>> it.  I'm running the script now to check for any unintended consequences.
>> 
>> My take-away on this is to avoid the use of "+>>" to open a file.  In fact, 
>> in doing further research I saw that exact advice in the Perl Cookbook, and 
>> for just this reason.
>> 
>> Thanks to Brad Baxter for (pardon the pun) pointing me in the right 
>> direction.
>> 
>> -- Michael
> 
> FWIW, the Perl version seems to make a difference, too ...
> 
>>> cat qt
> #!/usr/local/bin/perl
> 
> use strict;
> use warnings;
> 
> system 'echo "This is a test" > test';
> 
> open my $fh, '+>>', "test" or die $!;
> print '[',<$fh>,']';
> close $fh;
> 
>>> ./qt
> [This is a test
> ]
> 
>>> /usr/local/bin/perl -v
> 
> This is perl, v5.8.8 built for sun4-solaris
> 
>>> cat qt
> #!/usr/local/bin/perl
> 
> use strict;
> use warnings;
> 
> system 'echo "This is a test" > test';
> 
> open my $fh, '+>>', "test" or die $!;
> print '[',<$fh>,']';
> close $fh;
> 
>>> ./qt
> []
> 
>>> /usr/local/bin/perl -v
> 
> This is perl 5, version 12, subversion 1 (v5.12.1) built for sun4-solaris


RE: File open head scratcher

2012-03-17 Thread Doran, Michael D
Hi Dan,

> For what it's worth, "Mixing reads and writes" in perlopentut says
> that you probably want:
> 
> open (my $DATEFILE, "+<", $date_file) ...

Hmmm.  I may give that a try, too!  Thanks!

-- Michael

> -Original Message-
> From: deni...@gmail.com [mailto:deni...@gmail.com] On Behalf Of Dan Scott
> Sent: Saturday, March 17, 2012 4:19 PM
> To: Doran, Michael D
> Cc: perl4lib
> Subject: Re: File open head scratcher
> 
> On Sat, Mar 17, 2012 at 3:09 PM, Doran, Michael D  wrote:
> > I am migrating  a perl script from a server running perl v5.8.5 on
> Solaris 9 to a server running perl v5.12.2 on Redhat Linux 5.5.  The new
> environment doesn't seem to like the syntax I'm using to open a file, and
> I'm scratching my head over why that is the case.
> >
> > That part that is not working appears to be where it opens and reads a
> file (a file which it will later append to).  The file that is being
> opened for read and appending exists and contains data.
> >
> > This appears to be the relevant code:
> >
> >  open (my $DATEFILE, "+>>$date_file")
> >        || die "Cannot open $date_file: $!";
> 
> The head-scratching behaviour you describe, where only the system call
> outputs results, matches mine with perl 5.14.2. Maybe there's a
> difference in the versions of perl on your two systems?
> 
> For what it's worth, "Mixing reads and writes" in perlopentut says
> that you probably want:
> 
> open (my $DATEFILE, "+<", $date_file) ...
> 
> (and making that change to my copy of your script makes it work for me).
> 
> --
> Dan Scott
> Laurentian University
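
Dan's '+<' suggestion in practice: a minimal, self-contained sketch (the scratch file stands in for dates.txt; the path and contents are illustrative):

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);
use File::Temp qw(tempfile);

# Seed a scratch file standing in for dates.txt
my ($fh, $path) = tempfile();
print {$fh} "2012-03-10  full\n2012-03-11  incr\n";
close $fh;

# '+<' opens read/write with the pointer at the START of the file,
# so the initial read behaves the same on every Perl version;
# seek to EOF explicitly before appending.
open(my $DATEFILE, '+<', $path) or die "Cannot open $path: $!";
my @run_dates = <$DATEFILE>;
seek($DATEFILE, 0, SEEK_END);
print {$DATEFILE} "2012-03-12  incr\n";
close $DATEFILE;

print scalar(@run_dates), "\n";   # 2
```

This sidesteps the '+>>' ambiguity entirely: the read position is defined, and the append position is set explicitly.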


RE: File open head scratcher UPDATE

2012-03-17 Thread Doran, Michael D
It looks like the read pointer was going to the beginning of the file on 
Solaris, but the end of the file on Linux.  I've edited the script to do 
separate opens for when I need to read the file and when I need to append to 
it.  I'm running the script now to check for any unintended consequences.  

My take-away on this is to avoid the use of "+>>" to open a file.  In fact, in 
doing further research I saw that exact advice in the Perl Cookbook, and for 
just this reason.

Thanks to Brad Baxter for (pardon the pun) pointing me in the right direction.

-- Michael

> -Original Message-
> From: Doran, Michael D [mailto:do...@uta.edu]
> Sent: Saturday, March 17, 2012 2:09 PM
> To: perl4lib
> Subject: File open head scratcher
> 
> I am migrating  a perl script from a server running perl v5.8.5 on Solaris
> 9 to a server running perl v5.12.2 on Redhat Linux 5.5.  The new
> environment doesn't seem to like the syntax I'm using to open a file, and
> I'm scratching my head over why that is the case.
> 
> That part that is not working appears to be where it opens and reads a
> file (a file which it will later append to).  The file that is being
> opened for read and appending exists and contains data.
> 
> This appears to be the relevant code:
> 
>   open (my $DATEFILE, "+>>$date_file")
> || die "Cannot open $date_file: $!";
> 
>   my @run_dates = <$DATEFILE>;
> 
> 
> Any idea why this wouldn't work?
> 
> Below is a test script I'm using to isolate the behavior.  I'm using a
> system call to cat the contents of the file out, then after opening the
> file, using perl to print out the contents.
> 
> #!/m1/shared/bin/perl
> 
> use strict;
> use warnings;
> 
> my $date_file  = "/m1/incoming/ab/dates.txt";
> 
> system qq(cat $date_file);
> 
> print "\n\n  cat done \n\n";
> 
> open (my $DATEFILE, "+>>$date_file")
> || die "Cannot open $date_file: $!";
> 
> my @run_dates = <$DATEFILE>;
> 
> foreach my $foo (@run_dates) {
>   print $foo;
> }
> 
> exit(0);
> 
> 
> When the test script is run on the original server, both the system cat
> call and the foreach print output the contents of the file:
> 
> ab/ => ./fileopen.pl
> 2012-03-02  incr
> 2012-03-03  full
> 2012-03-04  incr
> 2012-03-05  incr
> 2012-03-06  incr
> 2012-03-07  incr
> 2012-03-08  incr
> 2012-03-09  incr
> 2012-03-10  full
> 2012-03-11  incr
> 
> 
>   cat done 
> 
> 2012-03-02  incr
> 2012-03-03  full
> 2012-03-04  incr
> 2012-03-05  incr
> 2012-03-06  incr
> 2012-03-07  incr
> 2012-03-08  incr
> 2012-03-09  incr
> 2012-03-10  full
> 2012-03-11  incr
> ab/ =>
> 
> When the test script is run on the new server, only the system cat call
> outputs the file contents, indicating that the dates.txt file contents are
> not being assigned to the @run_dates array:
> 
> ab/ => ./fileopen.pl
> 2012-03-02  incr
> 2012-03-03  full
> 2012-03-04  incr
> 2012-03-05  incr
> 2012-03-06  incr
> 2012-03-07  incr
> 2012-03-08  incr
> 2012-03-09  incr
> 2012-03-10  full
> 2012-03-11  incr
> 
> 
>   cat done 
> 
> ab/ =>
> 
> 
> I've tried the "three-argument" open syntax (which doesn't seem to make a
> difference):
> 
>   open my ($DATEFILE), '+>>', $date_file or die...;
> 
> Any ideas what's going on (and why)?
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/
> 



File open head scratcher

2012-03-17 Thread Doran, Michael D
I am migrating  a perl script from a server running perl v5.8.5 on Solaris 9 to 
a server running perl v5.12.2 on Redhat Linux 5.5.  The new environment doesn't 
seem to like the syntax I'm using to open a file, and I'm scratching my head 
over why that is the case.

The part that is not working appears to be where it opens and reads a file (a 
file which it will later append to).  The file that is being opened for read 
and appending exists and contains data.

This appears to be the relevant code:

  open (my $DATEFILE, "+>>$date_file")
|| die "Cannot open $date_file: $!";

  my @run_dates = <$DATEFILE>;


Any idea why this wouldn't work?

Below is a test script I'm using to isolate the behavior.  I'm using a system 
call to cat the contents of the file out, then after opening the file, using 
perl to print out the contents.

#!/m1/shared/bin/perl

use strict;
use warnings;

my $date_file  = "/m1/incoming/ab/dates.txt";

system qq(cat $date_file);

print "\n\n  cat done \n\n";

open (my $DATEFILE, "+>>$date_file")
|| die "Cannot open $date_file: $!";

my @run_dates = <$DATEFILE>;

foreach my $foo (@run_dates) {
  print $foo;
}

exit(0);


When the test script is run on the original server, both the system cat call 
and the foreach print output the contents of the file:

ab/ => ./fileopen.pl
2012-03-02  incr
2012-03-03  full
2012-03-04  incr
2012-03-05  incr
2012-03-06  incr
2012-03-07  incr
2012-03-08  incr
2012-03-09  incr
2012-03-10  full
2012-03-11  incr


  cat done 

2012-03-02  incr
2012-03-03  full
2012-03-04  incr
2012-03-05  incr
2012-03-06  incr
2012-03-07  incr
2012-03-08  incr
2012-03-09  incr
2012-03-10  full
2012-03-11  incr
ab/ =>

When the test script is run on the new server, only the system cat call outputs 
the file contents, indicating that the dates.txt file contents are not being 
assigned to the @run_dates array:

ab/ => ./fileopen.pl
2012-03-02  incr
2012-03-03  full
2012-03-04  incr
2012-03-05  incr
2012-03-06  incr
2012-03-07  incr
2012-03-08  incr
2012-03-09  incr
2012-03-10  full
2012-03-11  incr


  cat done 

ab/ =>


I've tried the "three-argument" open syntax (which doesn't seem to make a 
difference):

  open my ($DATEFILE), '+>>', $date_file or die...;

Any ideas what's going on (and why)?  

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/




RE: Anyone create MFHD records using MARC/Perl

2011-09-12 Thread Doran, Michael D
Hi Mark,

Over the years, I've done a few projects that involved manipulating and/or 
creating MARC holdings (MFHD) records using the Perl MARC::Record module.  No 
problems that I know of.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/


> -Original Message-
> From: Mark Jordan [mailto:mjor...@sfu.ca]
> Sent: Friday, September 09, 2011 4:55 PM
> To: perl4lib
> Subject: Anyone create MFHD records using MARC/Perl
> 
> Hi,
> 
> Anyone know if there are any reasons that MARC::Record et al can't be used to
> create dumps of MFHD? For example, are there any leader/indicator values that
> are specific to MFHD that are illegal in MARC bib records that might cause
> MARC/Perl to puke?
> 
> Mark
> 
> Mark Jordan
> Head of Library Systems
> W.A.C. Bennett Library, Simon Fraser University
> Burnaby, British Columbia, V5A 1S6, Canada
> Voice: 778.782.5753 / Fax: 778.782.3023 / Skype: mark.jordan50
> mjor...@sfu.ca



RE: marcdump hex switch

2011-05-18 Thread Doran, Michael D
I never got an answer to this back in 2008 and thought I might have better luck 
now...

-- Michael

> -Original Message-
> From: Doran, Michael D
> Sent: Thursday, February 21, 2008 11:03 AM
> To: perl4lib@perl.org
> Subject: marcdump hex switch
> 
> I have MARC::Record 2.0 installed [1].  According to the Changes file
> marcdump now has a "--hex" switch [2]:
> 
>   [ENHANCEMENTS]
>   - Added --hex switch to marcdump, which dumps the record in
> hexadecimal.  The offsets are in decimal so that you can match
> them up to values in the leader.  The offset is reset to 0
> when we're past the directory so that you can match up the data
> with the offsets in the directory.
> 
> However I'm *not* finding that my marcdump actually has that hex switch:
> 
>   /usr/local/scripts/xml/marc => marcdump --hex test.mrc
>   Unknown option: hex
>   Usage: marcdump [options] file(s)
> 
>   Options:
> --[no]print Print a MicroLIF-style dump of each record
> --lif   Input files are MicroLIF, not USMARC
> --field=specSpecify a field spec to include.  There may be many.
> Examples:
> --field=245 --field=1XX
> --[no]quiet Print status messages
> --[no]stats Print a statistical summary by file at the end
> --version   Print version information
> --help  Print this summary
> 
> I poked around the marcdump script and didn't find anything "hex":
> 
> my $opt_print = 1;
> my $opt_quiet = 0;
> my $opt_stats = 1;
> my @opt_field = ();
> my $opt_help = 0;
> my $opt_lif = 0;
> 
> Any ideas/explanations?  The hex dump functionality would sure be handy.
> 
> -- Michael
> 
> [1] /usr/local/scripts/xml/marc => marcdump -v
> /usr/local/bin/marcdump, using MARC::Record v2.0
> 
> [2] CPAN > Revision history for Perl extension MARC::Record
> http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/
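
In the absence of the documented switch, a rough stand-in for --hex takes only a few lines. This sketch dumps a byte string with decimal offsets, 16 bytes per row; the Changes entry's nicety of resetting the offset after the directory is left out:

```perl
use strict;
use warnings;

# Hex-dump a byte string with decimal offsets, 16 bytes per line
my $data = "00755cam a2200241";   # stand-in for raw MARC bytes
my $offset = 0;
for my $chunk ($data =~ /(.{1,16})/gs) {
    printf "%5d: %s\n", $offset,
        join(' ', map { sprintf '%02x', ord } split //, $chunk);
    $offset += length $chunk;
}
```

For a real record, slurp the .mrc file in binary mode (`binmode` plus `local $/`) and feed the bytes through the same loop.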


RE: Invalid UTF-8 characters causing MARC::Record crash.

2011-05-18 Thread Doran, Michael D
Hi Al,

> For me I've found the best solution is to leave Encode.pm alone
> and redefine the offending subroutine within my processing script.

This was timely help for me, too, due to problems with fatal errors when 
processing a large file of bibs with MARC::Record.  Thanks!

(Although, when I checked, I had Encode.pm version 2.12)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/


> -Original Message-
> From: Al [mailto:ra...@berkeley.edu]
> Sent: Tuesday, May 17, 2011 9:27 AM
> To: Mike Barrett; perl4lib@perl.org
> Subject: Re: Invalid UTF-8 characters causing MARC::Record crash.
> 
>  >Anybody ever see this before?
> 
> All. The. Time.
> 
> When I use Encode.pm version 2.12 I don't have this problem. But it occurs
> repeatedly with version 2.40.
> 
> There are a few different solutions, but I'm assuming, like me, that it's
> not practical for you to clean up your MARC records *before* you try and
> process them. So you can downgrade your Encode.pm or modify it to make it
> less demanding. For me I've found the best solution is to leave Encode.pm
> alone and redefine the offending subroutine within my processing script. I
> paste this in at the bottom of every script:
> 
> package Encode;
> use Encode::Alias;
> 
> sub decode($$;$)
> {
> my ($name,$octets,$check) = @_;
> my $altstring = $octets;
> return undef unless defined $octets;
> $octets .= '' if ref $octets;
> $check ||=0;
> my $enc = find_encoding($name);
> unless(defined $enc){
>require Carp;
>Carp::croak("Unknown encoding '$name'");
> }
> my $string;
> eval { $string = $enc->decode($octets,$check); };
> $_[1] = $octets if $check and !($check & LEAVE_SRC());
> if ($@) {
>return $altstring;
> } else {
>return $string;
> }
> }
> 
> But I'll be interested in other solutions people may bring up.
> 
> Good luck!
> 
> Al
> 
> 
> At 5/17/2011, Mike Barrett wrote:
>  >I'm using MARC::Batch and MARC::Field to iterate through a text file of
>  >bibliographic records from Voyager.
>  >
>  >The unrecoverable error is actually occurring in the Perl Unicode module
>  >which is, of course, called by MARC::Record.
>  >It's running into "invalid UTF-8 character 0xC2."
>  >When I looked up the Unicode character list, all of the C2 entries are
>  >four hex characters, so it appears that the second half is missing.
>  >
>  >After looking at the records in Voyager (using Arial Unicode MS font), I
>  >find that all of the problem records I've found are maps with Field 255|a
>  >[scale] |b [projection] |c [geo cordinates].
>  >
>  >Here's an example:
>  >As it appears in the text file:  c(W 106¿¿¿30¿¿00¿¿--W
>  >104¿¿¿52¿¿30¿¿/N
>  >39¿¿¿22¿¿30¿¿--N 37¿¿¿15¿¿00¿¿).
>  >As it appears in Voyager Cataloging module:  ‡a Scale 1:126,720 â‡c (W
>  >106⁰30ʹ00ʺ--W 104⁰52ʹ30ʺ/N 39⁰22ʹ30ʺ--N 37⁰15ʹ00ʺ).
>  >
>  >
>  >Thanks,
>  >Mike Barrett
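
The 0xC2-with-no-continuation failure above is easy to reproduce, and Encode's check argument offers a gentler knob than redefining decode(): FB_DEFAULT substitutes U+FFFD instead of croaking. This is a sketch of the two behaviors; whether you can reach the decode() call buried inside MARC::Record is a separate question, which is why Al's override exists:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK FB_DEFAULT);

my $bad = "caf\xC2";   # 0xC2 is a lead byte whose second half is missing

# Strict decoding dies on the malformed sequence.  FB_CROAK may consume
# its source argument, so pass a copy.
my $copy   = $bad;
my $strict = eval { decode('UTF-8', $copy, FB_CROAK) };
print defined $strict ? "decoded\n" : "croaked\n";        # croaked

# The default fallback substitutes U+FFFD and keeps going.
my $lenient = decode('UTF-8', $bad, FB_DEFAULT);
print $lenient =~ /\x{FFFD}/ ? "replaced\n" : "kept\n";   # replaced
```
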



RE: MARC blob to MARC::Record object

2011-01-10 Thread Doran, Michael D
Hi Leif,

> you can simply read from your database (i.e. your statement handle) like you
> were reading from a file.

Very interesting!  Thanks for taking the time to post an addendum!

A bit on the bleeding edge for me, perhaps, but I may try it for the experience.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Leif Andersson [mailto:leif.anders...@sub.su.se]
> Sent: Monday, January 10, 2011 8:35 AM
> To: Doran, Michael D; perl4lib
> Subject: Re: MARC blob to MARC::Record object
> 
> I hope you will forgive me for a late addendum.
> Not only do I have to apologize for the late arrival of this post, I also
> should apologize for its (lack of) seriousness.
> Actually - this is in every respect just a programming scherzo, so to speak.
> (Even though the code below works, at least for me)
> Now you are all warned. ;-)
> 
> So: If you are used to letting MARC::Batch read the records from a file, then
> you can simply read from your database (i.e. your statement handle) like you
> were reading from a file.
> Like this:
> 
> 
> #!/usr/local/bin/perl -w
> use DBI;
> use MARC::Batch;
> use strict;
> #BEGIN {
> #$ENV{NLS_LANG} =  ...;
> #}
> my $dbh = DBI->connect(...) || die 1;
> $dbh->{LongReadLen} = 9;
> $dbh->{LongTruncOk} = 0;
> my $sql = q( select GetBibBlob(bib_id) from bib_master where rownum <= 3 );
> my $sth = $dbh->prepare($sql) || die 2;
> my $rv  = $sth->execute() || die 3;
> # add some magic:
> tie(*MARC, 'dbfile', $sth);
> # pass the virtual filehandle to MARC::Batch
> my $batch = MARC::Batch->new('USMARC', *MARC );
> $batch->strict_off;
> # read as usual
> while ( my $marc = $batch->next ) {
> print $marc->as_formatted(), "\n\n";
> }
> 
> #---
> package dbfile;
> use strict;
> sub TIEHANDLE {
> my ($class, $sth) = @_;
> my $i = { 'sth' => $sth,
>   'eof' => 0, };
> bless $i, $class;
> }
> sub READLINE {
> my ($marc) = $_[0]->{sth}->fetchrow_array() ;
> if (defined $marc) {
> my $len = substr($marc,0,5);
> return substr($marc,0,$len);
> }
> else {
> $_[0]->{'eof'} = 1;
> return undef;
> }
> }
> sub EOF {
> # eof()
> $_[0]->{'eof'};
> }
> sub FILENO {1}
> sub BINMODE {1}
> sub CLOSE {1}
> sub DESTROY {1}
> __END__
> 
> 
> That's all folks,
> 
> /Leif
> Leif Andersson, Systems Librarian
> Stockholm University Library
> 
> 
> From: Doran, Michael D [do...@uta.edu]
> Sent: 7 January 2011 15:11
> To: Leif Andersson; 'Jon Gorman'; perl4lib
> Subject: RE: MARC blob to MARC::Record object
> 
> Hi Leif and Jon,
> 
> > use MARC::Record;
> > ...
> > my $record = MARC::Record->new_from_usmarc( $blob );
> 
> This works!
> 
> > From: Jon Gorman [mailto:jonathan.gor...@gmail.com]
> > Sent: Friday, January 07, 2011 7:51 AM
> > You'll probably think of this when you get up, but did you make sure
> > to import the package? ie use MARC::FILE::USMARC;?
> 
> This made the other way work, too! (I had only "use MARC::File")
> 
> Much thanks to Leif and Jon.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/
> 
> > -Original Message-
> > From: Leif Andersson [mailto:leif.anders...@sub.su.se]
> > Sent: Friday, January 07, 2011 7:50 AM
> > To: Doran, Michael D; perl4lib
> > Subject: Re: MARC blob to MARC::Record object
> >
> > Hi Michael,
> >
> > this is how I - in principle - usually do it:
> >
> > use MARC::Record;
> > ...
> > my $record = MARC::Record->new_from_usmarc( $blob );
> >
> > /Leif
> >
> > Leif Andersson, Systems librarian
> > Stockholm University Library
> > 
> > From: Doran, Michael D [do...@uta.edu]
> > Sent: 7 January 2011 00:18
> > To: perl4lib
> > Subject: MARC blob to MARC::Record object
> >
> > I am working on a Perl script that retrieves data from our Voyager ILS via
> an
> > SQL query.  Among other data, I have MARC records in blob form, and the
> script
> > processes one MARC record at a time.  I want to be able to parse and
> > modify/convert the MARC record (using MARC::Record) before writing/printing
> > data to a file.
> >
> > How do I make the MARC blob into a MARC::Record object (without having to
> > first save it to a file and read it in with MARC::File/Batch)?  The MARC blob
> is
> > already in a variable, so it doesn't make sense (to me) to write it out to a
> > file just so I can read it back in.  Unless I have to, natch.
> >
> > I apologize if I am missing something obvious.
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # do...@uta.edu
> > # http://rocky.uta.edu/doran/


RE: MARC blob to MARC::Record object

2011-01-07 Thread Doran, Michael D
Hi Leif and Jon,

> use MARC::Record;
> ...
> my $record = MARC::Record->new_from_usmarc( $blob );

This works!

> From: Jon Gorman [mailto:jonathan.gor...@gmail.com]
> Sent: Friday, January 07, 2011 7:51 AM
> You'll probably think of this when you get up, but did you make sure
> to import the package? ie use MARC::FILE::USMARC;?

This made the other way work, too! (I had only "use MARC::File")

Much thanks to Leif and Jon.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 
> -Original Message-
> From: Leif Andersson [mailto:leif.anders...@sub.su.se]
> Sent: Friday, January 07, 2011 7:50 AM
> To: Doran, Michael D; perl4lib
> Subject: Re: MARC blob to MARC::Record object
> 
> Hi Michael,
> 
> this is how I - in principle - usually do it:
> 
> use MARC::Record;
> ...
> my $record = MARC::Record->new_from_usmarc( $blob );
> 
> /Leif
> 
> Leif Andersson, Systems librarian
> Stockholm University Library
> 
> From: Doran, Michael D [do...@uta.edu]
> Sent: 7 January 2011 00:18
> To: perl4lib
> Subject: MARC blob to MARC::Record object
> 
> I am working on a Perl script that retrieves data from our Voyager ILS via an
> SQL query.  Among other data, I have MARC records in blob form, and the script
> processes one MARC record at a time.  I want to be able to parse and
> modify/convert the MARC record (using MARC::Record) before writing/printing
> data to a file.
> 
> How do I make the MARC blob into a MARC::Record object (without having to
> first save it to a file and read it in with MARC::File/Batch)?  The MARC blob is
> already in a variable, so it doesn't make sense (to me) to write it out to a
> file just so I can read it back in.  Unless I have to, natch.
> 
> I apologize if I am missing something obvious.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/


RE: MARC blob to MARC::Record object

2011-01-06 Thread Doran, Michael D
Hi Jon,

> You should be able to create a MARC::Record object via MARC::File::USMARC from
> string (see decode). 

>   my $MARC = MARC::File::USMARC->decode($rawMARC);

This looks like what I need.

However, (arggh) now I am getting this error message when I try to use that 
line of code:

  "Can't locate object method "decode" via package "MARC::File::USMARC" at 
spco-export.pl line 404."

I'm getting tired and punchy, so will take it up again tomorrow.

> If however you mean by the MARC blob you don't have a "complete" record but
> part, I'm less sure how to do that.

Yes, the MARC blob is a complete record.

Thanks for the help!

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Gorman, Jon [mailto:jtgor...@illinois.edu]
> Sent: Thursday, January 06, 2011 6:19 PM
> To: Doran, Michael D; perl4lib
> Subject: RE: MARC blob to MARC::Record object
> 
> 
> 
> > How do I make the MARC blob into a MARC::Record object (without having
> > to first save it to a file and read it in with MARC::File/Batch)?  The
> > MARC blob is already in a variable, so it doesn't make sense (to me) to
> > write it out to a file just so I can read it back in.  Unless I have
> > to, natch.
> >
> 
> You should be able to create a MARC::Record object via MARC::File::USMARC from
> string (see decode). An example below
> http://search.cpan.org/dist/MARC-Record/lib/MARC/File/USMARC.pm
> 
> 
> So with per/Voyager I know I've done something like:
> 
> Had a query ...
> SELECT BIB_ID, record_segment, seqnum
> FROM UIUDB.BIB_DATA
> WHERE BIB_ID = ?
> ORDER BY seqnum ASC;
> 
> I prepare/execute that query with something like:
> 
>   $getBibRecordH->execute($row->{'BIBID'}) or $logger->logdie("Could not
> execute query to get Bib Record");
>   while (my ($rec_id, $recseg, $seqnum) = $getBibRecordH->fetchrow_array) {
> $rawMARC .= $recseg ;
>   }
> 
> 
> 
>   my $MARC = MARC::File::USMARC->decode($rawMARC);
> 
> 
> There's also some nice ways if you're only using certain parts of the record
> to apply a filter,  although I don't have any examples off-hand.
> 
> If however you mean by the MARC blob you don't have a "complete" record but
> part, I'm less sure how to do that.  I'd just pull in the entire file.
> 
> Jon Gorman
> University of Illinois
> 
> 
> 
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # do...@uta.edu
> > # http://rocky.uta.edu/doran/



MARC blob to MARC::Record object

2011-01-06 Thread Doran, Michael D
I am working on a Perl script that retrieves data from our Voyager ILS via an 
SQL query.  Among other data, I have MARC records in blob form, and the script 
processes one MARC record at a time.  I want to be able to parse and 
modify/convert the MARC record (using MARC::Record) before writing/printing 
data to a file.

How do I make the MARC blob into a MARC::Record object (without having to first 
save it to a file and read it in with MARC::File/Batch)?  The MARC blob is already 
in a variable, so it doesn't make sense (to me) to write it out to a file just 
so I can read it back in.  Unless I have to, natch.

I apologize if I am missing something obvious.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/



RE: Regular Expression for non-Roman characters

2008-09-25 Thread Doran, Michael D
Hi Jane,

In a MARC-8 character set environment, I would assume that the key to detecting 
non-Latin characters would be the presence of an escape sequence to indicate a 
switch to an alternate character set (e.g. Arabic, Greek, Cyrillic, etc) [1].  
Everything from that point on would be non-Latin until there was an escape 
sequence back to Latin.

In a MARC Unicode character set environment, if you are using Perl for your 
regular expression matching, you can probably take advantage of the Unicode 
\p{} constructs [2].  Something along the lines of...

\P{Latin}

...which matches any character that does not belong to the Latin script 
(lowercase 'p' = belongs to, uppercase 'P' = does not belong to).

For more info on the regular expression Unicode scripts/blocks see this 
tutorial:
http://www.regular-expressions.info/unicode.html

I'll point out that when I've used Unicode \p{} constructs in a program, it was 
necessary to explicitly label strings as being Unicode (assuming they are, 
natch) before regex matching, using...

decode('UTF-8',$string_tobe_matched);

I know that's not exactly what you asked for, but (assuming I didn't 
misunderstand your question) it may suggest some approaches should you end up 
tackling it yourself.

-- Michael

[1] MARC 21 Specification > ACCESSING ALTERNATE GRAPHIC CHARACTER SETS
http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative

[2] Perl > Unicode Regular Expression Support Level

http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level
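
One wrinkle worth noting with \P{Latin}: spaces, digits, and punctuation belong to the script "Common", so they also match \P{Latin}. For "does this string contain letters from some other script", a character class excluding both Latin and Common is usually closer to the intent. A sketch (\x{...} literals are already decoded characters, so no explicit decode() is needed here):

```perl
use strict;
use warnings;

sub has_non_latin {
    my ($s) = @_;
    # Latin letters plus script-Common characters (digits, spaces,
    # punctuation) are allowed; anything else is flagged.
    return $s =~ /[^\p{Latin}\p{Common}]/ ? 1 : 0;
}

print has_non_latin("Jacobs, Jane W."), "\n";         # 0
print has_non_latin("r\x{e9}sum\x{e9} 123"), "\n";    # 0 (Latin + Common)
print has_non_latin("\x{3b1}\x{3b2}\x{3b3}"), "\n";   # 1 (Greek)
```
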

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
  

> -Original Message-
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, September 25, 2008 1:24 PM
> To: perl4lib@perl.org
> Subject: Regular Expression for non-Roman characters
> 
> Hi folks,
> 
> I'm wondering if anyone has codified a regular expression that would
> indicate the presence of non-Latin characters.  I want to detect the
> presence of non-Roman letters in authority records.  Currently
> Authorities with non-Roman forms of name place these in the 
> 4XX fields.
> Our system can't handle that so I want to flip them to 5XX 
> and possibly
> add a subfield to note what they are, but first I need something to
> detect them.
> 
> I had in mind something like \xE0-\xFE which detects 
> diacritics nicely.
> I'd prefer not to figure it out for myself if someone else has already
> done it.
> Thanks in advance.
> JJ 
> 
> **Views expressed by the author do not necessarily represent those of
> the Queens Library.**
> 
> Jane Jacobs
> Asst. Coord., Catalog Division
> Queens Borough Public Library
> 89-11 Merrick Blvd.
> Jamaica, NY 11432
> tel.: (718) 990-0804
> e-mail: [EMAIL PROTECTED]
> FAX. (718) 990-8566
> 
> 


RE: Biblio::Isis and character encoding

2008-07-14 Thread Doran, Michael D
Hi Emmanuel,

> I'm trying to convert an ISIS database to MARC21

What is the character set encoding of the data in the ISIS database?

What is the desired character set encoding for the MARC21 records? I.e. MARC-8 
or MARC Unicode(UTF-8)?

If they are dissimilar character encodings, is the data undergoing a character 
set conversion?

-- Michael
 

> -Original Message-
> From: Emmanuel Di Pretoro [mailto:[EMAIL PROTECTED]
> Sent: Monday, July 14, 2008 2:15 AM
> To: perl4lib@perl.org
> Subject: Biblio::Isis and character encoding
> 
> Hi,
> 
> Currently I'm trying to convert an ISIS database to MARC21. So I use
> Biblio::Isis and MARC::Record to do that. No problem with this conversion,
> except for some weird character encoding problems. Some bibliographic
> records are written in French, and accented characters like 'é' are
> displayed as '<82>'.
> 
> I've tried to use some Encode::* modules (Encode, Encode::Guess,
> Encode::Detect, Encode::First), but without success.
> 
> Has anybody else had this kind of problem? Is there a solution?
> 
> Thanks in advance.
> 
> Regards,
> 
> Emmanuel Di Pretoro


RE: Problem installing MARC::Record 2.0.0 under perl 5.8.0

2008-07-08 Thread Doran, Michael D
Hi Chris,

> I'll try that version.

I sure hope you meant upgrading to Perl 5.8.2 (or higher) rather than 
downgrading to MARC::Record 1.39_02.  ;-)

This is just my unasked-for two cents, but I wouldn't stint on anything that 
will make the processing of Unicode-encoded text easier.  Last December seemed 
to mark a tipping point for Unicode, both on the internet:

  "Just last December [2007] there was an interesting milestone
  on the web. For the first time, we found that Unicode was the
  most frequent encoding found on web pages, overtaking both
  ASCII and Western European encodings" [1]  

...as well as for its use in MARC records:

  "To facilitate the movement of records between MARC-8 and Unicode
  environments, it was recommended for an initial period that the use of
  Unicode be restricted to a repertoire identical in extent to the MARC-8
  repertoire. [...] however, such a restriction is no longer appropriate.
  The full UCS repertoire, as currently defined at the Unicode web site,
  is valid for encoding MARC 21 records subject only to the constraints
  described [in the current MARC 21 Specifications]." [2]

-- Michael

[1] The Official Google Blog: "Moving to Unicode 5.1"
http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

[2] MARC 21 Specifications: Unicode Encoding Environment
   (revised December 2007)
http://www.loc.gov/marc/specifications/speccharucs.html


> -Original Message-
> From: Christopher Morgan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, July 08, 2008 2:12 PM
> To: 'Bryan Baldus'; perl4lib@perl.org
> Subject: RE: Problem installing MARC::Record 2.0.0 under perl 5.8.0
> 
> Brian,
> 
> Thanks very much. I'll try that version.
> 
> - Chris
> 
> 
> -Original Message-
> From: Bryan Baldus [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, July 08, 2008 2:31 PM
> To: Christopher Morgan; perl4lib@perl.org
> Subject: RE: Problem installing MARC::Record 2.0.0 under perl 5.8.0
> 
>  On Tuesday, July 08, 2008 12:35 PM, Christopher Morgan wrote:
> >I am in the process of rebuilding my web site after a phishing site
> >break-in (yikes!). The site is fine now, and secure, but for some
> >reason I can't get MARC::Record-2.0.0 to install. I get an error
> >message saying that perl 5.8.2 is required, but that I only have perl
> >5.8.0. (And indeed I do have perl 5.8.0.) But I'm pretty sure this
> >version of MARC::Record *did* install under perl 5.8.0 the last time
> >I tried.<
> 
> MARC::Record 1.39_02 appears to be the latest version on CPAN that would
> work on 5.8.0. MARC::Record 2.x is incompatible with pre-5.8.2 versions of
> Perl due to Unicode-related changes. The change was announced in a
> Perl4Lib
> message "MARC::Record v2.0 RC1", sent Fri 5/20/2005 2:35 PM, by Ed
> Summers.
> [1]
> 
> [1] 
> 
> I hope this helps,
> 
> Bryan Baldus
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> http://home.inwave.com/eija



RE: Stripping out Unicode combining characters (diacritics) -

2008-05-07 Thread Doran, Michael D
I received a number of helpful suggestions and solutions.  The approach I 
decided to adopt in my larger script is to 'decode' all the incoming form input 
as UTF-8 as well as the input from the database that I'll be matching the form 
input against.  This seems to allow the '\p{M}' syntax to work as expected in a 
Perl regexp.  In my test.cgi script for form input it would look like this:

#!/usr/local/bin/perl
use strict;
use CGI;
use Encode;
my $query = CGI::new();
my $search_term = decode('UTF-8',$query->param('text'));
my $sans_diacritics  = $search_term;
$sans_diacritics =~ s/\pM*//g;
print qq(Content-type: text/plain; charset=utf-8

search_term is $search_term
sans_diacritics is $sans_diacritics
);
exit(0);

I'm slowly figuring out how to work with Unicode in my web scripts, but still 
have a lot to learn.  Thanks for all the help. :-)
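As a hedged aside (not part of the original message): combining the decode step with Unicode::Normalize, as suggested elsewhere in this thread, also handles input that arrives with precomposed accented characters, by decomposing them before the marks are stripped:

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# "Bartók" as UTF-8 bytes with a precomposed o-acute (U+00F3)
my $bytes = "Bart\xC3\xB3k";

my $text = decode('UTF-8', $bytes);    # bytes -> character string
my $sans_diacritics = NFD($text);      # o-acute -> "o" + combining acute
$sans_diacritics =~ s/\p{M}+//g;       # strip the combining marks

print encode('UTF-8', $sans_diacritics), "\n";   # prints "Bartok"
```

With NFD in the pipeline it no longer matters whether the form sends precomposed or decomposed accents; both reduce to the same bare base characters.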

-- Michael


> -Original Message-
> From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 05, 2008 7:27 PM
> To: [EMAIL PROTECTED]
> Cc: Perl4lib
> Subject: Stripping out Unicode combining characters (diacritics)
> 
> I'm trying to strip out combining diacritics from some form 
> input using this code:
> 
> 
>   
>   
> 
> 
>   
> 
> 
> 
> #!/usr/local/bin/perl
> use CGI;
> $query = CGI::new();
> $search_term = $query->param('text');
> $sans_diacritics  = $search_term;
> $sans_diacritics  =~ s/\p{M}*//g;
> #$sans_diacritics  =~ s/o//g;
> print qq(Content-type: text/plain; charset=utf-8
> 
> $sans_diacritics
> );
> exit(0);
> 
> 
> In the form, I'm inputting the string "Bartók" with the 
> accented character being a base character (small Latin letter 
> "o") followed by a combining acute accent.  However, when I 
> print (to the web) $sans_diacritics, I get my input with no 
> change -- the combining diacritic is still there.  I know 
> that my input is not a precomposed accented character, 
> because I can strip out the base "o" and the combining accent 
> either stands alone or jumps to another character [2].
> 
> The "\p{M}" is a Unicode class name for the character class 
> of Unicode 'marks', for example accent marks [1].  I've tried 
> these variations (and many others) and none seem to be doing 
> what I want:
> 
>$sans_diacritics =~ s#[\p{Mark}]*##g;
>$sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
>$sans_diacritics =~ tr#[\p{M}]##;
>$sans_diacritics =~ s/\p{M}*//g;
>$sans_diacritics =~ s#[\p{M}]##g;
>$sans_diacritics =~ s#\x{0301}##g;
>$sans_diacritics =~ s#\x{006F}\x{0301}##g;
>$sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
> 
> I'm pulling my hair out on this... so any help would be 
> appreciated.  If there's any other info I can provide, let me know.
> 
> My Perl version is 5.8.8 and the script is running on a 
> server running Solaris 9.
> 
> -- Michael
> 
> [1] per http://perldoc.perl.org/perlretut.html and other documentation
> 
> [2] using $sans_diacritics  =~ s/o//g;
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 


RE: Stripping out Unicode combining characters (diacritics)

2008-05-06 Thread Doran, Michael D
Hi Leif,

> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;

That works!  I will gladly buy the beers, Leif, should we ever meet in person.
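One caveat worth noting (my addition, not from the thread): Encode::_utf8_on is an internal function that flips Perl's UTF8 flag without validating the bytes. The documented equivalent for input that really is UTF-8 is decode():

```perl
use strict;
use warnings;
use Encode qw(decode);

# $str holds raw UTF-8 bytes, e.g. CGI form input:
# "o" followed by the UTF-8 encoding of U+0301 (combining acute accent)
my $str = "Barto\xCC\x81k";

# decode() validates the bytes and returns a character string,
# so \pM now sees real code points instead of raw bytes
$str = decode('UTF-8', $str);
$str =~ s/\pM+//g;

print "$str\n";   # prints "Bartok"
```

Unlike _utf8_on, decode() will also flag malformed input instead of silently producing a string that misbehaves later.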

> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?

No, I do not run my CGI scripts in tainted mode (although I realize that I 
probably should).  

Thanks (once again) for your help.

-- Michael


> -Original Message-
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 06, 2008 3:33 AM
> To: Doran, Michael D
> Subject: Re: Stripping out Unicode combining characters (diacritics)
> 
> Oh, now I see your REAL question.
> 
> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;
> 
> You are not the only one having problems with Unicode.
> Esp. in web programming it can be very confusing.
> 
> I am quite surprised that there are not more discussions of this kind.
> Not even in the "official" channels.
> 
> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?
> 
> I had all my scripts set up that way. Before Unicode.
> But basic Unicode stuff became broken with -T enabled.
> Have they fixed that now?
> I have at least seen no mentioning of it.
> 
> And screen scraping. If you want to mess around with 
> javascript embedded in an HTML page, you may find that the 
> content encoding is mixed. And Perl gets very confused 
> getting mixed character encodings.
> And so do I.
> 
> You may also have to deal with mixed encodings doing SQL 
> against the Voyager database.
> 
> What would we do if we could not fall back on "use bytes"
> every now and then! ;-)
> 
> Leif
> 
> ==
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> 
> -Ursprungligt meddelande-
> Från: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Skickat: den 6 maj 2008 04:13
> Till: Mike Rylander
> Kopia: [EMAIL PROTECTED]; Perl4lib
> Ämne: RE: Stripping out Unicode combining characters (diacritics)
> 
> Hi Mike,
> 
> I appreciate the quick reply.  I am familiar with the 
> Unicode::Normalize module (and will also be using that), but 
> I left it out of this question because it's not relevant to 
> the problem I'm currently trying to solve.  The text I'm 
> trying to strip diacritics out of does not have precomposed 
> accented characters.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 
> 
> 
> -Original Message-
> From: Mike Rylander [mailto:[EMAIL PROTECTED]
> Sent: Mon 5/5/2008 8:52 PM
> To: Doran, Michael D
> Cc: [EMAIL PROTECTED]; Perl4lib
> Subject: Re: Stripping out Unicode combining characters (diacritics)
>  
> On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> >  I'm pulling my hair out on this... so any help would be 
> appreciated.  If there's any other info I can provide, let me know.
> >
> 
> You'll want to transform the text to NFD format (nominally, 
> base characters plus combining marks) instead of NFC (precombined
> characters) using Unicode::Normalize:
> 
>  use Unicode::Normalize;
> 
>  my $text = NFD($original);
>  $text =~ s/\pM+//go;
> 
> Hope that helps.
> 
> --
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts  | phone: 
> 1-877-OPEN-ILS (673-6457)  | email: [EMAIL PROTECTED]  | 
> web: http://www.esilibrary.com
> 
> 


RE: Importing Perl package variables into a Perl script with "require"

2008-05-05 Thread Doran, Michael D
Hi Mike,

> You can use UNIVERSAL::require to do this:

Hmmm.  This is interesting.  I wasn't familiar with this module, so will be 
giving it a look.

Thanks!

-- Michael




-Original Message-
From: Mike Rylander [mailto:[EMAIL PROTECTED]
Sent: Mon 5/5/2008 8:57 PM
To: Doran, Michael D
Cc: Perl4lib
Subject: Re: Importing Perl package variables into a Perl script with "require"
 
On Fri, Apr 25, 2008 at 8:46 PM, Doran, Michael D <[EMAIL PROTECTED]> wrote:
> Back-story:
>
>  I have a Perl CGI program.  The CGI program needs to utilize variables in 
> one of several separate configuration files (packages).  The different 
> packages all contain the same variables, but with different values for those 
> variables.  Each package represents a different language for a multilingual 
> interface for the CGI program.  e.g. English.pm, French.pm, Spanish.pm.
>
>  The CGI program can't determine which language package is needed until it 
> parses the form input and does a test based on the value for a query string 
> name/value pair.  Based on the test, I assign a value to a package load 
> command (i.e. 'use English' or 'require English').  Because of that, I can't 
> load the package with "use Package" since "use" runs at compile time, before 
> I can assign a value to it.
>

You can use UNIVERSAL::require to do this:


use UNIVERSAL::require;
use CGI;

my $cgi = new CGI();
my $package = 'Language::' . $cgi->param('lang'); # say, 'English'
$package->use;

my $thing = $package->new();


Hope that helps. FWIW, we use U::r pretty heavily inside Evergreen.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com




RE: Stripping out Unicode combining characters (diacritics)

2008-05-05 Thread Doran, Michael D
Hi Mike,

I appreciate the quick reply.  I am familiar with the Unicode::Normalize module 
(and will also be using that), but I left it out of this question because it's 
not relevant to the problem I'm currently trying to solve.  The text I'm trying 
to strip diacritics out of does not have precomposed accented characters.

-- Michael




-Original Message-
From: Mike Rylander [mailto:[EMAIL PROTECTED]
Sent: Mon 5/5/2008 8:52 PM
To: Doran, Michael D
Cc: [EMAIL PROTECTED]; Perl4lib
Subject: Re: Stripping out Unicode combining characters (diacritics)
 
On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D <[EMAIL PROTECTED]> wrote:
[snip]
>
>  I'm pulling my hair out on this... so any help would be appreciated.  If 
> there's any other info I can provide, let me know.
>

You'll want to transform the text to NFD format (nominally, base
characters plus combining marks) instead of NFC (precombined
characters) using Unicode::Normalize:

 use Unicode::Normalize;

 my $text = NFD($original);
 $text =~ s/\pM+//go;

Hope that helps.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com



Stripping out Unicode combining characters (diacritics)

2008-05-05 Thread Doran, Michael D
I'm trying to strip out combining diacritics from some form input using this 
code:





  


  



#!/usr/local/bin/perl
use CGI;
$query = CGI::new();
$search_term = $query->param('text');
$sans_diacritics  = $search_term;
$sans_diacritics  =~ s/\p{M}*//g;
#$sans_diacritics  =~ s/o//g;
print qq(Content-type: text/plain; charset=utf-8

$sans_diacritics
);
exit(0);


In the form, I'm inputting the string "Bartók" with the accented character 
being a base character (small Latin letter "o") followed by a combining acute 
accent.  However, when I print (to the web) $sans_diacritics, I get my input 
with no change -- the combining diacritic is still there.  I know that my input 
is not a precomposed accented character, because I can strip out the base "o" 
and the combining accent either stands alone or jumps to another character [2].

The "\p{M}" is a Unicode class name for the character class of Unicode 'marks', 
for example accent marks [1].  I've tried these variations (and many others) 
and none seem to be doing what I want:

   $sans_diacritics =~ s#[\p{Mark}]*##g;
   $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
   $sans_diacritics =~ tr#[\p{M}]##;
   $sans_diacritics =~ s/\p{M}*//g;
   $sans_diacritics =~ s#[\p{M}]##g;
   $sans_diacritics =~ s#\x{0301}##g;
   $sans_diacritics =~ s#\x{006F}\x{0301}##g;
   $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;

I'm pulling my hair out on this... so any help would be appreciated.  If 
there's any other info I can provide, let me know.

My Perl version is 5.8.8 and the script is running on a server running Solaris 
9.

-- Michael

[1] per http://perldoc.perl.org/perlretut.html and other documentation

[2] using $sans_diacritics  =~ s/o//g;



RE: Importing Perl package variables into a Perl script with "require"

2008-04-27 Thread Doran, Michael D
Hi Leif,

> A variation on this idea would be to define a certain scope, 
> instead of main:: for the multilingual strings.
> $LANG::this = ...
> $LANG::that = ...

I hadn't thought of that, but your suggestion looks to be a good approach.  It 
looks like if I declare each language module, regardless of the language (or 
file name), to be "package Lang;" it will allow me to use the 
"$Lang::" syntax in the main program.

Like usual, you have provided useful and helpful advice.  Thank you.

-- Michael


> -Original Message-
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, April 27, 2008 3:20 PM
> To: Doran, Michael D; Perl4lib
> Subject: Re: Importing Perl package variables into a Perl 
> script with "require"
> 
> Michael,
> 
> Interesting question.
> 
> There are probably several approaches to implementing 
> multilingual support.
> I hope you get many responses to this question. I would 
> really like to see what others have to say.
> 
> From me - just a few simple comments:
> 
> I see two problems.
> 1) to load a particular file, unknown at compile time.
> 2) to "export" a variable into the main program
> 
> The first you have solved with require ("do" would also be fine).
> But you don't have to load a package, you can just load a file.
> 
> The second: define your variable(s) with "our".
> 
> #!/usr/bin/perl -wl
> use strict;
> 
> our %msg;
> my $lang = $ARGV[0] ||'en';
> 
> require "Mylang_$lang.pl";
> 
> print $msg{alert};
> print $msg{success};
> __END__
> 
> And Mylang_en.pl and friends looks like
> use strict;
> %main::msg = (
> alert   => 'ALERT',
> success => 'OK',
> );
> __END__
> 
> 
> A variation on this idea would be to define a certain scope, 
> instead of main:: for the multilingual strings.
> $LANG::this = ...
> $LANG::that = ...
> 
> But is it actually anything wrong with your original idea of 
> using no strict 'refs' for that part of the code building up 
> the loading of a module?
> 
> I don't know.
> 
> Doing a little google-ing a found this article quite interesting:
> http://www.drdobbsjournal.net/web-development/184416008
> 
> It is from The Perl Journal (May 2003), Autrijus Tang writing 
> about Web Localization and Perl. In that article he is 
> demonstrating the use of Locale::Maketext::Lexicon
> 
> Probably too demanding for smaller projects, but still interesting.
> 
> There is always a balance between how much work to spend on 
> solving a problem, how flexible and scalable it has to be, if 
> we like avoiding installing extra modules etc etc...
> 
> Leif
> 
> ==
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> -Ursprungligt meddelande-
> Från: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Skickat: den 26 april 2008 02:46
> Till: Perl4lib
> Ämne: Importing Perl package variables into a Perl script 
> with "require"
> 
> Back-story:
> 
> I have a Perl CGI program.  The CGI program needs to utilize 
> variables in one of several separate configuration files 
> (packages).  The different packages all contain the same 
> variables, but with different values for those variables.  
> Each package represents a different language for a 
> multilingual interface for the CGI program.  e.g. English.pm, 
> French.pm, Spanish.pm.
> 
> The CGI program can't determine which language package is 
> needed until it parses the form input and does a test based 
> on the value for a query string name/value pair.  Based on 
> the test, I assign a value to a package load command (i.e. 
> 'use English' or 'require English').  Because of that, I 
> can't load the package with "use Package" since "use" runs at 
> compile time, before I can assign a value to it.
> 
> So I am using the 'require' command, which executes at 
> runtime.  All's good so far.
> 
> Problem:
> 
> When loading a module with 'require' AND ALSO using 'use 
> strict' I can't figure out how to utilize those package 
> variables *without* also using an explicit package name (e.g. 
> $English::button_label).  Doing so is pretty straightforward 
> if I had been able to use 'use English', but it's n

Importing Perl package variables into a Perl script with "require"

2008-04-25 Thread Doran, Michael D
Back-story:

I have a Perl CGI program.  The CGI program needs to utilize variables in one 
of several separate configuration files (packages).  The different packages all 
contain the same variables, but with different values for those variables.  
Each package represents a different language for a multilingual interface for 
the CGI program.  e.g. English.pm, French.pm, Spanish.pm.

The CGI program can't determine which language package is needed until it 
parses the form input and does a test based on the value for a query string 
name/value pair.  Based on the test, I assign a value to a package load command 
(i.e. 'use English' or 'require English').  Because of that, I can't load the 
package with "use Package" since "use" runs at compile time, before I can 
assign a value to it.

So I am using the 'require' command, which executes at runtime.  All's good so 
far.

Problem:

When loading a module with 'require' AND ALSO using 'use strict' I can't figure 
out how to utilize those package variables *without* also using an explicit 
package name (e.g. $English::button_label).  Doing so is pretty straightforward 
if I had been able to use 'use English', but it's not straightforward (at least 
to me) on how to do it using 'require English' [1].

At issue again, is not knowing what the appropriate package name will be until 
runtime.  I can brute-force it within the CGI program along these lines 
(which also requires that I set "no strict 'refs'"):

my ($var_1,$var_2,$var_3,$var_4);
{   no strict 'refs';
$var_1 = ${$lang_package . "::" . "var_1"};
$var_2 = ${$lang_package . "::" . "var_2"};
$var_3 = ${$lang_package . "::" . "var_3"};
$var_4 = ${$lang_package . "::" . "var_4"};
}

Below is what I want to do, reduced to the minimum (i.e. I want test.cgi to 
print out the variable without specifying the package name):

test.cgi

#!/usr/local/bin/perl
use strict;
my $lang_package_file = "English.pm";
require $lang_package_file;
print "My package variable \$language value is $language" . "\n"; # This 
doesn't work [2]
exit(0);

English.pm
==
package English;
$language = "English";
1;

I know I can turn off "use strict" but I've tried to walk the straight and 
narrow with recent programming efforts, and I hate to start backsliding.  
There's got to be a better, more elegant way to do this and maintain 'use 
strict' in the main CGI program, I'm just not getting it at the moment...

-- Michael

[1] I'm referring to using this stuff in the package, which works great with 
'use English' but not so great (for me anyway) with 'require English'.

package English;
require Exporter;
@ISA= qw(Exporter);
@EXPORT = qw($language $etc);

[2] I get this when I run test.cgi: 

Global symbol "$language" requires explicit package name at test.cgi line 5.
Execution of test.cgi aborted due to compilation errors.
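For illustration only (this was not in the original message, and the file name and keys are hypothetical): one way to sidestep the symbolic references entirely is to have each language file return a hash reference as its last expression, so 'do' can hand the strings back under 'use strict':

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# For a self-contained demo, write a hypothetical language file first.
# The leading "+" makes Perl parse the braces as a hash reference
# rather than a bare block.
open my $fh, '>', '/tmp/English.pm' or die $!;
print $fh "+{ language => 'English', button_label => 'Search' };\n";
close $fh;

# At runtime, load whichever language file the form input selected.
# 'do' returns the file's last expression -- here, the hash reference --
# so the strings are usable under 'use strict' with no symbolic refs.
my $lang = 'English';
my $strings = do "/tmp/$lang.pm"
    or die "cannot load $lang.pm: " . ($@ || $!);

print "button label is $strings->{button_label}\n";
```

Because every language file returns the same hash shape, the main program never needs to know the package name at all, which is the root of the 'use strict' problem described above.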



RE: Help for utf-8 output - followup on Record Length

2008-03-03 Thread Doran, Michael D
> I was under the impression that the MARC record length in the 
> Leader was the record length in bytes rather than the number 
> of characters. 

According to this source, the Leader record length is in bytes:

  MARC Leader > record length = "Five numeric characters equal
  to the total number of bytes in the logical record" [1]

I also checked my charset mail folder and found this in a message from way back 
in 2003:

  "...there is some difficulty computing the record length properly,
  since MARC::Record uses character length, rather than byte length,
  which are the same thing when you are dealing with 8 bit characters."
   -- Ed Summers [2]

I looked through the MARC::Record CHANGES file [3].  Although there are some 
enhancements/fixes regarding the use of UTF-8, I don't see anything that 
explicitly states that more recent versions of MARC::Record now compute the 
record length in bytes.  It seems like that would be a good thing.
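The character/byte distinction is easy to demonstrate (a sketch of my own, not from the cited sources):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $title = "Bart\x{F3}k";   # a character string: 6 code points

my $char_count = length $title;                     # 6 characters
my $byte_count = length encode('UTF-8', $title);    # 7 bytes ("ó" takes two)

printf "characters=%d bytes=%d\n", $char_count, $byte_count;
# A leader built from length() on the character string would understate
# the record length by one byte here; positions 0-4 must hold the byte count.
```

Every non-ASCII character in a UTF-8 record widens this gap by at least one byte, which is why character-based lengths only agree with byte-based ones for pure-ASCII records.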

-- Michael

[1] MARC 21 Record Builder
http://www.loc.gov/marc/marc2onix.html

[2] "MARC-Charset-0.5 questions" July 2003 thread on perl4lib

[3] CHANGES : Revision history for Perl extension MARC::Record.
http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes


> -Original Message-
> From: Doran, Michael D 
> Sent: Monday, March 03, 2008 10:36 AM
> To: 'Leif Andersson'; perl4lib@perl.org
> Subject: RE: Help for utf-8 output
> 
> Hi Leif,
> 
> I really appreciate you taking a look at this and responding. 
>  Although I consider myself somewhat knowledgeable about 
> character sets, I still find these kinds of problems to be confusing.
> 
> > In this case the leader and actual length will not agree, 
> as your utf8 
> > characters have turned into latin1.
> 
> I was under the impression that the MARC record length in the 
> Leader was the record length in bytes rather than the number 
> of characters.  Is that your understanding?
> 
> Also, I am still troubleshooting my particular set of records 
> (I was out of town last week) since this problem only appears 
> to manifest itself for records with non-ASCII characters in 
> the 100 and 245 fields.  Records with a note field having 
> non-ASCII characters don't cause a problem. 
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>  
> 
> > -Original Message-
> > From: Leif Andersson [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, March 01, 2008 2:51 PM
> > To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> > Subject: Re: Help for utf-8 output
> > 
> > It seems there is a little bug (by design) kicking in.
> > 
> > The leader gets wrong and some characters get wrong in this case:
> >+ Reading a raw marc record (utf8) from file
> >+ Turning it into a MARC::Record object
> >+ Without modification writing it out to file.
> >  Yes. Even without modification the bug manifests itself!
> > 
> > Let's start with code simply copying one record from a file 
> utf8.mrc 
> > containing one or more marc records. This basic operation not 
> > involving MARC::Record  is OK.
> > 
> > #!perl -w
> > use strict;
> > #
> > open(IN, "utf8.mrc")  || die "1";
> > open(OUT, ">out_good.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> > my $marc = <IN>;
> > print OUT $marc;
> > __END__
> > 
> > Now, we're adding MARC::Record to the process, along with 
> some debug 
> > info.
> > Example code producing *faulty* record:
> > 
> > #!perl -w
> > use strict;
> > use MARC::Record;
> > use Devel::Peek;
> > #
> > open(IN, "utf8.mrc")  || die "1";
> > open(OUT, ">out_bad.mrc") || die "2";
> > binmode IN;
> > binmode OUT;
> > #
> > # Read in raw MARC
> > $/ = "\x1D";
> my $marc = <IN>;
> Dump($marc);  # the utf8-flag is not on
> my $obj  = MARC::Record->new_from_usmarc( $marc );
> # Convert back to raw MARC
> my $marc2 = $obj->as_usmarc();
> Dump($marc2);  # the utf8-flag IS on
> print OUT $marc2;
> __END__
> > 
> > 
> > In this case the leader and actual length will not agree, 
> as your utf8 
> > characters have turned into latin1.
> > The problem i

RE: Help for utf-8 output

2008-03-03 Thread Doran, Michael D
Hi Leif,

I really appreciate you taking a look at this and responding.  Although I 
consider myself somewhat knowledgeable about character sets, I still find these 
kinds of problems to be confusing.

> In this case the leader and actual length will not agree, as 
> your utf8 characters have turned into latin1.

I was under the impression that the MARC record length in the Leader was the 
record length in bytes rather than the number of characters.  Is that your 
understanding?

Also, I am still troubleshooting my particular set of records (I was out of 
town last week) since this problem only appears to manifest itself for records 
with non-ASCII characters in the 100 and 245 fields.  Records with a note field 
having non-ASCII characters don't cause a problem. 

-- Michael


> -Original Message-
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, March 01, 2008 2:51 PM
> To: Doran, Michael D; perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: Re: Help for utf-8 output
> 
> It seems there is a little bug (by design) kicking in.
> 
> The leader gets wrong and some characters get wrong in this case:
>+ Reading a raw marc record (utf8) from file
>+ Turning it into a MARC::Record object
>+ Without modification writing it out to file.
>  Yes. Even without modification the bug manifests itself!
> 
> Let's start with code simply copying one record from a file 
> utf8.mrc containing one or more marc records. This basic 
> operation not involving MARC::Record  is OK.
> 
> #!perl -w
> use strict;
> #
> open(IN, "utf8.mrc")  || die "1";
> open(OUT, ">out_good.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> print OUT $marc;
> __END__
> 
> Now, we're adding MARC::Record to the process, along with 
> some debug info.
> Example code producing *faulty* record:
> 
> #!perl -w
> use strict;
> use MARC::Record;
> use Devel::Peek;
> #
> open(IN, "utf8.mrc")  || die "1";
> open(OUT, ">out_bad.mrc") || die "2";
> binmode IN;
> binmode OUT;
> #
> # Read in raw MARC
> $/ = "\x1D";
> my $marc = <IN>;
> Dump($marc);  # the utf8-flag is not on
> my $obj  = MARC::Record->new_from_usmarc( $marc );
> # Convert back to raw MARC
> my $marc2 = $obj->as_usmarc();
> Dump($marc2);  # the utf8-flag IS on
> print OUT $marc2;
> __END__
> 
> 
> In this case the leader and actual length will not agree, as 
> your utf8 characters have turned into latin1.
> The problem is that $marc2 has the utf8 flag set internally by Perl.
> And the conversion on output is made in spite of binmode.
> 
> We can get around the problem by either (for instance) use bytes;
>   or
> Encode::_utf8_off($marc2);
> before printing to file.
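To make the workaround concrete, here is a self-contained sketch of what Encode::_utf8_off does to a flagged string (note that _utf8_off is an internal Encode API, so this is illustrative rather than recommended practice; the $marc2 string here merely stands in for what as_usmarc() returns):

```perl
use strict;
use warnings;
use Encode ();

# A character above 0xFF forces Perl to store the string with the
# internal utf8 flag on, just like the string Leif observed.
my $marc2 = "Dontsova \x{FE20}";
print utf8::is_utf8($marc2) ? "flag on\n" : "flag off\n";   # flag on
my $chars = length $marc2;   # 10 characters

Encode::_utf8_off($marc2);   # reinterpret the internal buffer as raw octets
my $bytes = length $marc2;   # 12 octets: U+FE20 is EF B8 A0 in UTF-8
printf "%d characters became %d octets\n", $chars, $bytes;
```

With the flag off, printing through a binmode'd filehandle emits the octets untouched, which is what you want for a MARC "blob".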
> 
> But shouldn't MARC::Record take care of this for us?
> A file of MARC records may contain records in different encodings.
> The text parts of a MARC record can be treated as made up by 
> certain encodings, but the "blob" itself, I suppose, should 
> be exposed to the caller as pure binary.
> 
> Are there any drawbacks in letting MARC::Record strip off any 
> eventual utf8 flag before returning the record as_usmarc() ?
> If not I suggest this change be made to a future release of 
> MARC::Record.
> 
> I shall also add that this character mess only sets in when doing IO.
> If you are updating your databases through one API or another 
> you are probably OK!
> 
> 
> Leif
> ==
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> -Ursprungligt meddelande-
> Från: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Skickat: den 21 februari 2008 18:49
> Till: perl4lib@perl.org
> Ämne: RE: Help for utf-8 output
> 
> Hi Jackie,
> 
> I'm working on a very similar problem... converting 
> theses/dissertations records (in XML) to MARC records.  I'm 
> still in the testing stage, but have had similar problems 
> with records with diacritics in the 100 or 245 fields 
> (however diacritics in a 520a field don't seem to cause any 
> problems).  Since our records are not "diacritic rich" it's 
> hard to determine the exact extent of the problem.
> 
> I am using these versions:
>   Perl v5.8.8
>   MARC::Charset 0.98
>   MARC::Lint 1.43
>   MARC::Record 2.0
>

RE: Help for utf-8 output

2008-02-21 Thread Doran, Michael D
Hi Brian,

Thanks for your response.

> I'd suggest you first make sure your XML is really UTF-8

I believe it is.  I used a hex editor to look at the XML source file and the 
character in question (the "Registered Sign") is encoded as hex "c2 ae" which 
is the proper UTF-8 encoding for that character [1].  There were other XML 
files processed with the same script that had non-ASCII characters (in the 520 
field where we are sticking the theses abstracts) and also verified as being 
UTF-8 encoded, and they did not seem to cause any errors.  The 520 field isn't 
processed any differently in my script (I'm double-checking, natch) so that's 
partly why I am confused.
 
> ...using JHOVE

I was not familiar with JHOVE, but looked it up and it sounds like a very 
useful tool [2].  I have downloaded it, and will be trying it out.

-- Michael

[1] FileFormat.Info > Unicode Character 'REGISTERED SIGN' (U+00AE)
http://www.fileformat.info/info/unicode/char/00ae/index.htm

[2] JHOVE - JSTOR/Harvard Object Validation Environment
http://hul.harvard.edu/jhove/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Brian Sheppard [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, February 21, 2008 1:00 PM
> To: Doran, Michael D
> Cc: perl4lib@perl.org
> Subject: Re: Help for utf-8 output
> 
> I'd suggest you first make sure your XML is really UTF-8, using JHOVE:
> 
>/path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m 
> utf8-hul myFile.xml
> 
> If it fails you could convert to utf8, on the (perhaps 
> unwarranted) assumption it's windows latin1:
> 
> iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml
> 
> Then, of course, test myFile.utf8.xml with jhove to see if it's valid.
> 
> -Brian
> 
> 
> On February 21, at 11:48 AM, Doran, Michael D wrote:
> 
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting theses/ 
> > dissertations records (in XML) to MARC records.  I'm still in the 
> > testing stage, but have had similar problems with records with 
> > diacritics in the 100 or 245 fields (however diacritics in a 520a 
> > field don't seem to cause any problems).  Since our records are not 
> > "diacritic rich" it's hard to determine the exact extent of the 
> > problem.
> >
> > I am using these versions:
> >   Perl v5.8.8
> >   MARC::Charset 0.98
> >   MARC::Lint 1.43
> >   MARC::Record 2.0
> >   XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the
> > 245 field):
> >
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037   4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In 
> > Japan /
> >_cRiho Yoshioka.
> >
> >  Recs  Errs Filename
> > - - 
> > 1 1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> >  Invalid record length in record 1: Leader says 00127 bytes but it's
> > actually 125
> >  Invalid length in directory for tag 245 in record 1
> >  field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded C2AE, and the bib Leader
> > (position 09=a) indicates that it is Unicode encoded.  I've attached
> > the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> >  Invalid record length in record 3: Leader says 00567 bytes but it's
> > actually 569
> >  field does not end in end of field character in tag 100 in record 3
> >  field does not end in end of field character in tag 245 in record 3
> >  Invalid indicators ".10" forced to blanks in record 3 for tag 245
> >
> >  field does not end in end of field character in tag 260 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 260
> >
> >  field does not end in end of field character in tag 300 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 300
> >
> >  field does not end in end of field character in tag 502 in record 3
> >  Invalid indicators ".  " forced to blanks in record 3 for tag 502
> >
> >  fi

RE: Help for utf-8 output

2008-02-21 Thread Doran, Michael D
Hi Jackie,

I'm working on a very similar problem... converting theses/dissertations 
records (in XML) to MARC records.  I'm still in the testing stage, but have had 
similar problems with records with diacritics in the 100 or 245 fields (however 
diacritics in a 520a field don't seem to cause any problems).  Since our 
records are not "diacritic rich" it's hard to determine the exact extent of the 
problem.

I am using these versions:
  Perl v5.8.8
  MARC::Charset 0.98
  MARC::Lint 1.43
  MARC::Record 2.0
  XML::LibXML 1.66

Here's an example "bad" record (which I have minimized to just the 245 field):

marcdump test.mrc
test.mrc
LDR 00127cam a2200037   4500
245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
   _cRiho Yoshioka.

 Recs  Errs Filename
- - 
1 1 test.mrc

When I run test.mrc through MARC::Lint, I get this message:

 Invalid record length in record 1: Leader says 00127 bytes but it's actually 
125
 Invalid length in directory for tag 245 in record 1
 field does not end in end of field character in tag 245 in record 1

When examined in vi the character in question, a Registered Sign, appears to be 
correctly UTF-8 encoded C2AE, and the bib Leader (position 09=a) indicates that 
it is Unicode encoded.  I've attached the MARC record.

I noticed that when I run your record (ck245.dat) through MARC::Lint, I get the 
same invalid record length message:

 Invalid record length in record 3: Leader says 00567 bytes but it's actually 
569
 field does not end in end of field character in tag 100 in record 3
 field does not end in end of field character in tag 245 in record 3
 Invalid indicators ".10" forced to blanks in record 3 for tag 245

 field does not end in end of field character in tag 260 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 260

 field does not end in end of field character in tag 300 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 300

 field does not end in end of field character in tag 502 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 502

 field does not end in end of field character in tag 504 in record 3
 Invalid indicators ".  " forced to blanks in record 3 for tag 504

 field does not end in end of field character in tag 690 in record 3
 Invalid indicators ". 4" forced to blanks in record 3 for tag 690

Anybody have any ideas?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Shieh, Jackie [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, February 19, 2008 10:50 AM
> To: perl4lib@perl.org
> Subject: Help for utf-8 output
> 
> I was wondering if anyone has similar experience and has come 
> up with good solutions to help solving the challenge below?!
> 
> What I have is an Excel spreadsheet for dissertations which I 
> have saved as a tab delimited file (examining the file in 
> TextPad, the diacritics appears to be fine), then read in and 
> output the file as a utf-8 MARC file. I   title field 
> confirming author field that contains diacritics with the 
> title showing proper indicator values. 
> 
> But when I looked the MARC itself, the fields that follow the 
> field containing diacritics are all off its original 
> position. See attached zip file.  Examples below: first two 
> have diacritics in a 100 field, last one diacritic is in 245 
> subfield b)
> 
> 001 diss 34001
> 100 1  _aPérez, Nancy L.
> 245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> 
> 001 diss 34042
> 100 1  _aValentín-Márquez, Wilfredo
> 245 _aDoing being boricua :
> 
> 001 diss 33892
> 100 1   _aDavis, Jennifer M.
> 245 14 _aThe Functional Complexities of Inherited Cardiac 
> Troponin I Mutations :
> _bIdentification of Ca+ Independent 
> Contractile Dysfunction.
> 
> I would greatly appreciate any suggestion to solve this. 
> Thank you most kindly. 
> 
> Regards, 
>  
> --Jackie 
>  
> |Jackie Shieh
> |Data Loads & Development
> |Harlan Hatcher Graduate Library
> |University of Michigan
> |920 North University
> |Ann Arbor, MI 48109-1205
> |Phone: 734.763.6070 FAX: 734.615.9788
> |E-mail: JShieh [AT] umich [DOT] edu
> 


test.mrc
Description: test.mrc


marcdump hex switch

2008-02-21 Thread Doran, Michael D
I have MARC::Record 2.0 installed [1].  According to the Changes file marcdump 
now has a "--hex" switch [2]:

  [ENHANCEMENTS]
  - Added --hex switch to marcdump, which dumps the record in
hexadecimal.  The offsets are in decimal so that you can match
them up to values in the leader.  The offset is reset to 0
when we're past the directory so that you can match up the data
with the offsets in the directory.

However I'm *not* finding that my marcdump actually has that hex switch:

  /usr/local/scripts/xml/marc => marcdump --hex test.mrc
  Unknown option: hex
  Usage: marcdump [options] file(s)

  Options:
--[no]print Print a MicroLIF-style dump of each record
--lif   Input files are MicroLIF, not USMARC
--field=specSpecify a field spec to include.  There may be many.
Examples:
--field=245 --field=1XX
--[no]quiet Print status messages
--[no]stats Print a statistical summary by file at the end
--version   Print version information
--help  Print this summary

I poked around the marcdump script and didn't find anything "hex":

my $opt_print = 1;
my $opt_quiet = 0;
my $opt_stats = 1;
my @opt_field = ();
my $opt_help = 0;
my $opt_lif = 0;

Any ideas/explanations?  The hex dump functionality would sure be handy.
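Until the switch turns up, a stopgap is easy to sketch: hex-dump the raw record with decimal offsets, resetting the offset to 0 after the directory terminator (0x1E) so the data offsets can be matched against the directory entries, as the Changes entry describes. This is my own sketch, not the marcdump implementation:

```perl
use strict;
use warnings;

# Dump each octet of a raw MARC blob as "offset: hex", restarting the
# offset count at the first field terminator (end of the directory).
sub hexdump_marc {
    my ($rec) = @_;
    my ($offset, $past_dir, @lines) = (0, 0);
    for my $byte (split //, $rec) {
        push @lines, sprintf '%5d: %02x', $offset++, ord $byte;
        if (!$past_dir and $byte eq "\x1E") { ($offset, $past_dir) = (0, 1) }
    }
    return @lines;
}

# Tiny illustrative blob, not a real record:
print "$_\n" for hexdump_marc("AB\x1ECD\x1D");
```

On a real file you would read records with `local $/ = "\x1D";` as in the earlier examples and feed each one to the function.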

-- Michael

[1] /usr/local/scripts/xml/marc => marcdump -v
/usr/local/bin/marcdump, using MARC::Record v2.0

[2] CPAN > Revision history for Perl extension MARC::Record
http://search.cpan.org/src/MIKERY/MARC-Record-2.0.0/Changes

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


RE: MARC::File::XML and parsing.

2007-09-27 Thread Doran, Michael D
Hi Henri,

> Is there a reason why MARC::File::XML considers only a very 
> strict subset of utf-8 as valid ?

I would guess that it has to do with adhering to the MARC-21 repertoire of 
characters, so as to facilitate the round-trip conversion between the MARC-8 
and Unicode character sets [1,2].  At some point in the future the MARC-21 
repertoire will be decoupled from what was defined for MARC-8.

> For instance no linebreak...

Control characters such as line breaks are a bit of a different issue.  The 
MARC-21 standard currently allows for only a handful of control characters, not 
including (as you have discovered) the line break [3].

> This could be a really BIG trouble for kanjis or hindu languages imho.

The MARC-21 repertoire of characters includes East Asian Ideographs (Han), 
Japanese Hiragana and Katakana, and Korean Hangul [4,5].  I don't believe that 
Indic scripts in the vernacular would be valid MARC-21 characters. 

Are you finding any cases where the Marc::File::XML parser is dropping valid 
MARC-21 characters?

-- Michael

[1] USMARC Character Set Issues and Mapping to Unicode/UCS
http://www.loc.gov/marc/marbi/1996/96-10.html

  WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM
  USMARC TO UNICODE/UCS

  The following Working Principles were established by the
  Subcommittee and continue to inform their mapping decisions:

   * Round-trip mapping will be provided between USMARC characters
 and Unicode/UCS characters wherever possible.

[2] MARC 21 Specifications > CHARACTER SETS: Part 2 UCS/Unicode Environment
http://www.loc.gov/marc/specifications/speccharucs.html
"The specifications are built around enabling round trip movement of MARC
 data between MARC-8 and UCS/Unicode with as little loss as possible."

[3]
  MARC-8  Unicode  Character
  --  ---  -
   0x1B   U+001B   ESCAPE
   0x1D   U+001D   RECORD TERMINATOR 
   0x1E   U+001E   FIELD TERMINATOR 
   0x1F   U+001F   SUBFIELD DELIMITER 
   0x88   U+0098   NON-SORT BEGIN 
   0x89   U+009C   NON-SORT END 
   0x8D   U+200D   JOINER 
   0x8E   U+200C   NON-JOINER 

[4] MARC 21 Specifications > CHARACTER SETS: Part 3 Code Tables
http://www.loc.gov/marc/specifications/specchartables.html

[5] MARC 21 Standard - UCS/Unicode Environment > Character Set Mappings 
http://rocky.uta.edu/doran/charsets/marcU.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, September 26, 2007 10:45 AM
> To: perl4lib
> Subject: MARC::File::XML and parsing.
> 
> hi,
> I have some problems with Marc::File::XML parser.
> 
> Take those two xml records.
> Despite the fact that I agree that there are odd characters 
> in some subfields.
> I am wondering why, since those characters are UTF8, 
> MARC::File::XML should drop them when parsing.
> Is there a reason why MARC::File::XML considers only a very 
> strict subset of utf-8 as valid ? (For instance no linebreak, 
> no ...) ?
> 
> Couldnot it  say "OK It is XML record, encoded UTF8, i take 
> it for granted and no matter if there are "odd" characters" ?
> This could be a really BIG trouble for kanjis or hindu languages imho.
> 
> 
> 
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>  xsi:schemaLocation="http://www.loc.gov/MARC21/slim 
> http://www.loc.gov/ standards/marcxml/schema/MARC21slim.xsd"
>  xmlns="http://www.loc.gov/MARC21/slim";>
> 
>  00150nx  a2200073   4500 
>  
>Nicolas
>Jérôme
>Traducteur
> 
>19980124afrey50  ba0
>  
>  3568   tag="152" ind1=" " ind2=" ">
>NP
>  
> 
> 
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>  xsi:schemaLocation="http://www.loc.gov/MARC21/slim 
> http://www.loc.gov/ standards/marcxml/schema/MARC21slim.xsd"
>  xmlns="http://www.loc.gov/MARC21/slim";>
> 
>  00151nx  a2200073   4500 
>  
>Guynemer
>Georges
>(1894-1917)
> 
>19980129afrey50  ba0
>  
>  4642   tag="152" ind1=" " ind2=" ">
>NP
>  
> 
> 
> --
> Henri Damien LAURENT et Paul POULAIN
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
> 
> 


RE: MARC::Charset 'utf8_to_marc8'

2007-09-18 Thread Doran, Michael D
Hi Laurence,

> I'm trying to create MARC records from serials data exported 
> from SFX, using  MARC::Charset version 0.98 to convert UTF-8 
> strings to MARC-8. It seems to be failing on extended latin 
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE

The encoding, U+00C5 (CAPITAL LETTER A WITH RING ABOVE), is a precomposed 
character [1].  While U+00C5 is a perfectly good Unicode encoding, I believe 
that it is still the recommended practice for Unicode-encoded MARC-21 records 
to use base and combining characters, and U+00C5 doesn't have a direct 
equivalent in the MARC-21 repertoire [2,3].

If the strings are first normalized using Unicode Normalization Form D, they 
should convert okay [4,5].  

> The records convert using Terry Reese's MarcEdit OK.

Perhaps MarcEdit incorporates the decomposition or has direct conversion of 
precomposed Unicode to decomposed MARC-8.

-- Michael 

[1] The decomposition (i.e. base and combining character) values for "CAPITAL 
LETTER A WITH RING ABOVE" would be U+0041 (LATIN CAPITAL LETTER A) followed by 
U+030A (COMBINING RING ABOVE).

[2] WORKING PRINCIPLES TO BE FOLLOWED IN MAPPING OF CHARACTERS FROM USMARC TO 
UNICODE/UCS

   * Accented letters ... will continue to be encoded as a base letter
 and non-spacing marks. Use of precomposed accented letters is not
 sanctioned at this stage.

From "USMARC Character Set Issues and Mapping to Unicode/UCS"
http://www.loc.gov/marc/marbi/1996/96-10.html 

[3] MARC 21 Specifications > CHARACTER SETS > Code Tables
http://www.loc.gov/marc/specifications/specchartables.html

[4] Preprocessing Requirements

... preprocessing of the Unicode record before the conversion to
MARC-8 takes place. In all of the above techniques, the following
steps for decomposing diacritics were presumed.

Decompose the precomposed base character/character modifier combinations
using Unicode Normalization Form D (NFD) which produces exact equivalents,
and primarily applies decomposition to precomposed characters with 
diacritics.

From "Technique for conversion of Unicode to MARC-8"
http://www.loc.gov/marc/marbi/2006/2006-04.html

[5] W3C > Charlint - A Character Normalization Tool
http://www.w3.org/International/charlint/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Laurence Lockton [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 18, 2007 5:21 AM
> To: perl4lib@perl.org
> Subject: MARC::Charset 'utf8_to_marc8'
> 
> Hi,
> 
> I'm trying to create MARC records from serials data exported 
> from SFX, using  MARC::Charset version 0.98 to convert UTF-8 
> strings to MARC-8. It seems to be failing on extended latin 
> characters like U+00C5 CAPITAL LETTER A WITH RING ABOVE, 
> giving "no mapping found at position 176" for example. 
> The records convert using Terry Reese's MarcEdit OK. Am I 
> doing the wrong thing? Any advice gratefully received.
> 
> Many thanks,
> Laurence Lockton
> University of Bath
> UK
> 


RE: MARC::Charset question

2007-05-18 Thread Doran, Michael D
Oops, this got mangled somehow...

> U+0044  LATIN CAPITAL LETTER D
> U+006F  LATIN SMALL LETTER O
> U+006E  LATIN SMALL LETTER N
> U+0074  LATIN SMALL LETTER T
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073  LATIN SMALL LETTER S
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F  
> U+LATIN SMALL LETTER O
> U+0076  LATIN SMALL LETTER V
> U+0061  LATIN SMALL LETTER A
> U+002C  COMMA
> U+0020  SPACE, BLANK / SPACE
> U+0044  LATIN CAPITAL LETTER D
> U+0061  LATIN SMALL LETTER A
> U+0072  LATIN SMALL LETTER R
> U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069  LATIN SMALL LETTER I
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061  LATIN SMALL LETTER A
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E  
> U+PERIOD, DECIMAL POINT / FULL STOP

And should have been this:

U+0044  LATIN CAPITAL LETTER D
U+006F  LATIN SMALL LETTER O
U+006E  LATIN SMALL LETTER N
U+0074  LATIN SMALL LETTER T
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0073  LATIN SMALL LETTER S
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+006F  LATIN SMALL LETTER O
U+0076  LATIN SMALL LETTER V
U+0061  LATIN SMALL LETTER A
U+002C  COMMA
U+0020  SPACE, BLANK / SPACE
U+0044  LATIN CAPITAL LETTER D
U+0061  LATIN SMALL LETTER A
U+0072  LATIN SMALL LETTER R
U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
U+0069  LATIN SMALL LETTER I
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0061  LATIN SMALL LETTER A
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+002E  PERIOD, DECIMAL POINT / FULL STOP

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Doran, Michael D 
> Sent: Friday, May 18, 2007 1:17 PM
> To: perl4lib@perl.org
> Subject: RE: MARC::Charset question
> 
> Hi Michael,
> 
> > An example is the author (personal name) of the book that 
> can be found 
> > at http://catalog.loc.gov/ by searching for ISBN
> > 5040039875 (I'm guessing the fact that the website appears to be 
> > displaying a corrupted name may be part of the problem here).
> 
> The Library of Congress catalog is outputting the MARC data 
> to your browser in Unicode UTF-8 and it looks correct to me.  
> It may *appear* corrupted, depending on what font you choose 
> to display the encoding (try Arial Unicode MS if you are in a 
> Windows environment).
> 
> > This name is 'Dontsova, Daria' (approximately),
> 
> Below is the UTF-16 encoding of the name in question, based 
> on a copy-and-paste directly from the browser 
> (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).
> 
> U+0044  LATIN CAPITAL LETTER D
> U+006F  LATIN SMALL LETTER O
> U+006E  LATIN SMALL LETTER N
> U+0074  LATIN SMALL LETTER T
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0073  LATIN SMALL LETTER S
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 006F  
> U+LATIN SMALL LETTER O
> U+0076  LATIN SMALL LETTER V
> U+0061  LATIN SMALL LETTER A
> U+002C  COMMA
> U+0020  SPACE, BLANK / SPACE
> U+0044  LATIN CAPITAL LETTER D
> U+0061  LATIN SMALL LETTER A
> U+0072  LATIN SMALL LETTER R
> U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
> U+0069  LATIN SMALL LETTER I
> U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
> U+0061  LATIN SMALL LETTER A
> U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF 002E  
> U+PERIOD, DECIMAL POINT / FULL STOP
> 
> 
> > ... in hex:
> > 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> > When transcoded by marc8_to_utf8() the result is 
> > 446f6e74cda173006f76612c20446172cab969cda161002e
> > - which contains 2 null (00) characters.
> 
> 44 6f 6e [eb] 74 [ec]    73      6f 76 61 2c 20 44 61 72 [a7]    [eb] 69 [ec]    61      2e
> 44 6f 6e      74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9]      69 [cd a1] 61 [00] 2e
> 
> H.  It looks like the MARC-8 'COMBINING LIGATURE LEFT 
> HALF' ("0xEB") and/or the MARC-8 'COMBINING LIGATURE RIGHT 
> HALF' ("0xEC") got converted to a Unicode 'COMBINING DOUBLE 
> INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]).  That doesn't 
> sound like something that MARC::Charset would do.
> 
> -- Michael
> 
> [1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
> http://www.fileformat.info/info/unicode/char/0361/index.htm
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTE

RE: MARC::Charset question

2007-05-18 Thread Doran, Michael D
Hi Michael,

> An example is the author (personal name) of the book that can 
> be found at http://catalog.loc.gov/ by searching for ISBN 
> 5040039875 (I'm guessing the fact that the website appears to 
> be displaying a corrupted name may be part of the problem here).

The Library of Congress catalog is outputting the MARC data to your browser in 
Unicode UTF-8 and it looks correct to me.  It may *appear* corrupted, depending 
on what font you choose to display the encoding (try Arial Unicode MS if you 
are in a Windows environment).

> This name is 'Dontsova, Daria' (approximately),

Below is the UTF-16 encoding of the name in question, based on a copy-and-paste 
directly from the browser 
(http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?BBID=12550873).

U+0044  LATIN CAPITAL LETTER D
U+006F  LATIN SMALL LETTER O
U+006E  LATIN SMALL LETTER N
U+0074  LATIN SMALL LETTER T
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0073  LATIN SMALL LETTER S
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+006F  LATIN SMALL LETTER O
U+0076  LATIN SMALL LETTER V
U+0061  LATIN SMALL LETTER A
U+002C  COMMA
U+0020  SPACE, BLANK / SPACE
U+0044  LATIN CAPITAL LETTER D
U+0061  LATIN SMALL LETTER A
U+0072  LATIN SMALL LETTER R
U+02B9  SOFT SIGN, PRIME / MODIFIER LETTER PRIME
U+0069  LATIN SMALL LETTER I
U+FE20  LIGATURE, FIRST HALF / COMBINING LIGATURE LEFT HALF
U+0061  LATIN SMALL LETTER A
U+FE21  LIGATURE, SECOND HALF / COMBINING LIGATURE RIGHT HALF
U+002E  PERIOD, DECIMAL POINT / FULL STOP


> ... in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e.
> When transcoded by marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e
> - which contains 2 null (00) characters.

44 6f 6e [eb] 74 [ec]    73      6f 76 61 2c 20 44 61 72 [a7]    [eb] 69 [ec]    61      2e
44 6f 6e      74 [cd a1] 73 [00] 6f 76 61 2c 20 44 61 72 [ca b9]      69 [cd a1] 61 [00] 2e

H.  It looks like the MARC-8 'COMBINING LIGATURE LEFT HALF' ("0xEB") and/or 
the MARC-8 'COMBINING LIGATURE RIGHT HALF' ("0xEC") got converted to a Unicode 
'COMBINING DOUBLE INVERTED BREVE' ("0xCD 0xA1" in UTF-8 [1]).  That doesn't 
sound like something that MARC::Charset would do.
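A quick sanity check of the byte reading above: the pair CD A1 really does decode to U+0361, whereas the expected U+FE20 would have been the three octets EF B8 A0 in UTF-8. A sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the observed 2-byte sequence back to a code point...
printf "U+%04X\n", ord decode('UTF-8', "\xCD\xA1");               # U+0361

# ...and show what the correct half-ligature would look like in UTF-8.
printf "%s\n", uc unpack 'H*', Encode::encode_utf8("\x{FE20}");   # EFB8A0
```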

-- Michael

[1] Unicode Character 'COMBINING DOUBLE INVERTED BREVE' (U+0361)
http://www.fileformat.info/info/unicode/char/0361/index.htm

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 18, 2007 5:49 AM
> To: perl4lib@perl.org; [EMAIL PROTECTED]
> Subject: MARC::Charset question
> 
> Hi,
> 
> I'm using marc8_to_utf8() on Library of Congress data. I'm 
> finding that I get occasional null characters inserted in the 
> output text, and I'm wondering what this means.
> 
> An example is the author (personal name) of the book that can 
> be found at http://catalog.loc.gov/ by searching for ISBN 
> 5040039875 (I'm guessing the fact that the website appears to 
> be displaying a corrupted name may be part of the problem here).
> 
> This name is 'Dontsova, Daria' (approximately), in hex:
> 446f6eeb74ec736f76612c20446172a7eb69ec612e. When transcoded by
> marc8_to_utf8() the result is
> 446f6e74cda173006f76612c20446172cab969cda161002e - which 
> contains 2 null (00) characters.
> 
> Is it safe to ignore these null characters (i.e. strip them 
> out of the result, which otherwise seems good)?
> 
> Thanks,
> 
> Michael
> 


RE: Working around a UTF8/Unicode encoding problem

2007-05-15 Thread Doran, Michael D
> > I can also see that this record is broken because the XML entity 
> > &apos; is in a MARC communications format file.
>
> The character entity &apos; *is valid* in a MARC-XML file.  
> It is one of the few standard character entities allowed in 
> an XML file, e.g., &amp;, &lt;, &gt;, and &apos;.

A recent MARC Proposal recommends the use of Numeric Character References as an 
alternative for unmappable characters when converting from Unicode to MARC-8 in 
MARC21 records [1].  The "&apos;" entity is not a *numeric* character 
reference, but I'm just mentioning this as an FYI in case you start seeing the 
numeric character entities in MARC communications format files.

-- Michael

[1] MARC PROPOSAL NO. 2006-09
http://www.loc.gov/marc/marbi/2006/2006-09.html
"SUMMARY: This paper specifies a lossless technique utilizing Numeric Character 
References for converting unmappable characters when going from Unicode to 
MARC-8 for systems that cannot handle Unicode encoding. It is intended to be an 
alternative to the lossy technique approved in 2006-04. The MARC advisory 
committee recommended that both a lossy and a lossless technique be officially 
adopted."

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Houghton,Andrew [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 14, 2007 9:56 AM
> To: perl4lib@perl.org
> Subject: RE: Working around a UTF8/Unicode encoding problem
> 
> > From: Jason Ronallo [mailto:[EMAIL PROTECTED]
> > Sent: 12 May, 2007 16:52
> > To: William Denton
> > Cc: perl4lib@perl.org
> > Subject: Re: Working around a UTF8/Unicode encoding problem
> > 
> > I can also see that this record is broken because the XML entity 
> > &apos; is in a MARC communications format file.
> 
> The character entity &apos; *is valid* in a MARC-XML file.  
> It is one of the few standard character entities allowed in 
> an XML file, e.g., &amp;, &lt;, &gt;, and &apos;.
> 
> 
> Andy.
> 


Character set tests [was MARC::Charset]

2007-03-14 Thread Doran, Michael D
Hi Ashley,

Thanks for the info!  Trying to keep up with i18n and/or character set stuff is 
almost a full time job.

> > How are you testing for UTF-8?
> 
> There's a handy perl regexp on the W3C web site at:
> 
> http://www.w3.org/International/questions/qa-forms-utf-8
> 
> You'll need to change the ASCII part of the regexp to something like:
> 
> [\x01-\x7e]
> 
> This will more than accommodate for the various control 
> characters you can find in MARC records (don't forget Esc as 
> the lead in to Greek, Cyrillic, etc.)

In a MARC UCS/Unicode UTF-8 environment, the Esc (0x1B) character doesn't serve 
any purpose, since it is not necessary to escape to the alternate MARC-8 
character sets (the aforementioned Greek, Cyrillic, etc.).  My understanding is 
that a proper conversion from MARC-8 to UTF-8 should remove any escape 
sequences.  I believe that the only other 'C0' control characters allowed in 
MARC records are these [1]:

 hex   MARC control name      ASCII control name    Unicode control name
 ----  ---------------------  --------------------  -----------------------------
 0x1D  [RECORD TERMINATOR]    [GROUP SEPARATOR]     [INFORMATION SEPARATOR THREE]
 0x1E  [FIELD TERMINATOR]     [RECORD SEPARATOR]    [INFORMATION SEPARATOR TWO]
 0x1F  [SUBFIELD DELIMITER]   [UNIT SEPARATOR]      [INFORMATION SEPARATOR ONE]

So, I'm wondering if for MARC record testing, it would make sense to tighten up 
the ASCII part of the regexp a bit to this:

[\x1D-\x7E]
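
Here is how that tightened test might look in Perl -- a sketch adapted from the W3C regexp, with the ASCII alternative narrowed to \x1D-\x7E; `looks_like_marc_utf8` is my name, not anything standard:

```perl
# A sketch (my adaptation, not the W3C original): the qa-forms-utf-8
# regexp with its ASCII alternative tightened to the range of
# characters MARC actually allows, \x1D-\x7E.
sub looks_like_marc_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A(?:
          [\x1D-\x7E]                         # MARC separators + printable ASCII
        | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x;
}
```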

-- Michael

[1] MARC21 > Code Table Basic Latin (ASCII)
http://lcweb2.loc.gov/cocoon/codetables/42.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 10:52 AM
> To: Doran, Michael D
> Cc: perl4lib
> Subject: Re: MARC::Charset
> 
> Michael,
> 
> >> So, basically, you either need prior knowledge about the actual 
> >> character encoding used, or you have to test. Testing for UTF-8 is 
> >> fairly straightforward...
> > 
> > How are you testing for UTF-8?
> 
> There's a handy perl regexp on the W3C web site at:
> 
> http://www.w3.org/International/questions/qa-forms-utf-8
> 
> You'll need to change the ASCII part of the regexp to something like:
> 
> [\x01-\x7e]
> 
> This will more than accommodate for the various control 
> characters you can find in MARC records (don't forget Esc as 
> the lead in to Greek, Cyrillic, etc.)
> 
> The W3C regexp tests the whole string -- which may be 
> inefficient if you are testing lots of data. Depending on 
> what sort of accuracy you want and whether or not overlong 
> UTF-8 sequences are a concern, you could just test for the following:
> 
> [\xc2-\xf4][\x80-\xbf]
> 
> The Wikipedia page on UTF-8 is worth a read.
> 
> >> Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
> >> As a test for MARC-8 I look for the common combining diacritics 
> >> followed by a vowel.
> > 
> > Do you have a programmatic way to do that test, or are you 
> "eye-balling" the records.
> 
> I use a simple regexp:
> 
>([\xe1-\xe3][aeiouAEIOU]|\xf0[cC])
> 
> which may be rather too simple. For a critical application 
> I'd come up with something a bit better (after first 
> eye-balling a load of records.)
> 
> Just as an aside, I'm not using perl -- I'm using the Boost 
> Regexp library for C++ (which is a good implementation of 
> perl regexps.)
> 
> Regards,
> 
> Ashley.
> -- 
> Ashley Sanders   [EMAIL PROTECTED]
> Copac http://copac.ac.uk A MIMAS Service funded by JISC
> 


RE: MARC::Charset

2007-03-14 Thread Doran, Michael D
Hi Ashley,

> I think &#x3039; is now legal in MARC-8 to indicate a 
> Unicode character that isn't in the MARC-8 repertoire.

Yes, that's also my understanding [1,2], though I've not personally come across 
any records yet that use that method.  (Although not being a cataloger, I don't 
routinely examine a lot of MARC records.)

> So, basically, you either need prior knowledge about the 
> actual character encoding used, or you have to test. Testing 
> for UTF-8 is fairly straightforward...

How are you testing for UTF-8?

> Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
> As a test for MARC-8 I look for the common combining diacritics
> followed by a vowel.

Do you have a programmatic way to do that test, or are you "eye-balling" the 
records?

Since MARC-8, Latin-1, and UTF-8 all share the same single-octet encodings for 
the ASCII repertoire of characters, it can be a bit of a problem determining 
the character set for a batch of MARC records for English-language items, due 
to the paucity of combining accent characters.  And the fact, as you point 
out, that you cannot always trust the MARC leader 09 position, and that you 
might in fact have a batch encoded in more than one character set, makes it 
even more interesting. 
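
A crude way to combine the tests discussed in this thread -- the byte ranges here are heuristics I've assumed for illustration, not anything authoritative:

```perl
# Heuristic guesser, roughly the strategy this thread converges on.
# The byte patterns are assumptions for illustration, not a standard.
sub guess_marc_encoding {
    my ($rec) = @_;
    # UTF-8 lead byte followed by a continuation byte
    return 'utf8'  if $rec =~ /[\xC2-\xF4][\x80-\xBF]/;
    # MARC-8 combining diacritic (ANSEL 0xE0-0xFE) before a vowel
    return 'marc8' if $rec =~ /[\xE0-\xFE][aeiouAEIOU]/;
    # Pure ASCII: MARC-8, Latin-1 and UTF-8 are indistinguishable
    return 'ascii' if $rec !~ /[\x80-\xFF]/;
    return 'unknown';
}
```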

-- Michael

[1] MARC PROPOSAL NO. 2006-04: Technique for conversion of Unicode to MARC-8 
http://www.loc.gov/marc/marbi/2006/2006-04.html

[2] MARC PROPOSAL NO. 2006-09: Lossless technique for conversion of Unicode to 
MARC-8
http://www.loc.gov/marc/marbi/2006/2006-09.html

Plug: For more resources on character sets, with an emphasis on library 
automation, see
 - Coded Character Sets
   http://rocky.uta.edu/doran/charsets/
 - and especially
   http://rocky.uta.edu/doran/charsets/resources.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 4:59 AM
> Cc: perl4lib
> Subject: Re: MARC::Charset
> 
> > Your MARC records appear to be encoded in MARC-8 as evidenced by 
> > "ergáo" in which the combining accent character comes before the 
> > character to be modified.  I.e. the byte string that displays as 
> > "ergáo" in your email would display as "ergò" (with a Latin 
> small letter o with grave) in a MARC-8 aware client.
> 
> I'd just like to relate my recent experiences of retrieving 
> MARC21 records through various library Z39.50 servers. Put 
> simply, you cannot trust the MARC leader character
> 9 to correctly indicate the character set used.
> 
>  From libraries that have set the leader to indicate the 
> records are in the MARC-8 character set, I have retrieved 
> records encoded as Latin-1, UTF-8 and MARC-8.
> 
>  From libraries that set the leader to indicate Unicode, I 
> get records in MARC-8 and UTF-8.
> 
> You also get encodings in MARC-8 records like \1EF6 to 
> indicate a Unicode character.
> I think &#x3039; is now legal in MARC-8 to indicate a 
> Unicode character that isn't in the MARC-8 repertoire.
> 
> So, basically, you either need prior knowledge about the 
> actual character encoding used, or you have to test. Testing 
> for UTF-8 is fairly straightforward and a long string of text 
> (which admittedly you don't tend to get in MARC
> records) that
> tests as UTF-8 is very unlikely to be anything else. Distinguishing
> Latin-1 from
> MARC-8 is a bit more like guess work. As a test for MARC-8 I 
> look for the common combining diacritics followed by a vowel.
> 
> Regards,
> 
> Ashley.
> -- 
> Ashley Sanders   [EMAIL PROTECTED]
> Copac http://copac.ac.uk A MIMAS Service funded by JISC
> 


RE: MARC::Charset

2007-03-14 Thread Doran, Michael D
Hi Henri-Damien,

> And any LOWERCASE DIGRAPH AE or UPPERCASE DIGRAPH AE or 
> LOWERCASE DIGRAPH OE is not well encoded. Encoding is 
> **assumed** to be latin1 translated into utf-8 in the 
> catalogue I am working on but appears respectively µ, ¥,¶
> in biblios.

 char  hex   MARC-8                 ISO-8859-1 (Latin-1)
 ----  ----  ---------------------  --------------------
  µ    0xB5  LOWERCASE DIGRAPH AE   MICRO SIGN
  ¥    0xA5  UPPERCASE DIGRAPH AE   YEN SIGN
  ¶    0xB6  LOWERCASE DIGRAPH OE   PILCROW SIGN
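
The mismatch is easy to demonstrate with core Encode -- a minimal sketch, using the octet values tabulated above:

```perl
# A minimal demonstration of the mislabeling: the octet 0xB5 is
# LOWERCASE DIGRAPH AE in MARC-8, but decoding the same octet as
# Latin-1 yields MICRO SIGN -- which is why "ae" shows up as "µ".
use Encode qw(decode);

my $octet     = "\xB5";                         # byte from the MARC-8 record
my $as_latin1 = decode('ISO-8859-1', $octet);   # wrong assumption applied
printf "U+%04X\n", ord($as_latin1);             # prints U+00B5 (MICRO SIGN)
```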

> Is there a way to fix things up ?

If the underlying numerical encoding in your MARC records for the digraphs in 
question is hex 0xB5, 0xA5, and 0xB6, then the character set is not Latin-1; it 
is MARC-8.  If that is the case, I don't believe that anything needs to be 
fixed; if you are using MARC::Charset to convert the records from MARC-8 to 
UTF-8, it should work.

However, it may also be that I am misunderstanding the issue.  It would help if 
you could provide the pertinent Perl code you are using for the character set 
translation and a couple of the MARC records with digraphs that are failing.

> ... but appears respectively µ, ¥,¶ in biblios.

Please excuse my ignorance, but what is 'biblios' in the context of this 
discussion?

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, March 14, 2007 4:18 AM
> To: Doran, Michael D; perl4lib
> Subject: Re: MARC::Charset
> 
> Doran, Michael D a écrit :
> > Hi Henri,
> >   
> > Although in my email client, the character in question 
> appears as a MICRO SIGN ("µ"), I am assuming that it is 
> actually meant to be a LOWERCASE DIGRAPH AE ("æ") since that 
> is consistent with the Latin vernacular text in your record.  
> In MARC-8, the LOWERCASE DIGRAPH AE character is a 
> precomposed character represented by 0xB5 in hex [1].  You 
> mention that you are using MARC::File::XML which in turn uses 
> MARC::Charset.  I'm wondering if there is some confusion as 
> to the expected encoding of the MARC records being 
> processed/converted?  If MARC::Charset is expecting MARC21 
> Unicode/UCS encoded records, but is actually getting MARC-8 
> encoded records, then in that context it likely wouldn't know 
> what to do with the 0xB5 octet and that might be the cause of 
> the error you are seeing.
> >
> > -- Michael
> >
> > [1] Your MARC records appear to be encoded in MARC-8 as 
> evidenced by "ergáo" in which the combining accent character 
> comes before the character to be modified.  I.e. the byte 
> string that displays as "ergáo" in your email would display 
> as "ergò" (with a Latin small letter o with grave) in a 
> MARC-8 aware client.
> >   
> >   
> Thanks for your answer.
> Well, this could be a precious hint.
> Indeed, in that catalogue I want to process, some books are 
> ancient books and were catalogued from OCLC or SUDOC.
> And any LOWERCASE DIGRAPH AE or UPPERCASE DIGRAPH AE or 
> LOWERCASE DIGRAPH OE is not well encoded. Encoding is 
> **assumed** to be latin1 translated into utf-8 in the 
> catalogue I am working on but appears respectively µ, ¥,¶ in biblios.
> 
> Is there a way to fix things up ?
> 
> --
> Henri Damien LAURENT et Paul POULAIN
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
> 


RE: MARC::Charset

2007-03-13 Thread Doran, Michael D
Hi Henri,

> MARC::Charset ... fails on each µ character.

> ad Scripturµ sensum

Although in my email client, the character in question appears as a MICRO SIGN 
("µ"), I am assuming that it is actually meant to be a LOWERCASE DIGRAPH AE 
("æ") since that is consistent with the Latin vernacular text in your record.  
In MARC-8, the LOWERCASE DIGRAPH AE character is a precomposed character 
represented by 0xB5 in hex [1].  You mention that you are using MARC::File::XML 
which in turn uses MARC::Charset.  I'm wondering if there is some confusion as 
to the expected encoding of the MARC records being processed/converted?  If 
MARC::Charset is expecting MARC21 Unicode/UCS encoded records, but is actually 
getting MARC-8 encoded records, then in that context it likely wouldn't know 
what to do with the 0xB5 octet and that might be the cause of the error you are 
seeing.

-- Michael

[1] Your MARC records appear to be encoded in MARC-8 as evidenced by "ergáo" in 
which the combining accent character comes before the character to be modified. 
 I.e. the byte string that displays as "ergáo" in your email would display as 
"ergò" (with a Latin small letter o with grave) in a MARC-8 aware client.

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -Original Message-
> From: Henri-Damien LAURENT [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, March 13, 2007 5:26 AM
> To: perl4lib@perl.org
> Subject: MARC::Charset
> 
> Hi there,
> I have a problem with MARC::Charset.
> It fails on each µ character.
> This is quite a pain in the neck.
> So I wanted to add some correspondence to MARC::Charset::Table.
> But whatever I tried failed.
> Is there a way to add a character to MARC::Charset Table ?
> Or to *ignore_errors* that is : pass subfield as is, when 
> MARC::File::XML uses MARC::Charset ?
> 
> no mapping found at position 7 in Ferrariµ at 
> /usr/lib/perl5/site_perl/5.8.8/MARC/Charset.pm line 194.
> .no mapping found at position 43 in seu, De 
> comaediis collegiorum in Gallia, prµsertim ineunte sexto 
> decimo sµculo, disquisitionem at 
> /usr/lib/perl5/site_perl/5.8.8/MARC/Charset.pm line 194.
> ...no mapping found at position 9 
> in Petri Cunµi De republica Hebrµorum libri tres at 
> /usr/lib/perl5/site_perl/5.8.8/MARC/Charset.pm line 194.
> no mapping found at position 71 in variis annotationibus, 
> cuivis literato scitu necessariis, & ad Scripturµ sensum 
> eruendum utilissimis illustrati, nunc primum publici boni 
> ergáo in lucem
> 
> --
> Henri Damien LAURENT et Paul POULAIN
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
> 


RE: MARC Records, XML, and encoding

2006-05-18 Thread Doran, Michael D
> So I took a look at that position in the marc record and 
> found a 0x9C character at that position, as the error
> message indicates. I can't find a 0x9C in either of the
> mapping tables that this record purports to use:

0x9C is a C1 control character that is generally assigned the function
of STRING TERMINATOR and like Ed states is not a valid MARC-21
character.  Only a small subset of the C0 and C1 control characters are
allowed for in the MARC-21 standard:

 Character  Function (in MARC-21)
 ---------  ---------------------
 0x1B       ESCAPE
 0x1D       RECORD TERMINATOR
 0x1E       FIELD TERMINATOR
 0x1F       SUBFIELD DELIMITER
 0x88       NON-SORT BEGIN
 0x89       NON-SORT END
 0x8D       JOINER
 0x8E       NON-JOINER
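
If bad records have to be salvaged, one option is to scrub everything outside that set before conversion -- a sketch, assuming it is acceptable for your data to simply drop the offending octets:

```perl
# Sketch: scrub control characters that MARC-21 does not allow,
# keeping only the eight listed above.  Assumes deleting the bad
# octets outright is acceptable for your data.
sub scrub_marc_controls {
    my ($s) = @_;
    $s =~ s/(?![\x1B\x1D\x1E\x1F\x88\x89\x8D\x8E])[\x00-\x1F\x80-\x9F]//g;
    return $s;
}
```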

> This character conversion stuff is a major pain.

"An apparently simple subject which 
turns out to be brutally complicated"
-- in reference to coded character sets

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Edward Summers [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 18, 2006 11:17 AM
> To: perl4lib
> Subject: Re: MARC Records, XML, and encoding
> 
> So I got curious (thanks to your convo in #code4lib). I isolated the  
> problem to one record:
> 
>   http://www.inkdroid.org/tmp/one.dat
> 
> Your roundtrip conversion complains:
> 
> --
> 
> no mapping found at position 8 in Price : <9c> 7.99;Inv.#  B  
> 476913;Date   06/03/98; Supplier : Dawson UK;  Recd 20/03/98;   
> Contents : 1. The problem : 1. Don't bargain over positions;  2.  
> The method : 2. Separate the people from the problem; 3.  
> Focus on interests, not positions; 4. Invent options for mutual  
> gain; 5. Insist on using objective criteria;  3. Yes, but :  
> 6. What if they are more powerful? 7. What if they won't  
> play? 8. What if they use dirty tricks?  4. In conclusion;  5.  
> Ten questions people ask about getting to yes; g0=ASCII_DEFAULT  
> g1=EXTENDED_LATIN at /usr/local/lib/perl5/site_perl/5.8.7/MARC/ 
> Charset.pm line 126.
> 
> --
> 
> So I took a look at that position in the marc record and 
> found a 0x9C  
> character at that position, as the error message indicates. I can't  
> find a 0x9C in either of the mapping tables that this record 
> purports  
> to use:
> 
> BasicLatin (ASCII): http://lcweb2.loc.gov/cocoon/codetables/42.html
> Extended Latin (ANSEL): 
> http://lcweb2.loc.gov/cocoon/codetables/45.html
> 
> Looks like you might want to preprocess those records before  
> translating. Since this character routinely occurs in the 586 field  
> you could use MARC::Record to remove the offending character before  
> writing as XML.
> 
> Hope that helps somewhat. This character conversion stuff is a major  
> pain.
> 
> //Ed
> 


RE: Z39.50 Module

2005-12-09 Thread Doran, Michael D
Hi Jane,

If you don't get an answer on this list, you might consider a posting to
the Net-z3950 list [1].

-- Michael

[1] Net-z3950 mailing list
[EMAIL PROTECTED]
http://www.indexdata.dk/mailman/listinfo/net-z3950

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> Sent: Friday, December 09, 2005 7:51 AM
> To: perl4lib@perl.org
> Subject: Z39.50 Module
> 
> Hi folks,
> 
>  
> 
> Is anyone using the Z39.50 Module on a Windows System?
> 
>  
> 
> I'm trying to, so far without success. I am using Windows 2000.  I am
> working with Perl on D: (a hard drive partition), not C:,
> because our IT
> department, sensibly, likes to keep C: for the standard installation.
> First I tried the basic Perl.  At makefile.pl I get the error:
> 
>  
> 
> D:\Perl\Net-Z3950-0.50>perl Makefile.pl
> 
> 'yaz-config' is not recognized as an internal or external command,
> 
> operable program or batch file.
> 
> 'yaz-config' is not recognized as an internal or external command,
> 
> operable program or batch file.
> 
> ERROR: Unable to call script 'yaz-config': is YAZ installed? at
> Makefile.pl line
> 
>  12.
> 
>  
> 
> I thought I installed YAZ to a folder entitled D:\Perl\YAZ yet perhaps
> this is not where it is expected in the file structure.
> 
>  
> 
> Reading on in the documentation
> (http://perl.z3950.org/support/windows.html) I find that:
> 
> 
> 
> "Despite one or two apocryphal success stories (and I'd love to get a
> definitive one), most people who have tried to build the Net::Z3950
> module on Microsoft's Windows operating systems have found that they
> can't get the Event module (a prerequisite) built."
> 
>  
> 
> I don't think I got quite that far, but decided it might be 
> time to try
> plan B and install VBZOOM anyway.  I couldn't get that to work either.
> Here I get the error message:
> 
>  
> 
> "Component 'RICHTX32.OCX' or one of its dependencies not correctly
> registered; a file is missing or invalid."
> 
>  
> 
> I found RICHTX32.OCX, no clue what "its dependencies" might be.  I
> thought perhaps RICHTX32.OCX was not where it was expected 
> and could be
> copied to another location, but where would that be?
> 
>  
> 
> Thanks in advance to anyone who can help.
> 
> JJ
> 
>  
> 
>  
> 
>  
> 
> **Views expressed by the author do not necessarily represent those of
> the Queens Library.**
> 
> Jane Jacobs
> 
> Asst. Coord., Catalog Division
> 
> Queens Borough Public Library
> 
> 89-11 Merrick Blvd.
> 
> Jamaica, NY 11432
> 
> tel.: (718) 990-0804
> 
> e-mail: [EMAIL PROTECTED]
> 
> FAX. (718) 990-8566
> 
>  
> 
> 


RE: MARC-8 to UTF-8 conversion

2005-12-05 Thread Doran, Michael D
> If anyone has any suggestions on how to handle a
> largish character mapping table [...]

For those who aren't familiar with the MARC 21 alternate character set
repertoires (specifically, the East Asian ideographs), by "largish", Ed
means a table containing upwards of 16,000 mappings.  

> Perhaps at the very least I can include some
> information about DB_File difficulties prominently in the
> documentation in the new version.

That would definitely be helpful.  It would also be helpful if error
messages that get kicked out during an automated install ("perl -MCPAN
-e 'install MARC::Charset'") due to missing DB_File prereq components
were more informative as to the problem.  Below are the error messages
that users get now:

BEGIN failed--compilation aborted at lib/MARC/Charset.pm line
12.
Compilation failed in require at Makefile.PL line 7.
BEGIN failed--compilation aborted at Makefile.PL line 7.
Running make test
  Make had some problems, maybe interrupted? Won't test
Running make install
  Make had some problems, maybe interrupted? Won't install

I'm probably starting to sound nit-picky, but please understand that
it's only because I think MARC::Charset is a great module and I'd like
for more people to be using it.  :-)

-- Michael

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On 
> Behalf Of Ed Summers
> Sent: Monday, December 05, 2005 12:14 PM
> To: perl4lib@perl.org
> Subject: Re: MARC-8 to UTF-8 conversion
> 
> On 12/5/05, Doran, Michael D <[EMAIL PROTECTED]> wrote:
> > So... this is all very interesting (and I've definitely learned
> > something here), but like I suggested previously, this level of
digging
> > may be a bit beyond the "casual" Perl user.  ;-)
> 
> Yep, point taken. I'm guessing you are right: when you built perl from
> source it couldnt't find BerkeleyDB so didn't install DB_File. Good to
> know for the future. If anyone has any suggestions on how to handle a
> largish character mapping table when someone does a:
> 
> use MARC::Charset;
> 
> I'm open to suggestions. Perhaps at the very least I can include some
> information about DB_File difficulties prominently in the
> documentation in the new version.
> 
> Thanks!
> //Ed
> 


RE: MARC-8 to UTF-8 conversion

2005-12-05 Thread Doran, Michael D
Ed,

ED > I don't really understand why Perl 5.8.7 lacked DB_File since
ED > Module::CoreList [...] reports it being standard since 5.00307.
ED > Perhaps this is some sort of emasculated version that ships
ED > with Solaris :-)

Nope, I wasn't using a "Perl lite" version. ;-) 

Although Solaris now comes with Perl (v.5.6.1 for Solaris 9), when I set
up a server I always download and compile from the latest Perl source
(v.5.8.7 in this case) and that's the Perl version I use for my scripts
and the version to which I add any non-standard modules.  I did a plain
vanilla "all defaults" Perl installation [1].

ED > It looks like the user tried to install DB_File without having the
ED > BerkeleyDB libraries from SleepyCat already installed.

It *is* true that I tried to install the DB_File.pm module without
having installed any software from SleepyCat (although I *did* check for
Berkeley DB libraries installed as Solaris packages and was hoping that
they would suffice [2]).  If DB_File.pm requires software from SleepyCat
as part of the compile/installation process, I'm curious how it could be
installed as a core module.  However, the DB_File documentation on CPAN
(Paul Marquess > DB_File-1.814 > DB_File)  states this:

AVAILABILITY 
DB_File comes with the standard Perl source distribution.
Look in the directory ext/DB_File.

...and I rechecked my Perl source and it *is* in that directory.

However, the Perl "Configure" script appears to do some system checking
in regard to what Berkeley DB components are available [3].  I'm
guessing the necessary Berkeley DB stuff is installed by default on some
platforms, and thus DB_File.pm gets installed when compiling/installing
Perl on those platforms.  Not apparently, however, the case for my
particular Solaris/Perl setup (nor is it included in the Solaris OS
package v5.6.1 Perl version), although I'd be interested in the
experiences of other Solaris sites in installing MARC::Charset.
Although for security reasons, we no longer install the Solaris "Entire"
metacluster, I did do a Solaris package search for "Berkeley DB"
(http://rocky.uta.edu/doran/pkginfo/search.cgi) and I installed the only
two relevant Solaris packages I found prior to trying to install the
DB_File.pm module from the source code.

So... this is all very interesting (and I've definitely learned
something here), but like I suggested previously, this level of digging
may be a bit beyond the "casual" Perl user.  ;-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

[1] hostname:/> /opt/bin/perl -v

This is perl, v5.8.7 built for sun4-solaris

Copyright 1987-2005, Larry Wall

Perl may be copied only under the terms of either the Artistic License
or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to
the
Internet, point your browser at http://www.perl.org/, the Perl Home
Page.

bullwinkle:/opt/lib/perl5> /opt/bin/perl -V
Summary of my perl5 (revision 5 version 8 subversion 7) configuration:
  Platform:
osname=solaris, osvers=2.9, archname=sun4-solaris
uname='sunos bullwinkle 5.9 generic_118558-06 sun4u sparc
sunw,ultra-enterprise '
config_args='-de'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
  Compiler:
cc='gcc', ccflags ='-fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
optimize='-O',
cppflags='-fno-strict-aliasing -pipe'
ccversion='', gccversion='2.95.3 20010315 (release)',
gccosandvers='solaris2.9'
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
  Linker and Libraries:
ld='gcc', ldflags =' '
libpth=/usr/lib /usr/ccs/lib
libs=-lsocket -lnsl -ldl -lm -lc
perllibs=-lsocket -lnsl -ldl -lm -lc
libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
  Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-fPIC', lddlflags='-G'


Characteristics of this binary (from libperl): 
  Compile-time options: USE_LARGE_FILES
  Built under solaris
  Compiled at Jun 13 2005 10:21:04
  @INC:
/opt/lib/perl5/5.8.7/sun4-solaris
/opt/lib/perl5/5.8.7
/opt/lib/perl5/site_perl/5.8.7/sun4-solaris
/opt/lib/perl5/site_perl/5.8.7
/opt/lib/perl5/site_perl
.

[2] From previous message:
> The DB_Fi

RE: MARC-8 to UTF-8 conversion

2005-12-05 Thread Doran, Michael D
Hi Ed,

> -Original Message-
> From: Edward Summers [mailto:[EMAIL PROTECTED] 
> Sent: Monday, December 05, 2005 6:14 AM
> To: perl4lib
> Subject: Re: MARC-8 to UTF-8 conversion
> 
> On Dec 2, 2005, at 9:01 AM, Doran, Michael D wrote:
> 
> > Installing the MARC::Charset module can be a bit problematic for  
> > the casual Perl user, due to the prerequisites.
> 
> Is DB_File a big deal as a prerequisite? it's been in Perl since  
> 5.00307. The other prereq is perl 5.8, but doing unicode work in  
> Perls lower than that isn't really a good idea.
> 
> I only ask because the new version of MARC::Charset currently 
> has the same dependencies, but I'd like ot make it easier to
> install if possible.
> 
> //Ed
> 

Below is a previous off-list email regarding my experiences earlier this
year trying to install MARC::Charset and was the basis for the above
editorial comment.  I include myself in the category of "casual" Perl
user, so there may very well be something I did wrong or was
overlooking.  

-- Michael

> -Original Message-
> From: Doran, Michael D 
> Sent: Tuesday, June 14, 2005 4:14 PM
> To: 'Ed Summers'
> Subject: MARC::Record v2.0 & MARC::Charset
> 
> Hi Ed,
> 
> Just wanted to give some feedback on installation of your 
> MARC modules...
> 
> My environment:
>   Solaris 9 4/04
>   Perl 5.8.7 (configured with all defaults)
> 
> As part of setting up a test server, I am doing a fresh 
> Solaris 9 install.  That in turn gives me an opportunity to 
> install newer Perl stuff.  I installed MARC::Record 2.0 with 
> no problems and am looking forward to taking it for a test 
> ride, but am having some trouble installing MARC::Charset 0.6 
> on a fairly plain vanilla system.
> 
> I first tried an automated install:
> 
> /opt/bin/perl -MCPAN -e 'install MARC::Charset'
> 
> which failed with these messages:
> 
> BEGIN failed--compilation aborted at lib/MARC/Charset.pm line 12.
> Compilation failed in require at Makefile.PL line 7.
> BEGIN failed--compilation aborted at Makefile.PL line 7.
> Running make test
>   Make had some problems, maybe interrupted? Won't test
> Running make install
>   Make had some problems, maybe interrupted? Won't install
> 
> I then downloaded the source tarball and tried a manual 
> install and got this error:
> 
> # /opt/bin/perl Makefile.PL
> Can't locate DB_File.pm in @INC (@INC contains: lib 
> /opt/lib/perl5/5.8.7/sun4-solaris /opt/lib/perl5/5.8.7 
> /opt/lib/perl5/site_perl/5.8.7/sun4-solaris 
> /opt/lib/perl5/site_perl/5.8.7 /opt/lib/perl5/site_perl .) at 
> lib/MARC/Charset.pm line 12.
> BEGIN failed--compilation aborted at lib/MARC/Charset.pm line 12.
> Compilation failed in require at Makefile.PL line 7.
> BEGIN failed--compilation aborted at Makefile.PL line 7.
> 
> Line 12 in Charset.pm is "use DB_File;".  The DB_File 
> dependency wasn't mentioned in the MARC::Charset README, but 
> since it is obviously required, I tried to install it, but 
> ran into a problem there (see output below sig file).  From 
> the output, I wasn't sure if the problem was with the DB_File 
> prerequisites or something else.  The DB_File README says 
> that "Berkeley DB" is a prerequisite.  I have the following 
> Solaris packages installed:
>   SFWbdb   berkeleyDB - Berkeley Database Library
>   SFWdb1   Berkeley DB - database library
> ... but I'm not sure if that constitutes having the "Berkeley 
> DB".  There are no other packages on the Solaris Media Kit 
> (Software 1 of 2; Software 2 of 2; Software Companion) that 
> have the word "Berkeley" in the name or description.  I 
> googled "Berkeley DB" and wound up at http://www.sleepycat.com/.
> I've downloaded a tarball but for now have put aside the project.
> 
> I'm not looking for you to solve any problems here, I'm just 
> alerting you to the fact that the MARC::Charset prerequisites 
> may make installation problematic for the casual Perl user 
> and that that may have an impact on how much that module is used.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/ 
> 
> # /opt/bin/perl -MCPAN -e 'install DB_File'
> CPAN: Storable loaded ok
> Going to read /var/software/.cpan/Metadata
>   Database was generated on Sun, 29 May 2005 15:05:48 GMT
> Running install for module DB_File
> Running make for P/PM/PMQS/DB_File-1.811.tar.gz
> CPAN: Digest::MD5 loaded ok
> Che

RE: MARC-8 to UTF-8 conversion

2005-12-02 Thread Doran, Michael D
Hi Stefano,

Installing the MARC::Charset module can be a bit problematic for the
casual Perl user, due to the prerequisites.  However if you need to do a
MARC-8 to UTF-8 conversion, that's probably the best tool available.

The issue with MARC-8 conversions is that MARC-8 is only really used for
encoding bibliographic records and with its use of combining diacritics
and escape sequences, it is more complex than the typical 8-bit
character set [1].  Most of the software development in the area of
library-centric character sets is done by ILS vendors, who typically
don't make their efforts available in the form of freely available Perl
modules.

You didn't mention why you want to do a character set
conversion.  If you just need a "quick and dirty" conversion for
ephemeral display of bibliographic information on a web page, you might
look at alternatives such as converting from MARC-8 to Latin-1 (ISO
8859-1).  That's a potentially lossy conversion, however if most of your
records are Italian, the Latin-1 repertoire should suffice.  There are
some available Perl routines that should handle that conversion [2].
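
For illustration only -- this is not the routine in [2], just a deliberately crude sketch of what such a lossy MARC-8-to-Latin-1 pass can look like (mappings shown for only three precomposed digraphs; real data needs a full table):

```perl
# Crude, lossy MARC-8 -> Latin-1 sketch (NOT the routine in [2]).
# Handles three precomposed digraphs and throws away escape
# sequences and combining diacritics.
sub marc8_to_latin1_lossy {
    my ($s) = @_;
    $s =~ s/\x1B[\x20-\x2F]*[\x30-\x7E]//g;  # drop MARC-8 escape sequences
    $s =~ s/\xB5/\xE6/g;                     # LOWERCASE DIGRAPH AE -> ae ligature
    $s =~ s/\xA5/\xC6/g;                     # UPPERCASE DIGRAPH AE -> AE ligature
    $s =~ s/\xB6/oe/g;                       # DIGRAPH OE: no Latin-1 slot, spell out
    $s =~ s/[\xE0-\xFE]//g;                  # drop combining diacritics (lossy!)
    return $s;
}
```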

-- Michael

[1] Coded Character Sets: A Technical Primer for Librarians  
http://rocky.uta.edu/doran/charsets/

[2] MARC to Latin: a charset conversion routine in Perl
http://rocky.uta.edu/doran/charset/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: bargioni [mailto:[EMAIL PROTECTED] 
> Sent: Friday, December 02, 2005 4:43 AM
> To: perl4lib@perl.org
> Subject: MARC-8 to UTF-8 conversion
> 
> Hi, I'm trying to convert MARC-8 records to UTF-8 on the fly. 
> MARC::Charset doesn't work for me.
> Any suggestion? Also a command line way can be good for my purposes.
> TIA. Stefano
> -- 
> Dott. Stefano Bargioni
> Pontificia Universita' della Santa Croce - Roma
> Vicedirettore della Biblioteca
> 
> 


RE: yet another character encoding question

2005-09-29 Thread Doran, Michael D
Hi Jason,
 
I believe that MARC::Charset only does MARC-8 to UTF-8 conversion and vice 
versa, so it won't be a solution for automating your Latin-1 to MARC-8 conversion, 
unless you were planning to do Latin-1=>UTF-8=>MARC-8.
 
A few years ago, I wrote an imperfect MARC-8 to Latin-1 character set 
conversion routine [1].  If you can't find any off-the-shelf solution, it may 
serve as a basis for writing a Latin-1 to MARC-8 conversion routine.  Because 
MARC-8 is only really used in "library land" and is somewhat complex, I found 
few available open-source conversion routines (this was before Ed Summers wrote 
MARC::Charset), which is why I wrote my own.
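 
The two-hop route (Latin-1 => UTF-8 => MARC-8) mentioned above can be
sketched as follows.  This assumes MARC::Charset is installed; the sample
string is mine, not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode);
use MARC::Charset qw(utf8_to_marc8);

# $latin1_bytes stands in for a field value pulled from the MySQL
# database, encoded in ISO 8859-1.
my $latin1_bytes = "Garc\xEDa M\xE1rquez";               # "García Márquez"

# Hop 1: Latin-1 bytes -> Perl character string (UTF-8 internally).
my $utf8_string  = decode('iso-8859-1', $latin1_bytes);

# Hop 2: UTF-8 -> MARC-8 (precomposed characters become base letter
# plus combining diacritic in the MARC-8 output).
my $marc8_bytes  = utf8_to_marc8($utf8_string);
```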
 
> During the test install, it says it requires the module DB_File,
> and during the test install of that, it fails 
 
I believe that Berkeley DB is a prerequisite.
 
-- Michael Doran
 
[1] MARC to Latin: A charset conversion routine in Perl
http://rocky.uta.edu/doran/charset/



From: Thomale, J [mailto:[EMAIL PROTECTED]
Sent: Thu 9/29/2005 8:59 AM
To: perl4lib@perl.org
Subject: yet another character encoding question



Hello all,

I'm brand new to this list, and I need some help with a particular
issue. I searched through the mailing list archives but didn't find
anything directly addressing this--despite the seeming popularity of
questions about character sets--so I thought I'd ask.

I've written a perl script that extracts data from a MySQL database,
uses MARC::Record to map that data to MARC, and outputs the MARC record
(based on a script written by Brian Surratt of Texas A&M University).
The resulting records need to have all data encoded in MARC-8 format
(for loading into OCLC and into our local catalog). The data in the
MySQL database is encoded using ISO 8859-1 (latin-1). The MARC records
output by the script work fine so long as they don't contain diacritics
(or other weird stuff). When they do contain diacritics, those
diacritics come out incorrectly when the MARC record is read by a
program expecting MARC-8 (because the diacritics are encoded in
latin-1).

So, is there an easy way to translate from latin-1 encoding to
MARC-8/ANSEL? I've been unable to find any perl modules that help me
with this outside of MARC::Charset. Unfortunately, we're having trouble
getting that module installed on our machine. During the test install,
it says it requires the module DB_File, and during the test install of
that, it fails (not sure what the error message is--I'd have to ask the
admin of that machine). We're running Perl v5.8.3.

FWIW, I did try manually searching/replacing diacritics in the extracted
database fields before converting to MARC and it worked fine (I tried it
on a record that contained Spanish, so there were limited characters
that applied). In order for this approach to be viable, I'd have to map
ALL the latin-1 characters to their MARC-8 counterparts, which would be
a time-consuming process.

On top of this, there are a few records containing the characters hex EF
BF BD, which is the UTF-8 replacement character. I'm a bit mystified as
to where this is coming from, and it would be trivial enough to simply
strip it out, but this approach doesn't guarantee that the script will
catch all non-MARC-8 characters. That's why I'd really prefer to use
MARC::Charset for this--it needs to be robust enough that I won't have
to baby-sit it all the time.

So, I suppose my question is two-fold. 1. Has anyone had similar
problems getting MARC::Charset installed? Could you offer any advice
that I can pass along as to how to get it installed? 2. Are there any
other perl modules that will convert latin-1 to MARC-8/ANSEL?

Thanks in advance for any help you can offer.

Jason Thomale
Metadata Librarian
Texas Tech University Libraries
(806) 742-2240





RE: who help me process BIG5?

2005-08-18 Thread Doran, Michael D
> I have some Excel files by big5 charset. 
> I fetch some column from it and save to usmarc file by using 
> MARC::Record, but I get nothing. 

What is "nothing"?  Does that mean no MARC records were created?  Or
that they are empty?  Or couldn't be imported and/or read in an
integrated library system?  Are you getting any type of error message?
If so what?

> Is it wrong I use MARC::Record? 

I am not an expert regarding MARC::Record so cannot tell you if you
should or shouldn't be using it to process your data, but I will offer
some comments regarding character sets in a MARC environment...

If you are trying to adhere to the MARC 21 standard (formerly USMARC),
then there may be some issues with using text encoded in the BIG5
character set.  The MARC 21 standard allows for encoding in either the
MARC-8 character set or the UCS/Unicode (UTF8) equivalents for the
MARC-8 character repertoire.  While the MARC-8 character set includes
East Asian characters as an alternate (multi-byte) character set, those
character encodings are defined by the NISO Z39.64 [East Asian Character
Code for Bibliographic Use (EACC)] standard which I'm guessing might
have different code points than the BIG5 character set.  It is also my
understanding that you must use escape sequences as specified in the
ANSI X3.41/ISO 2022 standard [Code Extension Techniques for Use with
7-bit and 8-bit Character Sets] to signal the change from the MARC-8
default (Latin) character set to the MARC-8 East Asian alternate
character set within a MARC record. [1,2]
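
For what it's worth, those escape sequences are plain byte sequences.  The
values below reflect my reading of the MARC 21 character set specification
[2] and should be verified against it before relying on them:

```perl
use strict;
use warnings;

# ISO 2022 escape sequences as used in MARC-8 (my reading of the
# MARC 21 spec): ESC $ 1 designates the multibyte EACC set as G0;
# ESC ( B switches G0 back to Basic Latin (ASCII).
my $esc_to_eacc  = "\x1B\x24\x31";   # ESC $ 1
my $esc_to_latin = "\x1B\x28\x42";   # ESC ( B

# A MARC-8 field mixing scripts therefore has this shape (the EACC
# bytes here are invented placeholders, not a real character):
my $mixed = 'Title = ' . $esc_to_eacc . "\x21\x30\x21" . $esc_to_latin . ' /';

printf "escape-to-EACC is %d bytes\n", length($esc_to_eacc);
```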

[1] Coded Character Sets: A Technical Primer for Librarians 
http://rocky.uta.edu/doran/charsets/

[2] MARC 21 Specifications for Record Structure, Character Sets, and
Exchange Media
http://www.loc.gov/marc/specifications/spechome.html

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: sui hm [mailto:[EMAIL PROTECTED] 
> Sent: Friday, August 12, 2005 3:37 AM
> To: perl4lib@perl.org
> Subject: who help me process BIG5?
> 
> I have some Excel files by big5 charset. 
> I fetch some column from it and save to usmarc file by using 
> MARC::Record, 
> but I get nothing. 
> It's correct when I process english excel files. 
> Is it wrong I use MARC::Record? 
>  Please help me.
> Thanks.
> 


RE: new books list

2005-05-27 Thread Doran, Michael D
Hi Kindra,

> I'm attaching a new books script that my predecessor created. 

I didn't see an attached script, but I'll take a shot anyway...

> I'm trying to get this to run, but I'm coming up with errors. 
> I'm almost sure it is because of the upgrade to Unicode (we're
> with Endeavor)

Typically, a Perl DBI/DBD script needs to have certain database related
environment parameters set.  With your Oracle upgrade, those parameters
have changed.

If the script included lines like this:

$ENV{ORACLE_SID} = "LIBR";
$ENV{ORACLE_HOME} = "/oracle/app/oracle/product/8.0.5";

It will need to be changed to this (for Voyager with Unicode):

$ENV{ORACLE_SID} = "VGER";
$ENV{ORACLE_HOME} = "/oracle/app/oracle/product/9.2.0";

> the read-only password changing.

You will want to use the read-only username and password that works with
your Voyager canned reports (i.e. reports.mdb).
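
Putting the pieces together, a connection preamble for the post-upgrade
environment might look like this sketch.  The username and password are
placeholders (the username comes from the thread; neither is a known-good
value):

```perl
#!/m1/shared/bin/perl
use strict;
use warnings;
use DBI;

# Post-upgrade Oracle environment for Voyager with Unicode.
$ENV{ORACLE_SID}  = 'VGER';
$ENV{ORACLE_HOME} = '/oracle/app/oracle/product/9.2.0';

# Placeholder credentials -- substitute the read-only account that
# works with your canned reports (reports.mdb).
my $dbh = DBI->connect('dbi:Oracle:VGER', 'ro_trinitydb', 'password',
                       { RaiseError => 1, PrintError => 0 })
    or die "Database connection not made: $DBI::errstr";
```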

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Kindra Morelock [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 27, 2005 10:15 AM
> To: perl4lib@perl.org
> Subject: new books list
> 
> Hi all,
> 
> I'm attaching a new books script that my predecessor created.  I'm
> trying to get this to run, but I'm coming up with errors.  I'm almost
> sure it is because of the upgrade to Unicode (we're with 
> Endeavor)  with
> the read-only password changing.  I suspect it has something 
> to do with
> the oracle instance being called "VGER" now, but I'm not sure 
> how to fix
> it.  I did change the read-only username/password from 
> "dbread" to ours,
> "ro_trinitydb".
> 
> Would there be a kind soul out there willing to look at this 
> script and
> help me figure out what I need to do to get this to work?  
> When I run it
> from the command line, I get the following errors:
> 
> DBI connect('','ro_trinitydb',...) failed: ORA-01034: ORACLE not
> available
> ORA-27101: shared memory realm does not exist
> SVR4 Error: 2: No such file or directory (DBD ERROR: OCISessionBegin)
> at ./new_books.html line 147
> Database connection not made: ORA-01034: ORACLE not available
> ORA-27101: shared memory realm does not exist
> SVR4 Error: 2: No such file or directory (DBD ERROR: OCISessionBegin)
> at ./new_books.html line 147.
> 
> Thanks,
> Kindra
> 
> ^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^
> Kindra I. Morelock
> Library Systems Administrator
> Trinity International University
> 2065 Half Day Road
> Deerfield, IL  60015
> (847) 317-4021
> 


RE: installing perl 5.8.6

2005-05-19 Thread Doran, Michael D
Kindra,

> Does this mean that I have to change the first
> line of my script to /m1/shared/bin/perl to get
> it to point to 5.8.5 vs. /usr/bin/perl?

You could probably do something with symbolic or hard links, but that
becomes more complicated; so yes, I would just use the
/m1/shared/bin/perl in your scripts.
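
In practice the change is just the shebang line; the rest of the script
stays the same (a sketch, assuming the Voyager-installed Perl path shown in
the dope.sh output below):

```perl
#!/m1/shared/bin/perl
# was: #!/usr/bin/perl  (the Solaris system Perl, 5.005_03)
use strict;
use warnings;
```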

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Kindra Morelock [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 19, 2005 2:20 PM
> To: perl4lib@perl.org
> Subject: RE: installing perl 5.8.6
> 
> Michael,
> 
> You're right, 5.8.5 *is* installed.  Thanks for pointing that out!
> 
> I did download and run your utility.  Here's my output:
> 
> DOPE 0.9.1 beta - Discover Oracle-Perl Environment
> 
> SunOS endeavor 5.9 Generic_118558-05 sun4u sparc SUNW,Sun-Fire-V250 
> 
> Searching for Solaris Perl packages...
>   EISIdbish Perl DBI and DBD::Oracle modules
>   EISInetpl Perl libnet module  
>   SUNWopl5m Perl 5.005_03 Reference Manual Pages
>   SUNWopl5p Perl 5.005_03 (POD Documentation) 
>   SUNWopl5u Perl 5.005_03   
>   SUNWpl5m Perl 5.6.1 Reference Manual Pages
>   SUNWpl5p Perl 5.6.1 (POD Documentation) 
>   SUNWpl5u Perl 5.6.1 (core)  
>   SUNWpl5v Perl 5.6.1 (non-core)  
> 
> Searching for 'perl' executables...
>   /usr/perl5/5.6.1/bin/perl
> Inode#: 3696Version: 5.6.1
>   /usr/perl5/5.00503/bin/perl
> Inode#: 32424   Version: 5.005_03
> ...has these 'after market' modules:
>   DBD::Oracle 1.06
>   DBI 1.14
>   Net ???
>   /m1/shared/perl/5.8.5-09/bin/perl
> Inode#: 4304Version: 5.8.5
> ...has these 'after market' modules:
>   DBD::Oracle 1.15
>   DBI 1.43
>   Note: Identical inode numbers indicate hard-linked files.
>   Note: There may be Perl executables under a different name.
> 
> Searching for Perl symbolic links...
>   /usr/bin/perl -> /usr/perl5/5.00503/bin/perl (5.005_03)
>   /m1/shared/bin/perl -> /m1/shared/perl/5.8.5-09/bin/perl (5.8.5)
> 
> Searching for locations of Perl DBI module(s)...
>   /usr/perl5/site_perl/5.005/sun4-solaris/DBI.pm
>   /m1/shared/perl/5.8.5-09/lib/5.8.5/sun4-solaris/DBI.pm
>   /m1/shared/perl/5.8.5-09/lib/site_perl/5.8.5/sun4-solaris/DBI.pm
>   /m1/shared/perl/5.8.5-09/.cpan/build/DBI-1.43/DBI.pm
>   /m1/shared/perl/5.8.5-09/.cpan/build/DBI-1.43/blib/lib/DBI.pm
>  
> Searching for locations of Perl DBD::Oracle module(s)...
>   /usr/perl5/site_perl/5.005/sun4-solaris/DBD/Oracle.pm
>  
> /m1/shared/perl/5.8.5-09/lib/site_perl/5.8.5/sun4-solaris/DBD/
> Oracle.pm
>   /m1/shared/perl/5.8.5-09/.cpan/build/DBD-Oracle-1.15/Oracle.pm
>  
> /m1/shared/perl/5.8.5-09/.cpan/build/DBD-Oracle-1.15/blib/lib/
> DBD/Oracle.pm
>  
> Searching for Oracle versions on this system...
>   /oracle/app/oracle/product has these versions...
>  9.2.0 9.2.0.3
> 
> 
> Does this mean that I have to change the first line of my script to
> /m1/shared/bin/perl to get it to point to 5.8.5 vs. /usr/bin/perl?
> 
> Thanks,
> Kindra
> 
> >>> "Doran, Michael D" <[EMAIL PROTECTED]> 05/19/05 2:03 PM >>>
> Hi Kindra,
> 
> Although you are only aware of the one Perl installation on your
> server,
> there are probably at least two others.  A new Sun server from
> Endeavor
> will have Solaris 9, which comes with Perl 5.005 and Perl 5.6.  In
> addition, as part of the Voyager software installation, you will have
> Perl 5.8.5 installed as well as some useful non-default modules such
> as
> the DBI and DBD::Oracle modules (try looking in /m1/shared/).  To get
> a
> better idea of what you have on your system, try downloading and
> running
> the "dope.sh" utility [1].
> 
> > how do I uninstall 5.005...
> 
> I don't recommend un-installing the Solaris OS installed versions of
> Perl (unless you really understand the ramifications) since Perl is
> required by the operating system.
> 
> > ...and reinstall 5.8.6
> 
> Installing Perl from the source tarball is fairly straightforward,
> but
> it does require that you have a C compiler already installed on the
> server.  Endeavor doesn't put a C compiler on, but the GNU Compiler
> Collection (gcc) is available on the Solaris software companion CD
> that
> comes as part of the media kit.  As I mentioned above, you shouldn't
> have to install Perl 5.8.6, as it is likely already on your system.
> 
> -- Michael
> 
> [1] dope.sh (Discover Oracle-Perl Environment)

RE: installing perl 5.8.6

2005-05-19 Thread Doran, Michael D
Hi Kindra,

Although you are only aware of the one Perl installation on your server,
there are probably at least two others.  A new Sun server from Endeavor
will have Solaris 9, which comes with Perl 5.005 and Perl 5.6.  In
addition, as part of the Voyager software installation, you will have
Perl 5.8.5 installed as well as some useful non-default modules such as
the DBI and DBD::Oracle modules (try looking in /m1/shared/).  To get a
better idea of what you have on your system, try downloading and running
the "dope.sh" utility [1].

> how do I uninstall 5.005...

I don't recommend un-installing the Solaris OS installed versions of
Perl (unless you really understand the ramifications) since Perl is
required by the operating system.

> ...and reinstall 5.8.6

Installing Perl from the source tarball is fairly straightforward, but
it does require that you have a C compiler already installed on the
server.  Endeavor doesn't put a C compiler on, but the GNU Compiler
Collection (gcc) is available on the Solaris software companion CD that
comes as part of the media kit.  As I mentioned above, you shouldn't
have to install Perl 5.8.6, as it is likely already on your system.

-- Michael

[1] dope.sh (Discover Oracle-Perl Environment) 
http://rocky.uta.edu/doran/dope.html

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Kindra Morelock [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 19, 2005 10:41 AM
> To: perl4lib@perl.org
> Subject: installing perl 5.8.6
> 
> Hi everyone,
> 
> I'm new here, and I'm here because I need some major assistance.
> 
> We recently purchased a new Sun server through Endeavor. 
> Unfortunately, perl 5.005 is installed on it vs. the 5.8.x that we had
> installed on our old server.  My predecessor (who was much more of a
> guru in all things unix/perl than I) programmed several customized
> applications in perl 5.8.x that work on the old server, but
> unfortunately don't work with perl 5.005 apparently.
> 
> My question is:  how do I uninstall 5.005 and reinstall 5.8.6?  I
> downloaded the stable.tar.gz from http://www.perl.com/download.csp.  I
> am fairly familiar with unix commands and some really basic 
> programming
> (I'm more of a VB programmer though).  I get a little anxious when I'm
> messing around with stuff on our server, so if anyone has a 
> resource or
> step-by-step instructions they could forward me, I would really
> appreciate it.
> 
> Thanks,
> Kindra
> 
> ^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^*^
> Kindra I. Morelock
> Library Systems Administrator
> Trinity International University
> 2065 Half Day Road
> Deerfield, IL  60015
> (847) 317-4021
> 
> 
> 


RE: LC call number sorting utilities

2005-03-29 Thread Doran, Michael D
Hi Bryan,

Thanks for the tip. :-)

I'm still a Perl beginner so I often don't code the most efficient way.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Bryan Baldus [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, March 29, 2005 5:22 PM
> To: Doran, Michael D; perl4lib@perl.org
> Subject: RE: LC call number sorting utilities
> 
> On Sunday, March 27, 2005 7:09 PM, Michael Doran wrote:
> >I recently converted a Library of Congress (LC) call number
> >normalization routine (that I had written for a shelf list 
> application)
> >into a couple of Perl LC call number sorting utilities.  
> 
> Thank you for this. It seems to work well (45000+ numbers 
> sorted, a quick
> scroll-through seems to show everything sorted correctly). However, as
> written, it seems to bog down on my machine after a few 
> thousand numbers. 
> 
> Instead of:
> 
> @input_list = (@input_list, $call_no); 
> and 
> @sorted_list = (@sorted_list, $call_no_array{$key});
> 
> perhaps:
> 
> push @input_list, $call_no;
> and
> push @sorted_list, $call_no_array{$key};
> 
> might help to speed things up (it did in my case).
> 
> I hope this helps. Thank you,
> 
> Bryan Baldus
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
> http://home.inwave.com/eija
>  
> 


LC call number sorting utilities

2005-03-27 Thread Doran, Michael D
I recently converted a Library of Congress (LC) call number
normalization routine (that I had written for a shelf list application)
into a couple of Perl LC call number sorting utilities.  

sortLC.pl is a standalone application.  Usage is:
sortLC.pl < call_number_file
- or -
cat call_number_file | sortLC.pl

sortLC.lib is a library to be included in a Perl app.  Usage is:
require "sortLC.lib"; 
@sorted_list = &sortLC(@unsorted_list); 

The sortLC utilities can be downloaded from
http://rocky.uta.edu/doran/sortlc/

Any call number sorting routine is only as good as the underlying
normalization algorithm.  Mine is a work in progress, so results may
vary.   I welcome any feedback so that I can fix the bugs and make
improvements.
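
To give a flavor of what such a normalization has to handle (the classic
problem being that "QA9" must sort before "QA76"), here is a minimal
core-Perl sketch.  It is far less complete than sortLC's routine and is
purely illustrative:

```perl
use strict;
use warnings;

# Minimal LC call-number normalization: pad the class letters and
# zero-pad the classification number so a plain string sort puts
# "QA9" before "QA76".  Real call numbers have many more edge cases.
sub normalize_lc {
    my ($call) = @_;
    my ($class, $num, $rest) =
        $call =~ /^\s*([A-Za-z]+)\s*(\d+(?:\.\d+)?)\s*(.*)$/
        or return uc $call;                  # punt on unparseable input
    return sprintf '%-3s %012.4f %s', uc $class, $num, uc $rest;
}

my @unsorted = ('QA76.73 .P22', 'QA9 .L4', 'QA76 .A1');
my @sorted   = sort { normalize_lc($a) cmp normalize_lc($b) } @unsorted;
print "$_\n" for @sorted;   # QA9 .L4, QA76 .A1, QA76.73 .P22
```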

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


listserv vs. Google Group

2005-03-23 Thread Doran, Michael D
I'm not sure that everybody who subscribes to this listserv is aware
that perl4lib listserv postings end up in the perl.perl4lib Google
Group.  I know that I was a bit surprised to find that out.

Although serving a similar purpose, I make a distinction between
listservs and news groups.  The main distinction being that I assume the
audience for the listserv is the people who explicitly subscribe to the
listserv, whereas the audience for Google Groups are basically the whole
world.  This distinction affects my posting behavior. 

For instance... I use a full "signature" (containing personal
information) for listserv postings, but am more restrictive in my
postings to news groups [1].  I tend to use a familiar and informal tone
with my listserv postings since the responses are often going to
somebody I actually know (if only by reputation).  I wouldn't
necessarily use that tone for responses to a news group posting.  In
addition, I'm more *likely* to post a response to a listserv request for
assistance, knowing (I thought) that any resulting mistakes and
ignorance on my part get a limited distribution.

The fact that perl4lib postings also go to Google Groups should at least
be mentioned in the "WELCOME to perl4lib@perl.org" automated
subscription response (preferably at the top).  Nothing was in there as
of my July 30, 2003 WELCOME message.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


RE: MARC::Record and UTF-8 & related threads

2005-03-07 Thread Doran, Michael D
Hi Ed,

> How would people feel about the next version of MARC-Record (perhaps
> a v2.0) which handled utf8 properly and required a modern perl? 

Definitely a *good* thing.  Worth upgrading Perl version for, if
necessary.
 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could keep
> tallies and report back to the list.

I have MARC::Record installed on two machines:
1) Perl 5.6.1 & MARC::Record 0.94
2) Perl 5.8.5 & MARC::Record 1.4

> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that 
> > the directory length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :)

My understanding of Anne's posting was that the record she tested *did*
contain unicode: "I started with the Unicode version of the record and
modified it...".

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: Monday, March 07, 2005 8:37 AM
> To: perl4lib@perl.org
> Subject: Re: MARC::Record and UTF-8 & related threads
> 
> On Fri, Mar 04, 2005 at 09:18:00AM -0500, Anne L. Highsmith wrote:
> > Here's my main question -- is that the principal
> > concern/question/problem, i.e. that directory lengths will not be
> > computed correctly using the existing MARC::Record module with a
> > Unicode record? Or is it only in certain situations that 
> the directory
> > length would not be computed correctly?
> 
> Yes, but only if the record actually contains unicode :) If you are
> looking for an example of how MARC::Record breaks when there is utf8 
> in the record you can look at t/utf8.t which is a test 
> distributed with
> the MARC-Record package. Currently, this test is skipped 
> because otherwise 
> it would fail.
> 
> > If anyone is inspired to make the necessary updates to the 
> MARC::Record module to handle unicode records, I'd certainly 
> be happy to test. I'd also be eternally grateful, since my 
> alternative might be re-writing 8 or 10 job streams in the 
> next 10 weeks so that I can: 1) export the records from my 
> database in MARC8; 2) edit them; 3) reload them doing a 
> MARC8-Unicode conversion utility provided by the lms vendor.
> 
> I've been meaning to write to the list about this for 
> sometime now. How
> would people feel about the next version of MARC-Record (perhaps a
> v2.0) which handled utf8 properly and required a modern perl? 
> By modern
> perl I mean a version >= 5.8.1. The reason why 5.8.1 is 
> required is that
> it's the first perl with a byte oriented substr() (available via the
> bytes pragma).
> 
> Perhaps if people could respond to the list (or me if you prefer) with
> the version of Perl that you use MARC::Record with I could 
> keep tallies
> and report back to the list.
> 
> //Ed
> 


dope.sh - a shell script for discovery of Oracle-Perl environment

2005-01-12 Thread Doran, Michael D
dope.sh is a shell script that facilitates discovery of the Oracle-Perl
environment on a Unix (Solaris) system [1].  I distribute an open-source
Perl application that incorporates a DBI/DBD::Oracle connection.  The
users that implement the application generally (but not always) have the
requisite DBI/DBD modules but also often have multiple
instances/versions of Perl installed.  When those users have database
connection problems, the first thing I do is try to sort out exactly
what Perl/Oracle components they do and don't have available on their
system.  This can be time consuming since I don't have access to their
systems.  This utility was designed to output the information that I
usually ask users to provide and I now include it with my application
distribution.

If anybody else thinks they can get any use out of it, they are welcome
to use and/or modify it.

[1] dope.sh
http://rocky.uta.edu/doran/dope.html

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Doran, Michael D
A bulletin from the "haste makes waste" department...

>   $ME =~ s/[\xE1-\xFE]//g;
>   $TITLE =~ s/[\xE1-\xFE]//g;

Ooops, that should be "E0" instead of "E1" as the first hex value in the 
substitutions:
   $ME =~ s/[\xE0-\xFE]//g;
   $TITLE =~ s/[\xE0-\xFE]//g;

Sorry,

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Doran, Michael D 
> Sent: Tuesday, January 11, 2005 2:13 PM
> To: perl4lib@perl.org
> Subject: RE: Ignoring Diacritics accessing Fixed Field Data
> 
> Hi Jane,
> 
> These answers assume that the data you are processing:
> 1) is encoded in the MARC-8 character set, and
> 2) consists of the MARC-8 default basic and extended Latin characters.
> 
> > Dave,Ayod\2003
> > Paòt,Kaâs\2002
> > Baks,Dasa\2003
> > ,Viâs\2002
> >
> > Problem 1: As you can see, I don't really want the first four 
> > characters, I want the first four SEARCHABLE characters. How
> > can I tell MARC Record to give me the first four characters, 
> > excluding diacritics?
> 
> Assuming that you are asking how to strip out the MARC-8 
> combining diacritic characters, try inserting the 
> substitution commands listed (as shown below) just prior to 
> the substr commands:
> 
> > my $ME = $field->subfield('a');
>   $ME =~ s/[\xE1-\xFE]//g;
> > my $four100 = substr( $ME, 0, 4 );
> 
> > my $TITLE = $field->subfield('a');
>   $TITLE =~ s/[\xE1-\xFE]//g;
> > my $four245 = substr( $TITLE, 0, 4 );
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/ 
> 
> -----Original Message-----
> > From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> > Sent: Tuesday, January 11, 2005 12:30 PM
> > To: perl4lib@perl.org
> > Subject: Ignoring Diacritics accessing Fixed Field Data
> > 
> > Hi folks,
> > 
> > I'm trying to write a routine to construct a text file of 
> > OCLC search key from a group of existing records.  What I 
> > want is something like:
> > 
> > Brah,vasa/2003
> > 
> > That is 1st four letters of 100 + comma + 1st four letters of 
> > 245 + slash + date.
> > 
> > In principle I have this working with:
> > 
> > 
> > open( FOURS, ">4-4-date.txt" );
> > 
> > 
> > while ( my $r = $batch->next() ) {
> >   
> > my @fields = $r->field( '100' );
> > foreach my $field ( @fields ) {
> > my $ME = $field->subfield('a');
> > my $four100 = substr( $ME, 0, 4 );
> >   
> > print FOURS "$four100";
> > } 
> > 
> > my @fields = $r->field( '245' );
> > foreach my $field ( @fields ) {
> > my $TITLE = $field->subfield('a');
> > my $four245 = substr( $TITLE, 0, 4 );
> > print FOURS ",$four245";
> > } 
> > 
> > my @fields = $r->field( '260' );
> > foreach my $field ( @fields ) {
> > my $PD = $field->subfield('c');
> > my $four260 = substr( $PD, 0, 4);
> > print FOURS "\\$four260\n";
> > } 
> > 
> > 
> > My result was something like:
> > 
> > Dave,Ayod\2003
> > Paòt,Kaâs\2002
> > Baks,Dasa\2003
> > ,Viâs\2002
> > 
> > Problem 1: As you can see, I don't really want the first four 
> > characters, I want the first four SEARCHABLE characters.  How 
> > can I tell MARC Record to give me the first four characters, 
> > excluding diacritics?
> > 
> > Problem 2:  In these examples 260 $c works OK, but I could 
> > get a cleaner result by accessing the date from the fixed 
> > field (008 07-10).  How would I do that?  I was looking in 
> > the tutorial, but couldn't seem to find anything that seemed 
> > to help.  If I'm missing something there please point it up.
> > 
> >  Thanks in advance to anyone who can help.
> > 
> >  
> > JJ
> > 
> > 
> > 
> > **Views expressed by the author do not necessarily represent 
> > those of the Queens Library.**
> > 
> > Jane Jacobs
> > Asst. Coord., Catalog Division
> > Queens Borough Public Library
> > 89-11 Merrick Blvd.
> > Jamaica, NY 11432
> > 
> > tel.: (718) 990-0804
> > e-mail: [EMAIL PROTECTED]
> > FAX. (718) 990-8566 
> > 
> > 
> 


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Doran, Michael D
Hi Jane,

These answers assume that the data you are processing:
1) is encoded in the MARC-8 character set, and
2) consists of the MARC-8 default basic and extended Latin characters.

> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
>
> Problem 1: As you can see, I don't really want the first four 
> characters, I want the first four SEARCHABLE characters. How
> can I tell MARC Record to give me the first four characters, 
> excluding diacritics?

Assuming that you are asking how to strip out the MARC-8 combining diacritic 
characters, try inserting the substitution commands listed (as shown below) 
just prior to the substr commands:

> my $ME = $field->subfield('a');
  $ME =~ s/[\xE1-\xFE]//g;
> my $four100 = substr( $ME, 0, 4 );

> my $TITLE = $field->subfield('a');
  $TITLE =~ s/[\xE1-\xFE]//g;
> my $four245 = substr( $TITLE, 0, 4 );
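
Regarding Jane's second problem (the date in 008/07-10), which the reply
above does not address: the 008 is a control field, so MARC::Record exposes
its contents via data() rather than subfield().  A hedged, self-contained
sketch (the sample 008 string is invented for illustration):

```perl
use strict;
use warnings;
use MARC::Record;
use MARC::Field;

# Build a record with an invented 008 so the sketch is self-contained;
# in Jane's script, $r would come from $batch->next().
my $r = MARC::Record->new();
$r->append_fields(
    MARC::Field->new('008', '050111s2003    nyu           000 0 eng d')
);

# Control fields (tags below 010) have no subfields; use data().
my $f008  = $r->field('008');
my $date1 = $f008 ? substr($f008->data(), 7, 4) : '';   # bytes 07-10 = Date 1
print "$date1\n";   # 2003
```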

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Jacobs, Jane W [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, January 11, 2005 12:30 PM
> To: perl4lib@perl.org
> Subject: Ignoring Diacritics accessing Fixed Field Data
> 
> Hi folks,
> 
> I'm trying to write a routine to construct a text file of 
> OCLC search key from a group of existing records.  What I 
> want is something like:
> 
> Brah,vasa/2003
> 
> That is 1st four letters of 100 + comma + 1st four letters of 
> 245 + slash + date.
> 
> In principle I have this working with:
> 
> 
> open( FOURS, ">4-4-date.txt" );
> 
> 
> while ( my $r = $batch->next() ) {
>   
> my @fields = $r->field( '100' );
> foreach my $field ( @fields ) {
> my $ME = $field->subfield('a');
> my $four100 = substr( $ME, 0, 4 );
>   
> print FOURS "$four100";
> } 
> 
> my @fields = $r->field( '245' );
> foreach my $field ( @fields ) {
> my $TITLE = $field->subfield('a');
> my $four245 = substr( $TITLE, 0, 4 );
> print FOURS ",$four245";
> } 
> 
> my @fields = $r->field( '260' );
> foreach my $field ( @fields ) {
> my $PD = $field->subfield('c');
> my $four260 = substr( $PD, 0, 4);
> print FOURS "\\$four260\n";
> } 
> 
> 
> My result was something like:
> 
> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
> 
> Problem 1: As you can see, I don't really want the first four 
> characters, I want the first four SEARCHABLE characters.  How 
> can I tell MARC Record to give me the first four characters, 
> excluding diacritics?
> 
> Problem 2:  In these examples 260 $c works OK, but I could 
> get a cleaner result by accessing the date from the fixed 
> field (008 07-10).  How would I do that?  I was looking in 
> the tutorial, but couldn't seem to find anything that seemed 
> to help.  If I'm missing something there please point it up.
> 
>  Thanks in advance to anyone who can help.
> 
>  
> JJ
> 
> 
> 
> **Views expressed by the author do not necessarily represent 
> those of the Queens Library.**
> 
> Jane Jacobs
> Asst. Coord., Catalog Division
> Queens Borough Public Library
> 89-11 Merrick Blvd.
> Jamaica, NY 11432
> 
> tel.: (718) 990-0804
> e-mail: [EMAIL PROTECTED]
> FAX. (718) 990-8566 
> 
> 


RE: Documentation_about_'Unix_for_librarians'

2005-01-09 Thread Doran, Michael D
Hi Carlos,

> I am writing you for the following: the next month I'll be giving a
> training course called "UNIX for librarians". ... Sadly there's no
> material available in Spanish about this topic.

I'm guessing that you won't find much material (in any language) on the topic 
of "UNIX for librarians".  I'm basing my assumption on the fact that using, or 
administering, UNIX is pretty much the same regardless of the type of 
institution.  Yes, libraries *do* have specialized ILS applications, but that 
is a layer separate from UNIX.  The analogy I would use is that of a library 
with a fleet of bookmobiles.  Although bookmobiles have a library-specific 
functionality, their diesel engines are the same as ones found in other buses 
or trucks and we wouldn't expect the maintenance mechanics to consult a "diesel 
mechanics for libraries" manual. 

So if you broaden your search for more general UNIX manuals and materials in 
Spanish, you are likely to have better luck.  A fair amount of vendor 
documentation on the web can be found in Spanish and there should be a fair 
selection of UNIX books in Spanish [1].  You can then take that material and 
gear your course towards the skill level and interests of your audience, using 
*examples* that are relevant to a particular ILS or library situation.

> For that reason I'd ask you if you know a website (or PDF files) where 
> I can get documentation about training courses on "UNIX and librarians".

I have found that for many libraries, their ILS is the *only* application that 
runs on a UNIX platform and because of that, many systems librarians wind up 
becoming a de facto UNIX systems administrator without any previous experience 
in that realm.  If that describes your audience, you may want to take a look at 
a presentation I gave that did not cover UNIX itself, but some of the other 
critical things that a new UNIX system administrator needs to know [2].  
Another example is a presentation on shell scripting that focused on tasks 
relevant to a particular ILS [3].  

Good luck on your training course!

-- Michael

[1] Vendor docs example: Documentación de los productos Sun (on docs.sun.com)
http://docs.sun.com/app/docs?l=es
Book examples:
UNIX/LINUX : iniciación y referencia / Miguel Catalina Gallardo y Alfredo 
Catalina Gallego. 
Madrid : Osborne McGraw Hill, 1999.  
UNIX práctico / Grace Todino ; traducción, Raúl Bautista Gutiérrez
Prentice Hall, 1995

[2] Unix Sysadmin 101: What Newbies Need to Know, But Nobody Tells Them
http://rocky.uta.edu/doran/scvugm2001/

[3] Using Scripts to Automate Voyager Tasks
http://rocky.uta.edu/doran/vugm2000/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Carlos Vílchez Román [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, January 08, 2005 8:51 AM
> To: [EMAIL PROTECTED]; perl4lib@perl.org
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Documentation_about_'Unix_for_librarians'
> 
> Hi everybody,
>  
> Sorry for the cross-posting
>  
> My name is Carlos Vilchez-Roman, head of the Automation Office 
> of the Universidad Nacional Mayor de San Marcos (UNMSM) Library 
> university, located in Lima, Peru.
>  
> I am writing you for the following: the next month I'll be giving a
> training course called "UNIX for librarians". I think this course will
> be very helpful because our ILS is running on a UNIX Tru64 box.
>  
> Sadly there's no material available in Spanish about this topic. I have
> only found a book titled 'UNIX and libraries' and an article published in
> Computers in Libraries, 16 (10), 34-36. In Amazon.com the book is 
> out of print (I only work with Amazon. They give me a guarantee).
>  
> For that reason I'd ask you if you know a website (or PDF files) where 
> I can get documentation about training courses on "UNIX and librarians".
> 
> Any information would be appreciated.
>  
> Thanks for your time.
>  
> Cordially yours,
>  
> Carlos Vílchez-Román
> Head of Automation Office
> Library university - UNMSM
> 


RE: MARC::Record and UTF-8

2005-01-07 Thread Doran, Michael D
> ...the ILS can be upgraded to a new version and
> people can start using Unicode, not only for Western
> European languages, but also for languages like Thai.

This is not really apropos to the discussion at hand, but since Thai was
mentioned I thought I would contribute my two cents on an issue that
perhaps not everyone is aware of...  

Although the ILS itself will be able to accommodate the full Unicode
repertoire, according to the MARC 21 specifications, the MARC 21
UCS/Unicode environment is simply the MARC-8 character repertoire
translated into the Unicode equivalent code points.  One of the things
that means is that characters in vernacular alphabets such as Thai are
*not* valid characters in MARC 21 records.  The rationale behind this
approach to implementing Unicode is based on the ability to translate
MARC data back and forth (i.e. "round trip") between the MARC-8 and
Unicode character sets [1].  Supported alphabets (and/or ideographs) are
Latin, Greek, Cyrillic, Arabic, Hebrew, and East Asian (CJK) [2].

I think our ILS is fairly typical as to implementation of Unicode [3].
There is nothing stopping you from creating, storing, and displaying
MARC records in Thai (or any other vernacular language) -- other than an
institutional decision to adhere to the MARC 21 standard.  Of course,
the ILS software clients also have validation rules that can be turned
on (or off, since not everyone uses MARC 21).

At some point, when a large enough portion of the library world has
upgraded their systems to MARC Unicode, round tripping will no longer be
a constraint and the MARC 21 standard will be revised to include the
full range of Unicode characters, but that is liable to be a while.

[1] Coded Character Sets > A Technical Primer for Librarians > MARC
Unicode 
http://rocky.uta.edu/doran/charsets/unicode.html

[2] An exception is the Unified Canadian Aboriginal Syllabic character
set, which is not defined in MARC-8 but is permitted in the MARC
UCS/Unicode environment. 

[3] Endeavor's Voyager - and we are scheduled for the Unicode version
upgrade on Monday

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 


RE: inserting diacrtics

2005-01-05 Thread Doran, Michael D
> it might be a little clearer if you call a grave a grave...

Of course.  Thanks Alan.  This is apparently not my day for clarity. 

-- Michael

> -Original Message-
> From: Manifold, Alan B. [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 05, 2005 1:46 PM
> To: perl4lib@perl.org
> Subject: RE: inserting diacrtics
> 
> It's compact, but it might be a little clearer if you call a
> grave a grave...
> 
> $grave = chr(0xE1);
> $field = MARC::Field->new( '710', '2', '', 
>   a => 'Biblioth'.$grave.'eque nationale de france.' );
> 
> Alan Manifold
> Systems Implementation Manager
> Purdue University Libraries ITD
> 504 West State Street
> West Lafayette, Indiana  47907-2058
> (765) 494-2884   FAX:  494-0156
> [EMAIL PROTECTED]
> http://www.mashiyyat.net/ABM.html 
> 
> > -Original Message-
> > From: Ed Summers [mailto:[EMAIL PROTECTED] 
> > Sent: Wednesday, January 05, 2005 2:42 PM
> > To: perl4lib@perl.org
> > Subject: Re: inserting diacrtics
> > 
> > On Wed, Jan 05, 2005 at 01:22:54PM -0600, Doran, Michael D wrote:
> > >$acute = chr(0xE1);
> > >$field = MARC::Field->new( '710', '2', '', 
> > >   a => 'Biblioth'.$acute.'eque nationale de france.' );
> > 
> > Much more compact, thanks Michael.
> > 
> > //Ed
> > 
> 


RE: inserting diacrtics

2005-01-05 Thread Doran, Michael D
One more time (in an attempt to better clarify my own posting --
sorry)...

> I think all Jackie needs to do is add the combining grave character.

The combining grave, in G1, would be hex 'E1', so (theoretically) this
should work:

   $acute = chr(0xE1);
   $field = MARC::Field->new( '710', '2', '', 
a => 'Biblioth'.$acute.'eque nationale de france.' );

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 05, 2005 1:04 PM
> To: perl4lib@perl.org
> Subject: RE: inserting diacrtics
> 
> Oops, that's the second time today I've inadvertently sent an email
> message (some keystroke combination from vi I think).
> 
> The reference was:
> 
> [1] The exception to this is if you had previously escaped to an
> alternate character set (such as Arabic or Greek) and 
> desired to return
> to Extended Latin as either G0 or G1.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/ 
> 
> > -Original Message-
> > From: Doran, Michael D 
> > Sent: Wednesday, January 05, 2005 1:03 PM
> > To: perl4lib@perl.org
> > Subject: RE: inserting diacrtics
> > 
> > > You need to escape to ExtendedLatin, add the combining 
> > acute, escape back to 
> > > BasicLatin, and then put the 'e'. Or in code:
> > 
> > Extended Latin (as G1) is part of the MARC-8 default 
> > character set and shouldn't require any escape sequences [1]. 
> >  I think all Jackie needs to do is add the combining grave 
> character.
> > 
> > -- Michael
> > 
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 cell
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/ 
> > 
> > > -Original Message-
> > > From: Ed Summers [mailto:[EMAIL PROTECTED] 
> > > Sent: Wednesday, January 05, 2005 11:30 AM
> > > To: perl4lib@perl.org
> > > Subject: Re: inserting diacrtics
> > > 
> > > On Tue, Jan 04, 2005 at 02:20:55PM -0500, Jackie Shieh wrote:
> > > > MARC::Field->new('710','2','', a=>'Bibliotheque nationale 
> > > de france.')
> > > >^
> > > 
> > > I'm assuming that you want a combining acute on the e, and 
> > that you want to 
> > > encode with MARC-8 since UTF-8 in MARC data hasn't hit the 
> > mainstream yet...
> > > even though I've heard OCLC is converting all their MARC 
> > data to UTF-8.
> > > 
> > > This is kind of a pain, but here's how you could do it. 
> You need to
> > > escape to ExtendedLatin, add the combining acute, escape back to 
> > > BasicLatin, and then put the 'e'. Or in code:
> > > 
> > > # building blocks for escaping G0 to ExtendedLatin and
> > > # back to BasicLatin, details at: 
> > > # http://www.loc.gov/marc/specifications/speccharmarc8.html
> > > $escapeToExtendedLatin = 
> > chr(0x1B).chr(0x28).chr(0x21).chr(0x45);
> > > $escapeToBasicLatin = chr(0x1B).chr(0x28).chr(0x52); 
> > > 
> > > # acute in the G0 register is chr(0x62) from the table at:
> > > # http://lcweb2.loc.gov/cocoon/codetables/45.html
> > > $acute = $escapeToExtendedLatin.chr(0x62).$escapeToBasicLatin;
> > > 
> > > # now make the field
> > > $field = MARC::Field->new( '710', '2', '', 
> > > a => 'Biblioth'.$acute.'eque nationale de france.' );
> > > 
> > > This is long because I wanted to explain what was going 
> > on...I imagine
> > > it could be compressed nicely...maybe
> > > 
> > > Please give this a try on one record and make sure your 
> > > catalog displays
> > > it properly before doing anything drastic to your data. 
> > Like I needed
> > > to mention that :-)
> > > 
> > > //Ed
> > > 
> > 
> 


RE: inserting diacrtics

2005-01-05 Thread Doran, Michael D
Oops, that's the second time today I've inadvertently sent an email
message (some keystroke combination from vi I think).

The reference was:

[1] The exception to this is if you had previously escaped to an
alternate character set (such as Arabic or Greek) and desired to return
to Extended Latin as either G0 or G1.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Doran, Michael D 
> Sent: Wednesday, January 05, 2005 1:03 PM
> To: perl4lib@perl.org
> Subject: RE: inserting diacrtics
> 
> > You need to escape to ExtendedLatin, add the combining 
> acute, escape back to 
> > BasicLatin, and then put the 'e'. Or in code:
> 
> Extended Latin (as G1) is part of the MARC-8 default 
> character set and shouldn't require any escape sequences [1]. 
>  I think all Jackie needs to do is add the combining grave character.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/ 
> 
> > -Original Message-
> > From: Ed Summers [mailto:[EMAIL PROTECTED] 
> > Sent: Wednesday, January 05, 2005 11:30 AM
> > To: perl4lib@perl.org
> > Subject: Re: inserting diacrtics
> > 
> > On Tue, Jan 04, 2005 at 02:20:55PM -0500, Jackie Shieh wrote:
> > > MARC::Field->new('710','2','', a=>'Bibliotheque nationale 
> > de france.')
> > >^
> > 
> > I'm assuming that you want a combining acute on the e, and 
> that you want to 
> > encode with MARC-8 since UTF-8 in MARC data hasn't hit the 
> mainstream yet...
> > even though I've heard OCLC is converting all their MARC 
> data to UTF-8.
> > 
> > This is kind of a pain, but here's how you could do it. You need to
> > escape to ExtendedLatin, add the combining acute, escape back to 
> > BasicLatin, and then put the 'e'. Or in code:
> > 
> > # building blocks for escaping G0 to ExtendedLatin and
> > # back to BasicLatin, details at: 
> > # http://www.loc.gov/marc/specifications/speccharmarc8.html
> > $escapeToExtendedLatin = 
> chr(0x1B).chr(0x28).chr(0x21).chr(0x45);
> > $escapeToBasicLatin = chr(0x1B).chr(0x28).chr(0x52); 
> > 
> > # acute in the G0 register is chr(0x62) from the table at:
> > # http://lcweb2.loc.gov/cocoon/codetables/45.html
> > $acute = $escapeToExtendedLatin.chr(0x62).$escapeToBasicLatin;
> > 
> > # now make the field
> > $field = MARC::Field->new( '710', '2', '', 
> > a => 'Biblioth'.$acute.'eque nationale de france.' );
> > 
> > This is long because I wanted to explain what was going 
> on...I imagine
> > it could be compressed nicely...maybe
> > 
> > Please give this a try on one record and make sure your 
> > catalog displays
> > it properly before doing anything drastic to your data. 
> Like I needed
> > to mention that :-)
> > 
> > //Ed
> > 
> 


RE: inserting diacrtics

2005-01-05 Thread Doran, Michael D
> You need to escape to ExtendedLatin, add the combining acute, escape
back to 
> BasicLatin, and then put the 'e'. Or in code:

Extended Latin (as G1) is part of the MARC-8 default character set and
shouldn't require any escape sequences [1].  I think all Jackie needs to
do is add the combining grave character.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Ed Summers [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 05, 2005 11:30 AM
> To: perl4lib@perl.org
> Subject: Re: inserting diacrtics
> 
> On Tue, Jan 04, 2005 at 02:20:55PM -0500, Jackie Shieh wrote:
> > MARC::Field->new('710','2','', a=>'Bibliotheque nationale 
> de france.')
> >^
> 
> I'm assuming that you want a combining acute on the e, and that you
want to 
> encode with MARC-8 since UTF-8 in MARC data hasn't hit the mainstream
yet...
> even though I've heard OCLC is converting all their MARC data to UTF-8.
> 
> This is kind of a pain, but here's how you could do it. You need to
> escape to ExtendedLatin, add the combining acute, escape back to 
> BasicLatin, and then put the 'e'. Or in code:
> 
> # building blocks for escaping G0 to ExtendedLatin and
> # back to BasicLatin, details at: 
> # http://www.loc.gov/marc/specifications/speccharmarc8.html
> $escapeToExtendedLatin = chr(0x1B).chr(0x28).chr(0x21).chr(0x45);
> $escapeToBasicLatin = chr(0x1B).chr(0x28).chr(0x52); 
> 
> # acute in the G0 register is chr(0x62) from the table at:
> # http://lcweb2.loc.gov/cocoon/codetables/45.html
> $acute = $escapeToExtendedLatin.chr(0x62).$escapeToBasicLatin;
> 
> # now make the field
> $field = MARC::Field->new( '710', '2', '', 
> a => 'Biblioth'.$acute.'eque nationale de france.' );
> 
> This is long because I wanted to explain what was going on...I imagine
> it could be compressed nicely...maybe
> 
> Please give this a try on one record and make sure your 
> catalog displays
> it properly before doing anything drastic to your data. Like I needed
> to mention that :-)
> 
> //Ed
> 


RE: Character sets - kind of solved?

2004-12-06 Thread Doran, Michael D
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.

The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
09 in the record leader indicates the character encoding: a "blank" for
MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
this and then only apply the transformation when required.  Note: I
believe the leader itself is limited to characters in the ASCII range,
so you wouldn't have to know the encoding of the record prior to parsing
the leader.
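A sketch of that leader test in plain Perl (the leader string here is illustrative; a real record's leader would come from MARC::Record's $record->leader()):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Leader position 09 signals the character encoding:
# ' ' (blank) for MARC-8, 'a' for UCS/Unicode.
# This sample leader is hypothetical.
my $leader = "00714cam a2200205 a 4500";
my $encoding = substr( $leader, 9, 1 ) eq 'a' ? 'UTF-8' : 'MARC-8';
print "$encoding\n";   # UTF-8
```

Since the leader is restricted to the ASCII range, this test is safe to run before you know the encoding of the rest of the record.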

> What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.

The original record from John Hammer did not contain UTF-8, it contained
MARC-8.  I believe that the fact that the combining MARC-8 characters
were replaced by a generic replacement character only indicates that the
app he was using to view the data (post processing by MARC::Record) was
using a character set in which hex E5 and F2, encoded as single octets,
were not valid characters in that app's character set.  That app's
character set was apparently Unicode (UTF-8) and so E5 and F2 were
replaced by U+FFFD.  That's the long way of saying that the patch should
work fine in his case.  :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Mike Rylander [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, December 04, 2004 1:31 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Character sets - kind of solved?
> 
> I've run into some record encoding issues myself, though not the
> problem from below.  In any case, this got me thinking about the
> current state of MARC::File::XML, specifically that it could not
> handle MARC8 encoded records.
> 
> I submitted a patch a while back to hack around this, but that just
> lets us get the MARC records into well formed XML.  Basically, it just
> lets you set the encoding on the XML to something that has embedded
> 8-bit characters, like ISO-8859-1, aka LATIN1.
> 
> But that is far from optimal, since the data is being misinterpreted. 
> So I took a look at using MARC::Charset inside MARC::File::XML, and
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.
> 
> It's attached below, if anyone would be so kind as to test it.  If all
> goes well we sould be able to actually use MARC::File::XML in
> production.  If you do decide to test it, it requires MARC::Charset.
> 
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.  What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.
> 
> The attached tarball contains a patched XML.pm and SAX.pm.  Replace
> your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
> should be good to go.  I've also included the scripts I used to test
> and one of my old MARC8 encoded records.  http://redlightgreen.com
> confirms that the illustrators name is properly transcoded.
> 
> On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> > First off, Ashley's suggestion that the original encoding was likely
> > MARC-8 is correct.  The author's Arabic name, 
> transliterated into the
> > Latin alphabet, should be "Bis{latin small letter a with 
> macron}{latin
> > small letter t with dot below}{latin small letter i with macron},
> > Mu{latin small letter h with dot below}ammad."  I am basing this on
> > MARC-21 records that can be seen in UCLA's online catalog 
> [1].  So, if
> > the above name is encoded in MARC-8 then the underlying 
> code would match
> > John's original code points [2]:
> >  > >> Looking at the name with a hex editor, it gives, with 
> hex values
> > in curly brackets,
> >  > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> > 
> > Then the question becomes: "What happened?"
> > 
> >  > >> the name now appears as
> >  > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > The fact that one byte turned into three bytes, suggests 
> UTF-8 encoding.
> > And the fact that *both* MARC-8 combining characters (i.e. &

RE: Character sets - kind of solved?

2004-12-03 Thread Doran, Michael D
First off, Ashley's suggestion that the original encoding was likely
MARC-8 is correct.  The author's Arabic name, transliterated into the
Latin alphabet, should be "Bis{latin small letter a with macron}{latin
small letter t with dot below}{latin small letter i with macron},
Mu{latin small letter h with dot below}ammad."  I am basing this on
MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
the above name is encoded in MARC-8 then the underlying code would match
John's original code points [2]:
 > >> Looking at the name with a hex editor, it gives, with hex values
in curly brackets,
 > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."

Then the question becomes: "What happened?" 

 > >> the name now appears as
 > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."

The fact that one byte turned into three bytes suggests UTF-8 encoding.
And the fact that *both* MARC-8 combining characters (i.e. "e5" and
"f2") now appear as the *same* combination of characters (i.e. "ef bf
bd") suggests that it was not an encoding translation from one coded
character set to the equivalent codepoint in another character set.  If
we assume UTF-8 and convert UTF-8 "ef bf bd" to its Unicode code point,
we get U+FFFD [3].  If we look up U+FFFD we see that it is the
"REPLACEMENT CHARACTER" [4].  

Since MARC::Record (obviously) wouldn't object to the original MARC-8
character encoding, I'm guessing that sometime *after* processing the
record with MARC::Record that it was either moved to, or viewed in, a
client/platform/environment that was not MARC-8 savvy (which is pretty
much everything) and that the client/platform/environment, not
recognizing the hex e5 and f2 as valid character encodings, replaced
them with the generic replacement character for that
client/platform/environment.

So I'm thinking that we can rule out MARC::Record and look closer at
what happened to the data subsequent to MARC::Record processing.  That's
my guess anyway, and I'm sticking with it until I hear a better story.
;-)

[1] UCLA's Voyager ILMS has been upgraded to a Unicode version, and is
able to display the characters accurately.  My assumption is that the
author in the links below is the one in question.
See for example (looking at the title field, rather than the underlined
author/name field):
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603048
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=603049
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=5053287
 http://catalog.library.ucla.edu/cgi-bin/Pwebrecon.cgi?bbid=4490052

[2] In MARC-8, combining diacritic characters precede the base
character, and as Ashley pointed out, E5 is "macron" and F2 is "dot
below."

[3] hex "ef bf bd" = binary "1110 1011 1001"  
A three-octet UTF-8 character has the format of 1110 10xx
10xx, with the "x" positions being the significant values in
determining the Unicode code point.  When we concatenate those x
position values from the above binary code, we get 1101,
which converted to hex, is FFFD
 
[4] See:
http://rocky.uta.edu/doran/urdu/search.cgi?char_set=unicode&char_type=hex&char_value=fffd
(or just go to http://rocky.uta.edu/doran/urdu/search.cgi and plug
in fffd)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -Original Message-
> From: Ashley Sanders [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, November 24, 2004 2:23 AM
> Cc: [EMAIL PROTECTED]
> Subject: Re: Character sets
> 
> Ed Summers wrote:
> > On Tue, Nov 23, 2004 at 04:10:05PM -0600, John Hammer wrote:
> > 
> >>I have a character problem that I hope someone can help me with. In
> >>a MARC record I am modifying using MARC::Record, one of the names
> >>contains letters with diacritics. Looking at the name with a hex
editor,
> >>it gives, with hex values in curly brackets,"Bis{e5}a{f2}t{e5}i,
> >>Mu{f2}hammad." After running through MARC::Record, the name now
appears
> >>as "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > 
> > That's pretty odd. Any chance you could send me the MARC record? At
this
> > time MARC::Record does not play nicely with Unicode (UTF8). 
> > 
> > http://rt.cpan.org/NoAuth/Bug.html?id=3707
> 
> It is possible they are MARC-8 characters rather than utf-8. In MARC-8
> E5 is "macron" and F2 is "dot below." Is MARC::Record trying to treat
> them as Unicode when in fact they are MARC-8?
> 
> Ashley.
> 
> -- 
> Ashley Sanders [EMAIL PROTECTED]
> Copac http://copac.ac.uk -- A MIMAS service funded by JISC
>