Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
Hi folks,

I'm trying to write a routine to construct a text file of OCLC search keys from 
a group of existing records.  What I want is something like:

Brah,vasa/2003

That is 1st four letters of 100 + comma + 1st four letters of 245 + slash + 
date.

In principle I have this working with:


open( FOURS, ">4-4-date.txt" ) or die "Cannot open 4-4-date.txt: $!";

while ( my $r = $batch->next() ) {

    # First four characters of the main entry (100 $a)
    foreach my $field ( $r->field('100') ) {
        my $ME      = $field->subfield('a');
        my $four100 = substr( $ME, 0, 4 );
        print FOURS "$four100";
    }

    # Comma plus the first four characters of the title (245 $a)
    foreach my $field ( $r->field('245') ) {
        my $TITLE   = $field->subfield('a');
        my $four245 = substr( $TITLE, 0, 4 );
        print FOURS ",$four245";
    }

    # Backslash plus the first four characters of the date (260 $c)
    foreach my $field ( $r->field('260') ) {
        my $PD      = $field->subfield('c');
        my $four260 = substr( $PD, 0, 4 );
        print FOURS "\\$four260\n";
    }
}

close(FOURS);


My result was something like:

Dave,Ayod\2003
Paòt,Kaâs\2002
Baks,Dasa\2003
,Viâs\2002

Problem 1: As you can see, I don't really want the first four characters, I 
want the first four SEARCHABLE characters.  How can I tell MARC Record to give 
me the first four characters, excluding diacritics?

Problem 2:  In these examples 260 $c works OK, but I could get a cleaner result 
by accessing the date from the fixed field (008 07-10).  How would I do that?  
I was looking in the tutorial, but couldn't seem to find anything that seemed 
to help.  If I'm missing something there please point it out.

 Thanks in advance to anyone who can help.

 
JJ



**Views expressed by the author do not necessarily represent those of the 
Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432

tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566 



Re: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Ed Summers
Hi Jane:

On Tue, Jan 11, 2005 at 01:29:55PM -0500, Jacobs, Jane W wrote:
> My result was something like:
> 
> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
> 
> Problem 1: As you can see, I don't really want the first four characters, I 
> want the first four SEARCHABLE characters.  How can I tell MARC Record to 
> give me the first four characters, excluding diacritics?

What output would you have rather seen?

Dave,Ayod\2003
Paot, Kaas\2002
Baks,Dasa\2003
,Vias\2002

?

> Problem 2:  In these examples 260 $c works OK, but I could get a cleaner 
> result by accessing the date from the fixed field (008 07-10).  How would I 
> do that?  I was looking in the tutorial, but couldn't seem to find anything 
> that seemed to help.  If I'm missing something there please point it up.

You probably want to use the data() method on the MARC::Field object for
the '008' field, in combination with substr() to extract a substring
based on an offset and a length.

my $year;
my $f008 = $record->field('008');
if ( $f008 ) { $year = substr( $f008->data(), 7, 4 ); }

I only added the if statement since it may not be true that all your
records have an 008 field...
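
A minimal sketch of the same idea as a reusable helper, written against a plain string so it runs without MARC::Record installed. The helper name year_from_008 and the sample 008 data are made up for illustration; in a real script the string would come from $record->field('008')->data(). The digit check is an added assumption, since Date 1 can hold fill characters such as "19uu".

```perl
use strict;
use warnings;

# Return Date 1 (008 positions 07-10) as a four-digit year, or undef when
# the data is missing, too short, or not a plain year.
# year_from_008 is a hypothetical helper name, not part of MARC::Record.
sub year_from_008 {
    my ($data) = @_;
    return undef unless defined $data && length($data) >= 11;
    my $year = substr( $data, 7, 4 );    # offset 7, length 4
    return $year =~ /^\d{4}$/ ? $year : undef;
}

# Invented 008 data, shaped like what $f008->data() would return.
my $sample = '030115s2003    ii            000 0 hin d';
my $year   = year_from_008($sample);
print defined $year ? "$year\n" : "no usable date\n";
```

Returning undef rather than a partial string lets the caller decide whether to fall back to 260 $c.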

//Ed


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
> Problem 1: As you can see, I don't really want the first four 
> characters, I want the first four SEARCHABLE characters.  How can I 
> tell MARC Record to give me the first four characters, excluding 
> diacritics?

> What output would you have rather seen?
>
> Dave,Ayod\2003
> Paot, Kaas\2002
> Baks,Dasa\2003
> ,Vias\2002
>
> ?

I changed the order to put the problem children at the bottom. Thus the 
correct output would be:

Baks,Dasa\2003
Dave,Ayod\2003
Pata,Kasm\2002
,Vias\2002

For Pata,Kasm\2002:
  *  actual text is: 100 PatÌanÌiÌ, RaÌjana. 245 KasÌmakasÌa
  ** raw MARC reads: 100 PaÃtaÃnÃi, RÃajana. 245 KaÃsmakaÃsa

For ,Vias\2002:
  *  actual text is: 245 VisÌvaprasiddha vaÌrtaÌo
  ** raw MARC reads: 245 ViÃsvaprasiddha vÃartÃao

> You probably want to use the data() method on the MARC::Field object for
> the '008' field, in combination with substr() to extract a substring based
> on an offset and a length.

Worked brilliantly; Thanks!

JJ




RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Doran, Michael D
Hi Jane,

These answers assume that the data you are processing:
1) is encoded in the MARC-8 character set, and
2) consists of the MARC-8 default basic and extended Latin characters.

> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
>
> Problem 1: As you can see, I don't really want the first four 
> characters, I want the first four SEARCHABLE characters. How
> can I tell MARC Record to give me the first four characters, 
> excluding diacritics?

Assuming that you are asking how to strip out the MARC-8 combining diacritic 
characters, try inserting the substitution commands shown below just prior 
to the substr commands:

> my $ME = $field->subfield('a');
  $ME =~ s/[\xE1-\xFE]//g;
> my $four100 = substr( $ME, 0, 4 );

> my $TITLE = $field->subfield('a');
  $TITLE =~ s/[\xE1-\xFE]//g;
> my $four245 = substr( $TITLE, 0, 4 );

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 



RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Bryan Baldus
On Tuesday, January 11, 2005 2:13 PM, Michael Doran wrote:

> Assuming that you are asking how to strip out the MARC-8 combining
> diacritic characters, try inserting the substitution commands listed (as
> shown below) just prior to the substr commands:
>
> > my $ME = $field->subfield('a');
>   $ME =~ s/[\xE1-\xFE]//g;
> > my $four100 = substr( $ME, 0, 4 );
>
> > my $TITLE = $field->subfield('a');
>   $TITLE =~ s/[\xE1-\xFE]//g;
> > my $four245 = substr( $TITLE, 0, 4 );


You might want to change the procedure for getting the title to skip
articles (untested, may need corrections):

# Given $record being the MARC::Record object, and exactly one 245 field
# being present, as required by MARC21 rules.
my $field245  = $record->field('245');
my $titleind2 = $field245->indicator(2);
my $TITLE     = $field245->subfield('a');
$TITLE =~ s/[\xE1-\xFE]//g;
# The check should be unnecessary, since the 245 2nd indicator should
# always be a digit, but just in case.
my $four245;
$four245 = substr( $TITLE, $titleind2, 4 ) if $titleind2 =~ /^[0-9]$/;
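
The article-skipping idea can be sketched on plain strings as follows. The helper name title_key and the sample titles are invented for illustration. Two choices here are assumptions rather than anything established in this thread: a non-digit indicator falls back to an offset of 0, and the offset is applied before the diacritics are stripped, on the reading that the nonfiling count includes diacritic bytes belonging to the article.

```perl
use strict;
use warnings;

# Build the four-character title part of the key, skipping the number of
# nonfiling characters given by the 245 second indicator.
# title_key is a hypothetical helper name, not part of MARC::Record.
sub title_key {
    my ( $title, $ind2 ) = @_;
    return '' unless defined $title;

    # Fall back to no skipping when the indicator is blank or not a digit.
    my $skip = ( defined $ind2 && $ind2 =~ /^[0-9]$/ ) ? $ind2 : 0;
    $skip = 0 if $skip > length($title);

    # Skip the article first, then drop MARC-8 combining diacritics.
    my $rest = substr( $title, $skip );
    $rest =~ s/[\xE0-\xFE]//g;
    return substr( $rest, 0, 4 );
}

print title_key( 'The hobbit', '4' ), "\n";    # skips "The "
print title_key( 'Hobbit',     ' ' ), "\n";    # blank indicator, no skip
```

With a real record, the two arguments would be $field245->subfield('a') and $field245->indicator(2).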

Hope this helps,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija
 


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Doran, Michael D
A bulletin from the "haste makes waste" department...

>   $ME =~ s/[\xE1-\xFE]//g;
>   $TITLE =~ s/[\xE1-\xFE]//g;

Ooops, that should be "E0" instead of "E1" as the first hex value in the 
substitutions:
   $ME =~ s/[\xE0-\xFE]//g;
   $TITLE =~ s/[\xE0-\xFE]//g;
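
For anyone who wants to try the corrected range outside a MARC::Record script, here is a small self-contained sketch. The helper name first_searchable is invented, and the byte strings are only an approximation of Jane's two fields: \xF2 and \xE2 match the characters visible in her output, while the \xE5 bytes are a guess. As above, it assumes MARC-8 data limited to the default basic and extended Latin sets.

```perl
use strict;
use warnings;

# Strip MARC-8 combining diacritics (bytes 0xE0-0xFE) and return the first
# $count remaining characters.
# first_searchable is a hypothetical helper name, not a MARC::Record method.
sub first_searchable {
    my ( $text, $count ) = @_;
    return '' unless defined $text;
    ( my $plain = $text ) =~ s/[\xE0-\xFE]//g;
    return substr( $plain, 0, $count );
}

# Illustrative MARC-8 byte strings; each diacritic precedes its base letter.
my $name  = "Pa\xF2ta\xF2n\xE5i, R\xE5ajana.";
my $title = "Ka\xE2smaka\xE2sa";

print first_searchable( $name, 4 ), ',', first_searchable( $title, 4 ), "\n";
```

The copy-then-substitute form leaves the original subfield value untouched, which matters if the same string is used elsewhere in the script.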

Sorry,

-- Michael



RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
That worked well!
Thanks!
JJ
