NACO Normalization and Text::Normalize

2003-08-25 Thread Brian Cassidy
Hi All,

I'm writing to get a feel for whether or not I should submit a module
that does NACO normalization
(http://lcweb.loc.gov/catdir/pcc/naco/normrule.html) to CPAN.

As part of a previous project I was importing MARC records into an RDBMS
structure. In order to facilitate better searching, it was suggested to
me that I do some normalization on my data and that NACO normalization
would be a good choice for guidelines. So, away I went and came back
with a normalize() sub which does the trick.
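For illustration, here is a minimal sketch of what such a normalize() sub might look like. It covers only a simplified subset of the LC rules (uppercasing, punctuation, and whitespace); the full rules also handle diacritics, subfield delimiters, and special characters, so treat this as an approximation rather than the actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified NACO-style normalization: uppercase, drop apostrophes and
# brackets, turn remaining punctuation into blanks, squeeze whitespace.
sub normalize {
    my $text = uc shift;           # fold to uppercase
    $text =~ s/['\[\]]//g;         # delete apostrophes and square brackets
    $text =~ s/[[:punct:]]/ /g;    # convert other punctuation to blanks
    $text =~ s/\s+/ /g;            # collapse runs of whitespace
    $text =~ s/^ | $//g;           # trim leading/trailing blank
    return $text;
}

print normalize("  O'Neill, Eugene -- Drama.  "), "\n";
```

Normalizing is idempotent here: running the output through normalize() again returns it unchanged, which is handy when you can't tell whether a string has already been cleaned.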

I now wonder if this code would have greater utility as a module on
CPAN. And if I do decide to upload it to CPAN, perhaps a base class
(Text::Normalize) should be created to which NACO normalization could be
added as a subclass.

Any advice would be greatly appreciated.

Thanks in advance,

-Brian Cassidy ( [EMAIL PROTECTED] )


http://www.gordano.com - Messaging for educators.


Re: NACO Normalization and Text::Normalize

2003-08-25 Thread Paul Hoffman
On Monday, August 25, 2003, at 03:29  PM, Brian Cassidy wrote:

I'm writing to get a feel for whether or not I should submit a module
that does NACO normalization
(http://lcweb.loc.gov/catdir/pcc/naco/normrule.html) to CPAN.
[...]
So, away I went and came back with a normalize() sub which does the
trick.
Fabulous!  (Disclaimer: I'd never heard of NACO normalization before, 
but it sounds like it could be useful -- for MARC bib records, too.)

I now wonder if this code would have greater utility as a module on
CPAN.
Yes, please!  (You're not BRICAS on cpan.org, are you?)

And if I do decide to upload it to CPAN, perhaps a base class
(Text::Normalize) should be created to which NACO normalization could
be added as a subclass.
I would recommend putting it in the MARC::* namespace, since it's 
specific to MARC records -- maybe MARC::Transform::NACO or some such.

A class hierarchy rooted at MARC::Transform might be useful, if (for 
example) people wanted to apply arbitrary transformations to a single 
record:

   my @records = ... some MARC::Record objects ... ;
   my @transforms = (
       MARC::Transform::Delete9xx->new,
       MARC::Transform::StripInitialArticles->new,
       some_other_transforms(),
   );
   foreach my $t (@transforms) {
       $t->transform($_) foreach @records;
   }
Thanks for your hard work.

Paul.

--
Paul Hoffman :: Taubman Medical Library :: Univ. of Michigan
[EMAIL PROTECTED] :: [EMAIL PROTECTED] :: http://www.nkuitse.com/


RE: NACO Normalization and Text::Normalize

2003-08-26 Thread Brian Cassidy
> -Original Message-
> 
> > I now wonder if this code would have greater utility as a module on
> > CPAN.
> 
> Yes, please!  (You're not BRICAS on cpan.org, are you?)

Yes, I am BRICAS on CPAN...is that a bad thing? :)

> I would recommend putting it in the MARC::* namespace, since it's
> specific to MARC records -- maybe MARC::Transform::NACO or some such.
> 
> A class hierarchy rooted at MARC::Transform might be useful, if (for
> example) people wanted to apply arbitrary transformations to a single
> record:
> 
> my @records = ... some MARC::Record objects ... ;
> my @transforms = (
>     MARC::Transform::Delete9xx->new,
>     MARC::Transform::StripInitialArticles->new,
>     some_other_transforms(),
> );
> foreach my $t (@transforms) {
>     $t->transform($_) foreach @records;
> }

The current behavior is to take a string in, normalize it, then output
it. There isn't necessarily a defined behavior for a whole MARC record.

Also, as far as "transforms" are concerned, the decode() method in
MARC::File::USMARC can take a filter sub as a second parameter. So, I'm
still not 100% sure it should be a MARC-specific module rather than a
general normalizing module.

Perhaps we need to explore exactly how a transform would interact with a
MARC::Record object if we wish to go in that direction.

-Brian Cassidy ( [EMAIL PROTECTED] )




Re: NACO Normalization and Text::Normalize

2003-08-26 Thread Ed Summers
Hi Brian: thanks for writing,

On Mon, Aug 25, 2003 at 04:29:37PM -0300, Brian Cassidy wrote:
> As part of a previous project I was importing MARC records into an RDBMS
> structure. In order to facilitate better searching, it was suggested to
> me that I do some normalization on my data and that NACO normalization
> would be a good choice for guidelines. So, away I went and came back
> with normalize() sub which does the trick.
> 
> I now wonder if this code would have greater utility as a module on
> CPAN. And if I do decide to upload it to CPAN, perhaps a base class
> (Text::Normalize) should be created to which NACO normalization could be
> added as a subclass.

I think this is a great idea. At first I was thinking that it would be nice to
be able to pass your normalize() function a MARC::Record object, which would
magically normalize all the relevant fields (like a good cataloger).  This 
could be a subclass MARC::Record::NACO which adds a new method normalize(),
or if Andy was willing could be added to the MARC::Record core.

However, the docs [1] seem to say that it is only possible to determine how a 
field should normalize in the context of the collection of records that it is a
part of...and that MARC::Record has no way of determining this, so perhaps 
this idea is not on target?

If you would like to contribute your NACO normalization function to CPAN
(as I definitely think you should), and my reading of the LC docs is
correct, then I would recommend you add a Text::NACO module.  The
Normalize part is a bit redundant because all the modules in Text do
some kind of normalization. The package could export a function
normalize() on demand, which you then pass a string, and get back the
NACO-normalized version. You could also add it to the Biblio namespace
as Biblio::NACO, or MARC::NACO, but that's really your call as the
module author :) The main thing is to get it up there somewhere.
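A sketch of that export-on-demand interface, with the package defined inline for illustration (the Text::NACO name and export list come from this thread, not a released distribution, and the normalization here is again a simplified subset of the LC rules):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical Text::NACO package, defined inline for the example.
package Text::NACO;
use Exporter 'import';
our @EXPORT_OK = qw( normalize );   # exported only on request

sub normalize {
    my $text = uc shift;            # fold to uppercase
    $text =~ s/[[:punct:]]/ /g;     # punctuation to blanks
    $text =~ s/\s+/ /g;             # collapse whitespace
    $text =~ s/^ | $//g;            # trim
    return $text;
}

package main;
Text::NACO->import( 'normalize' ); # i.e. use Text::NACO qw( normalize );

print normalize("Dvorak, Antonin, 1841-1904."), "\n";
```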

Please post to the list if you decide to upload. I'd like to add a section to
the tutorial, and to the perl4lib.perl.org website!

//Ed

[1] http://lcweb.loc.gov/catdir/pcc/naco/normrule.html


RE: NACO Normalization and Text::Normalize

2003-08-27 Thread Brian Cassidy
Hi Ed,

> I think this is a great idea. At first I was thinking that it would be
> nice to be able to pass your normalize() function a MARC::Record
> object, which would magically normalize all the relevant fields (like
> a good cataloger).  This could be a subclass MARC::Record::NACO which
> adds a new method normalize(), or if Andy was willing could be added
> to the MARC::Record core.
> 
> However, the docs [1] seem to say that it is only possible to
> determine how a field should normalize in the context of the
> collection of records that it is a part of...and that MARC::Record
> has no way of determining this, so perhaps this idea is not on target?

Okay, I think you're right that subclassing MARC::Record isn't going to
cut the mustard, since MARC::Batch would still not pick it up (thus it
isn't exactly a drop-in replacement, which would be ideal).

> If you would like to contribute your NACO normalization function to
> CPAN (as I definitely think you should), and my reading of the LC docs
> is correct, then I would recommend you add a Text::NACO module.  The
> Normalize part is a bit redundant because all the modules in Text do
> some kind of normalization. The package could export a function
> normalize() on demand, which you then pass a string, and get back the
> NACO-normalized version. You could also add it to the Biblio namespace
> as Biblio::NACO, or MARC::NACO, but that's really your call as the
> module author :) The main thing is to get it up there somewhere.

What I'm now envisioning is a module, still called MARC::Record::NACO,
which is not a subclass, but would export two functions on demand,
normalize() and compare().

---

* normalize()

inputs: either a MARC::Record object or a string. This should probably
accept an arbitrary number of inputs so you can do

my @normrecs = normalize( @records );

rather than

my @normrecs;
foreach my $rec ( @records ) {
    push @normrecs, normalize( $rec );
}

But you still could if you wanted to.

Given a M::R object it would do as the rules state [1] for the
appropriate fields in the record. Returns a M::R object.

Given a string, it would apply the string normalization rules. Returns a
string.

* compare()

inputs: either two M::R objects or two strings.

Given two M::R objects, both are normalize()'ed. It would return false
(or should it be true?) if, based on the rules [1], some field in $a
matches some field in $b.

Given two strings, both are again normalize()'ed and a simple "cmp" is
performed.

---
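The two-string case of the proposed compare() can be sketched as below. The normalize() here is the same simplified stand-in for the full NACO rules as before, and the compare() semantics (normalize both sides, then "cmp") are what the description above proposes, not shipped code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified NACO-style string normalization (subset of the LC rules).
sub normalize {
    my $text = uc shift;
    $text =~ s/[[:punct:]]/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Proposed string form of compare(): normalize both inputs, then "cmp"
# them, so the result (-1, 0, 1) can drive sorting as well as matching.
sub compare {
    my ( $left, $right ) = @_;
    return normalize($left) cmp normalize($right);
}

print compare( "Twain, Mark", "TWAIN MARK" ), "\n";   # equal once normalized
```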

It sucks that given different inputs the results returned are a bit
inconsistent. However, there's no way to say that $a > $b for a M::R (is
there? :). One might want to be able to sort normalized strings, so it
makes sense that compare()'ing two strings does a "cmp".

How's that sound?

-Brian Cassidy ( [EMAIL PROTECTED] )

[1] http://lcweb.loc.gov/catdir/pcc/naco/normrule.html




RE: NACO Normalization and Text::Normalize

2003-08-27 Thread Houghton,Andrew
From: Brian Cassidy [mailto:[EMAIL PROTECTED]
Subject: RE: NACO Normalization and Text::Normalize

> * normalize()
>
> inputs: either a MARC::Record object or a string. This should probably
> accept an arbitrary number of inputs so you can do
> * compare()
> 
> inputs: either two M::R objects or two strings.
> 
> Given two M::R objects, both are normalize()'ed. It would return false
> (or should it be true?) if, based on the rules [1], some field in $a
> matches some field in $b.

You may need some additional parameters, like what tags to normalize,
since you may want to do NACO normalization on fields other than the
1XX.  For example, I currently do NACO normalization on the 1XX, 4XX,
5XX and 7XX in my Authority records.  By doing that I can quickly
build a hash that allows me to find the broader, narrower, related 
and use-for references for a record in the entire Authority file.
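That lookup idea can be sketched as below: index records by the normalized form of each heading tag, then resolve references with a hash lookup. Plain hashes stand in for parsed authority records here (a real version would pull fields from MARC::Record objects), and normalize() is the same simplified subset of the NACO rules:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified NACO-style normalization (subset of the LC rules).
sub normalize {
    my $text = uc shift;
    $text =~ s/[[:punct:]]/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Toy authority "records": plain hashes keyed by tag, with made-up ids.
my @records = (
    { id => 'n0001', '100' => 'Twain, Mark,', '400' => 'Clemens, Samuel' },
    { id => 'n0002', '100' => 'Dickens, Charles.' },
);

# Index every normalized 1XX/4XX/5XX/7XX heading back to its record id,
# so broader/narrower/related/use-for references resolve in one lookup.
my %heading_index;
for my $rec (@records) {
    for my $tag (qw( 100 400 500 700 )) {
        next unless defined $rec->{$tag};
        push @{ $heading_index{ normalize( $rec->{$tag} ) } }, $rec->{id};
    }
}

print "@{ $heading_index{'CLEMENS SAMUEL'} }\n";   # use-for reference hit
```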

Andy.


Re: NACO Normalization and Text::Normalize

2003-08-27 Thread Ed Summers
On Wed, Aug 27, 2003 at 09:15:25AM -0300, Brian Cassidy wrote:
> * normalize()
> 
> inputs: either a MARC::Record object or a string. This should probably
> accept an arbitrary number of inputs so you can do
> 
> my @normrecs = normalize( @records );
> 
> rather than
> 
> my @normrecs;
> foreach my $rec ( @records ) {
>   push @normrecs, normalize( $rec );
> }
> 
> But you still could if you wanted to.
> 
> Given a M::R object it would do as the rules state [1] for the
> appropriate fields in the record. Returns a M::R object.
> 
> Given a string, it would apply the string normalization rules. Returns a
> string.
> 
> * compare()
> 
> inputs: either two M::R objects or two strings.
> 
> Given two M::R objects, both are normalize()'ed. It would return false
> (or should it be true?) if, based on the rules [1], some field in $a
> matches some field in $b.
> 
> Given two strings, both are again normalize()'ed and a simple "cmp" is
> performed.

I like the idea of a package MARC::Record::NACO which exports the normalize() 
and compare() functions. My $.02 is that you not overload normalize() and 
compare() too much, but create different functions, since you'll have the 
entire MARC::Record::NACO namespace to play with!

normalize( $string );
normalize_record( $record, 100, 110, etc );
compare( $string1, $string2 );
compare_record( $record1, $record2, 100, 110, etc );
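A sketch of that split: one narrow function per input type instead of a single DWIM entry point. Plain hashes stand in for MARC::Record objects, the tag-list behavior is an assumption from this thread, and normalize() is the same simplified subset of the NACO rules:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified NACO-style normalization (subset of the LC rules).
sub normalize {
    my $text = uc shift;
    $text =~ s/[[:punct:]]/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Record variant: normalize only the requested tags, leave the rest
# alone, and return a new record rather than mutating the input.
sub normalize_record {
    my ( $record, @tags ) = @_;
    my %out = %$record;
    $out{$_} = normalize( $out{$_} ) for grep { defined $out{$_} } @tags;
    return \%out;
}

my $rec  = { '100' => 'Austen, Jane,', '245' => 'Pride and prejudice /' };
my $norm = normalize_record( $rec, 100 );
print "$norm->{100}\n";   # only the requested tag is normalized
```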

I know it's heresy, but when it comes to designing programs and interfaces I've 
come to trust an aspect of the Unix philosophy over the Perl philosophy. 

Unix: Make each program (function) do one thing well.
Perl: DWIM (Do What I Mean)

I see you've got CPAN modules up there already, but if you need any help with 
the test suite or anything I would be willing to help out. At any rate, please 
post to the list if you end up releasing something.

//Ed