Re: [CODE4LIB] Net::OAI::Harvester

2011-08-11 Thread Westman, Stephen
Thanks!  I'll look into it.

Take care,

Stephen

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas Krichel 
[kric...@openlib.org]
Sent: Thursday, August 11, 2011 1:07 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Net::OAI::Harvester

  Westman, Stephen writes

> I'm working on a Perl-based OAI harvester and have run into a problem.
> The module that I'm using - Net::OAI::Harvester - does a great job
> of parsing out the different OAI tagged fields so that they can be
> put into a MySQL table of retrieved OAI records for searching.

  I suggest you use HTTP::OAI instead.


  Cheers,

  Thomas Krichel    http://openlib.org/home/krichel
  http://authorprofile.org/pkr1
   skype: thomaskrichel


Re: [CODE4LIB] Net::OAI::Harvester

2011-08-11 Thread Thomas Krichel
  Westman, Stephen writes

> Thanks!  I'll look into it.

  I've put out a sample script using it at 

http://wotan.liu.edu/home/mamf/tmp/westman

  In this work I used OpenDOAR to get repository sources.  I am not
  maintaining this any more because I now use a source from BASE for
  feeding repository data into AuthorClaim. That source is documented
  at 

http://wotan.liu.edu/base

> Take care,

  Enjoy,


  Thomas Krichel    http://openlib.org/home/krichel
  http://authorprofile.org/pkr1
   skype: thomaskrichel


Re: [CODE4LIB] Net::OAI::Harvester

2011-08-11 Thread Thomas Berger

Hi Stephen,


> Does anyone know a way in which Net::OAI::Harvester can be used with oai_dc
> records in a way where multiple instances of a tag can be captured and then
> concatenated with the first one?

It should work to extract the tag in array context:

my $dc = $r->metadata();

my @their_identifiers = $dc->identifier();

Further concatenation would be left to the application.
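
The concatenation step might look something like the minimal sketch below.
The helper name join_values, the separator, and the sample identifiers are
made up for illustration and are not part of Net::OAI::Harvester; in a real
harvest the list would come from calling the identifier accessor in list
context as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Net::OAI::Harvester): join every
# non-empty value of a repeated Dublin Core element into one string
# suitable for a single database column.
sub join_values {
    my ($separator, @values) = @_;
    return join $separator, grep { defined($_) && length($_) } @values;
}

# Stand-in data; in a real harvest this list would come from the
# accessor called in list context.
my @identifiers = ('oai:example.org:123', 'http://example.org/item/123');

print join_values(' | ', @identifiers), "\n";
```

Keeping the separator as a parameter makes it easy to pick one that cannot
occur inside an identifier, so the values can be split apart again later.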


> That being said, I do have one other question: Is there a way within the
> Net::OAI::Harvester to output the actual metadata structure that's being
> harvested?

There is no mandatory method for any metadata handlers within the
context of Net::OAI::Harvester to provide this.

And for the default OAI_DC handler: It provides the asString method
which at least shows the complete content.

hope this helps
Thomas Berger


Re: [CODE4LIB] Net::OAI::Harvester

2011-08-11 Thread Westman, Stephen
Hey, Thomas - friend and fellow night owl.  This solution is great!!

This just goes to show that sometimes it IS better to go to bed and get some 
sleep than to keep struggling in the wrong direction.

I was so focused on trying to approach it the way that I would with an XSLT 
stylesheet - sequentially/iteratively - that I hadn't thought to try reading 
it all at once into an array.

Thanks a million!!

Stephen


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas Berger 
[t...@gymel.com]
Sent: Thursday, August 11, 2011 3:02 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Net::OAI::Harvester


Hi Stephen,


> Does anyone know a way in which Net::OAI::Harvester can be used with oai_dc
> records in a way where multiple instances of a tag can be captured and then
> concatenated with the first one?

It should work to extract the tag in array context:

my $dc = $r->metadata();

my @their_identifiers = $dc->identifier();

Further concatenation would be left to the application.


> That being said, I do have one other question: Is there a way within the
> Net::OAI::Harvester to output the actual metadata structure that's being
> harvested?

There is no mandatory method for any metadata handlers within the
context of Net::OAI::Harvester to provide this.

And for the default OAI_DC handler: It provides the asString method
which at least shows the complete content.

hope this helps
Thomas Berger


[CODE4LIB] Net::OAI::Harvester

2011-08-10 Thread Westman, Stephen
I'm working on a Perl-based OAI harvester and have run into a problem.  The 
module that I'm using - Net::OAI::Harvester - does a great job of parsing out 
the different OAI tagged fields so that they can be put into a MySQL table of 
retrieved OAI records for searching.

Unfortunately, in using the University of Michigan OAI Toolkit, I have found 
that at least one repository has repeated tags.  In particular, multiple 
identifier tags.  This presents a problem in that it seems that 
Net::OAI::Harvester gets the first (and, as far as I know how to use it, only 
the first) instance of a tag.  In addition to the loss of data (which is always 
bad), it is made worse here by the fact that the repository that I'm trying to 
harvest usually places the URL to connect to the repository item in the second 
identifier tag.  That being the case, the URL does not get saved to the 
database and the harvest is less-than-useful to our users.

Does anyone know a way in which Net::OAI::Harvester can be used with oai_dc 
records in a way where multiple instances of a tag can be captured and then 
concatenated with the first one?

I have spent some time trying a number of different approaches, including 
trying different libraries (such as XML::LibXML and XML::SAX::Parser), but I 
can't seem to get it to work with the input I get inside the 
Net::OAI::Harvester module (which has been run through the Storable module).

Unfortunately, the documentation that I have been able to find on the Web does 
not provide information on any methods that I could use.

Would it make more sense just to move to the University of Michigan Toolkit to 
harvest the XML records?  I would prefer to continue with the 
Net::OAI::Harvester module if I can in that it allows me to be flexible in what 
sorts of schemas I'm able to harvest, not just unqualified Dublin Core.  

That being said, I do have one other question: Is there a way within the 
Net::OAI::Harvester to output the actual metadata structure that's being 
harvested?

Thanks in advance for any assistance that you can provide!

Stephen Westman


Re: [CODE4LIB] Net::OAI::Harvester

2011-08-10 Thread Thomas Krichel
  Westman, Stephen writes

> I'm working on a Perl-based OAI harvester and have run into a problem.
> The module that I'm using - Net::OAI::Harvester - does a great job
> of parsing out the different OAI tagged fields so that they can be
> put into a MySQL table of retrieved OAI records for searching.

  I suggest you use HTTP::OAI instead.


  Cheers,

  Thomas Krichel    http://openlib.org/home/krichel
  http://authorprofile.org/pkr1
   skype: thomaskrichel


Re: [CODE4LIB] net::oai::harvester [resolved]

2007-12-17 Thread Eric Lease Morgan

On Dec 13, 2007, at 10:33 AM, Eric Lease Morgan wrote:


>> Put another way, if I want to use Net::OAI::Harvester
>> to read repository data in a form other than DC, will
>> I need to write an additional module such as
>> Net::OAI::Record::MARCXML?
>
>
> But I'm lazy, and even though it is not the best solution, I will
> explore another option. Specifically, I will use oai_dump (which
> comes with N::O::H), change the metadata scheme from oai_dc to
> marc21, run the script, and parse the resulting XML. If I'm lucky
> my parser will be able to be written as a SAX filter that can be
> added to the N::O::H distribution. In the meantime, at least I will
> have the data. Wish me luck.




After getting most of my MARCXML/SAX parser written, Ed Summers
presented me with a couple of Perl modules allowing me to return
MARC::Record objects from the harvest of OAI repositories supporting
the marc21 metadata schema. This is originally what I wanted to do.

Using this technology I was able to harvest the metadata (MARC
records) of 70,000 University of Michigan digitized books (MBooks). I
then fed them to an indexer -- Zebra -- that reads raw MARC very
well, and provided a rudimentary interface to the index via SRU:

  http://infomotions.com/ii/

In the end the process was almost trivial and can easily be expanded
to include other types of content.

Thank you to all who helped along the way!

--
Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604


Re: [CODE4LIB] net::oai::harvester [resolved]

2007-12-17 Thread Ed Summers
fwiw, my proposed solution was to use MARC::File::XML from the
marc-xml cpan module [1]

  use Net::OAI::Harvester;
  use MARC::File::SAX;

  my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
  my $harvester = Net::OAI::Harvester->new(baseURL => $url);
  my $response = $harvester->listRecords(
      metadataPrefix  => 'marc21',
      metadataHandler => 'MARC::File::SAX'
  );

  while (my $record = $response->next()) {
      # get the oai-pmh record as a MARC::Record object and print it out
      print $record->metadata()->record()->as_formatted();
  }

Note, this required some small adjustments to MARC::File::SAX. So
you'd need the latest version from CVS [2], which will go out to CPAN
some time I guess :-)

//Ed

[1] http://search.cpan.org/dist/MARC-XML/
[2] http://sourceforge.net/cvs/?group_id=1254


[CODE4LIB] net::oai::harvester

2007-12-13 Thread Eric Lease Morgan

I am looking for some guidelines regarding the use of
Net::OAI::Harvester.

Specifically, I'm hoping someone would outline one or two different
techniques for using the Net::OAI::Harvester Perl modules to harvest
and store OAI metadata other than oai_dc. Suppose the data provider
supports MARCXML, or MODS. How can I iterate through a repository and
then either: 1) save the MARCXML or MODS data to my file system, or
2) parse the MARCXML or MODS and do some processing on it within my
application?

Put another way, if I want to use Net::OAI::Harvester to read
repository data in a form other than DC, will I need to write an
additional module such as Net::OAI::Record::MARCXML?

--
Eric Lease Morgan
University Libraries of Notre Dame


Re: [CODE4LIB] net::oai::harvester

2007-12-13 Thread Don Gourley
On Thu, December 13, 2007 8:18 am, Eric Lease Morgan wrote:
...
> Put another way, if I want to use Net::OAI::Harvester to read
> repository data in a form other than DC, will I need to write an
> additional module such as Net::OAI::Record::MARCXML?

I don't know if this is the only way to do it, but that is how
I use Net::OAI::Harvester to handle metadata in DIDL that I get
out of a DSpace repository via a custom crosswalk plugin.  The
harvest script simply invokes the Harvester like this:

my $rec = $harvester->getRecord('identifier'      => $oaiid,
                                'metadataPrefix'  => 'oai_didl',
                                'metadataHandler' => 'DOC_DIDL',
                                'set'             => $set);

DOC_DIDL includes the SAX event handlers to build a hash of
metadata and some methods to retrieve that metadata used by
the harvest script.

Writing DOC_DIDL.pm was a little messy and specific to the way
we encode structural metadata in DIDL, mostly because I was
flattening an hierarchical schema into a hash.  Obviously you'll
need a completely different module but if you want to see this
one just to see what is involved I can send it to you off-list.

-Don


Re: [CODE4LIB] net::oai::harvester

2007-12-13 Thread Ed Summers
Hey Eric:

N::O::H uses XML::SAX for XML parsing, which provides a standard
interface to multiple back end XML parsers, and also provides a
facility known as XML Filters [1].

Net::OAI::Record::OAI_DC is an example of a SAX filter which receives
SAX events for each metadata record in a response and builds up a
representation of the record. Since oai_dc is standard in oai-pmh-land
it's assumed as a default a lot of the time.

So if you want to retrieve another kind of metadata you have to write
a SAX filter for it, and then reference it when you are calling
getRecord(), listRecords() or listAllRecords().

So for example here's a test script for a MODSHandler detailed below:

--

  use Net::OAI::Harvester;
  use MODSHandler;

  my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
  my $harvester = Net::OAI::Harvester->new(baseURL => $url);
  my $records = $harvester->listRecords(
      metadataPrefix  => 'mods',
      metadataHandler => 'MODSHandler'
  );

  while (my $record = $records->next()) {
      print $record->metadata()->title(), "\n";
  }

--

And here's a barely functional MODSHandler that just pulls out the title:

--

  package MODSHandler;

  use XML::SAX::Base;
  use base qw(XML::SAX::Base);

  sub new {
      my $class = shift;
      return bless {inside => 0}, ref($class) || $class;
  }

  sub title {
      return shift->{title};
  }

  # entering a title element: start collecting character data
  sub start_element {
      my ($self, $element) = @_;
      if ($element->{Name} eq 'title') {$self->{inside} = 1;}
  }

  # leaving a title element: stop collecting
  sub end_element {
      my ($self, $element) = @_;
      if ($element->{Name} eq 'title') {$self->{inside} = 0;}
  }

  sub characters {
      my ($self, $chars) = @_;
      if ($self->{inside}) {
          $self->{title} .= $chars->{Data};
      }
  }

  1;

--

Kind of sad that there's that much code to just get at the contents of
the title element. Perhaps there are some SAX Filters on CPAN that can
build up a DOM like object for you.

Interestingly, back in 2000 or whatever when this was written, it felt
like pretty state of the art to use filters in this way. But today it
seems kind of overkill to have to write a state machine just to get at
some XML. The ruby oai library [2] I worked on more recently kind of
bucks the trend by not trying to create fancy objects for records,
hand-waving memory concerns (which never seemed to surface), and just
returning what amounts to a DOM so the user can figure out what
they want.

Let me know if you run into any trouble.

//Ed

[1] http://www.xml.com/pub/a/2001/10/10/sax-filters.html
[2] http://oai.rubyforge.org


Re: [CODE4LIB] net::oai::harvester

2007-12-13 Thread Eric Lease Morgan

On Dec 13, 2007, at 9:10 AM, Don Gourley wrote:


>> Put another way, if I want to use Net::OAI::Harvester
>> to read repository data in a form other than DC, will
>> I need to write an additional module such as
>> Net::OAI::Record::MARCXML?
>
>
> I don't know if this is the only way to do it, but that is how
> I use Net::OAI::Harvester to handle metadata in DIDL that I get
> out of a DSpace repository via a custom crosswalk plugin.  The
> harvest script simply invokes the Harvester like this:
>
> my $rec = $harvester->getRecord('identifier'      => $oaiid,
>                                 'metadataPrefix'  => 'oai_didl',
>                                 'metadataHandler' => 'DOC_DIDL',
>                                 'set'             => $set);




On Dec 13, 2007, at 9:48 AM, Ed Summers wrote:


> Net::OAI::Record::OAI_DC is an example of a SAX filter
> which receives SAX events for each metadata record in a
> response and builds up a representation of the record.
> Since oai_dc is standard in oai-pmh-land it's assumed
> as a default a lot of the time.
>
> So if you want to retrieve another kind of metadata you
> have to write a SAX filter for it, and then reference
> it when you are calling getRecord(), listRecords() or
> listAllRecords()
>
> And here's a barely functional MODSHandler that just
> pulls out the title:
>
>   package MODSHandler;
>
>   use XML::SAX::Base;
>   use base qw(XML::SAX::Base);
>
>   sub new {
>       my $class = shift;
>       return bless {inside => 0}, ref($class) || $class;
>   }
>
>   sub title {
>       return shift->{title};
>   }
>
>   sub start_element {
>       my ($self, $element) = @_;
>       if ($element->{Name} eq 'title') {$self->{inside} = 1;}
>   }
>
>   sub end_element {
>       my ($self, $element) = @_;
>       if ($element->{Name} eq 'title') {$self->{inside} = 0;}
>   }
>
>   sub characters {
>       my ($self, $chars) = @_;
>       if ($self->{inside}) {
>           $self->{title} .= $chars->{Data};
>       }
>   }
>
>   1;




Thank you for the prompt replies, and y'all have confirmed what I
believed. The best way to accomplish my goal is to write a SAX
filter for the metadata schema I desire.

But I'm lazy, and even though it is not the best solution, I will
explore another option. Specifically, I will use oai_dump (which
comes with N::O::H), change the metadata scheme from oai_dc to
marc21, run the script, and parse the resulting XML. If I'm lucky my
parser will be able to be written as a SAX filter that can be added to
the N::O::H distribution. In the meantime, at least I will have the
data. Wish me luck.

--
Eric Lease Morgan
University Libraries of Notre Dame


Re: [CODE4LIB] net::oai::harvester

2007-12-13 Thread Ed Summers
Another option if you are in Perl land would be to take a look at Tim
Brody's HTTP::OAI library [1] which returns XML::DOM::Document objects
for record metadata, which you can walk around in and use to evaluate
xpaths:

--

  use HTTP::OAI;

  my $harvester = HTTP::OAI::Harvester->new(
      baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
  );

  my $response = $harvester->ListRecords(metadataPrefix => 'mods');

  while (my $record = $response->next()) {
      print $record->metadata()->dom()->findvalue('//valid/xpath/here'), "\n";
  }

--

I left the xpath as an exercise for the reader since I couldn't figure
out how to use the http://www.loc.gov/mods/v3 namespace properly.

Ahh, namespaces :-)
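
For what it's worth, one way the namespace could be handled is sketched
below. It assumes the metadata is available as (or can be re-parsed into)
an XML::LibXML document, which may not match the DOM class your version of
HTTP::OAI hands back, and it runs against a made-up MODS fragment rather
than a live harvest. The prefix "mods" is an arbitrary choice; only the
namespace URI has to match.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Stand-in MODS fragment; a real harvest would supply the record's DOM.
my $xml = <<'XML';
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Walden</title></titleInfo>
</mods>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind a prefix of our choosing to the MODS namespace URI, then use
# that prefix in every step of the XPath expression.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(mods => 'http://www.loc.gov/mods/v3');

my $title = $xpc->findvalue('//mods:titleInfo/mods:title');
print "$title\n";
```

Without the registered prefix, a plain '//title' matches nothing here,
because the elements live in the MODS default namespace.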

//Ed

[1] http://search.cpan.org/dist/HTTP-OAI/


Re: [CODE4LIB] net::oai::harvester

2007-12-13 Thread Eric Lease Morgan

On Dec 13, 2007, at 9:48 AM, Ed Summers wrote:


>   use Net::OAI::Harvester;
>   use MODSHandler;
>
>   my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
>   my $harvester = Net::OAI::Harvester->new(baseURL => $url);
>   my $records = $harvester->listRecords(
>       metadataPrefix  => 'mods',
>       metadataHandler => 'MODSHandler'
>   );
>
>   while (my $record = $records->next()) {
>       print $record->metadata()->title(), "\n";
>   }
>
> ...
>
> Interestingly, back in 2000 or whatever when this was written, it felt
> like pretty state of the art to use filters in this way. But today it
> seems kind of overkill to have to write a state machine just to get at
> some XML. The ruby oai library [2] I worked on more recently kind of
> bucks the trend by not trying to create fancy objects for records,
> hand-waving memory concerns (which never seemed to surface), and just
> returning what amounts to a DOM so the user can figure out what
> they want.




What type(s) of data are the methods called on the result of the
metadata method (above) expected to return? Only scalars? How about
objects? How about other Perl data structures like a hash (of
hashes)? Is there a pre-defined set of methods that can be called
against the metadata method?

I suppose the aforementioned MODSHandler can be designed to support
any number of methods returning different types of data. Correct?
For example, the code above is designed to return a title. Additional
methods might return authors, subjects, publishers, etc.

Spurred on by the availability of MBooks from the University of
Michigan [1], I have written the beginnings of a SAX filter for
MARCXML data. Currently it iterates over MARCXML, parses the data,
and prints to STDOUT something looking like a MARC tagged display.
Ironically, this was rather easy because MARCXML only has a limited
number of elements: leader, controlfield, datafield, and subfield.

Using Ed's code as a model, I think I could create a method called
MARC that returns a MARC::Record object, like this:

  use Net::OAI::Harvester;
  use MARCXML;

  my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
  my $harvester = Net::OAI::Harvester->new( baseURL => $url );
  my $records = $harvester->listRecords(

   metadataPrefix  => 'marc21',
   metadataHandler => 'MARCXML'

  );

  while ( my $record = $records->next ) {

   # call the MARC method returning a MARC::Record object
   my $marc = $record->metadata()->MARC;

   # apply cool MARC::Record methods against the object
   print $marc->title, "\n";

  }

Alternatively, I suppose I could create methods like this:

  $leader  = $record->metadata()->leader;
  $control = $record->metadata()->control;
  $title   = $record->metadata()->datafield( '245', 'a' );
  $author  = $record->metadata()->datafield( '100', 'a' );
  $url     = $record->metadata()->datafield( '856', 'u' );

Is this approach a good idea? On the other hand, maybe I should
return the whole record in all of its MARC glory. Which approach is
better? Maybe I should do both? Maybe I should return a DOM as Ed
alludes to. Ah, the choices!
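
To make the second option concrete, here is a rough sketch of what such an
accessor might look like once the SAX events have been collected. Everything
in it is hypothetical: the package name MARCSketch, the add_datafield
method, and the sample data are invented for illustration, and the SAX
plumbing that would fill the object is left out. The accessor answers with
every match in list context and only the first in scalar context, which is
one possible convention, not a requirement of N::O::H.

```perl
use strict;
use warnings;

package MARCSketch;   # hypothetical name, not a CPAN module

sub new {
    my ($class) = @_;
    return bless { leader => '', fields => [] }, $class;
}

# Record one datafield: a tag plus a list of [code, value] subfield pairs.
sub add_datafield {
    my ($self, $tag, @subfields) = @_;
    push @{ $self->{fields} }, { tag => $tag, subfields => [@subfields] };
    return $self;
}

# Return matching subfield values: all in list context, first in scalar.
sub datafield {
    my ($self, $tag, $code) = @_;
    my @values;
    for my $field (grep { $_->{tag} eq $tag } @{ $self->{fields} }) {
        push @values,
            map  { $_->[1] }
            grep { $_->[0] eq $code } @{ $field->{subfields} };
    }
    return wantarray ? @values : $values[0];
}

package main;

my $rec = MARCSketch->new;
$rec->add_datafield('245', [ a => 'Walden' ]);
$rec->add_datafield('856', [ u => 'http://example.org/1' ]);
$rec->add_datafield('856', [ u => 'http://example.org/2' ]);

my $title = $rec->datafield('245', 'a');   # scalar context: first match
my @urls  = $rec->datafield('856', 'u');   # list context: every match
print "$title\n", join("\n", @urls), "\n";
```

Returning a full MARC::Record instead would save reinventing accessors like
these, at the cost of tying the handler to that module.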


[1] http://lists.webjunction.org/wjlists/xml4lib/2007-December/005978.html

--
Eric Lease Morgan
University Libraries of Notre Dame