Re: [CODE4LIB] Net::OAI::Harvester
Thanks! I'll look into it.

Take care,
Stephen

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas Krichel [kric...@openlib.org]
Sent: Thursday, August 11, 2011 1:07 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Net::OAI::Harvester

Westman, Stephen writes
> I'm working on a Perl-based OAI harvester and have run into a problem.
> The module that I'm using - Net::OAI::Harvester - does a great job of
> parsing out the different OAI tagged fields so that they can be put
> into a MySQL table of retrieved OAI records for searching.

I suggest you use HTTP::OAI instead.

Cheers,
Thomas Krichel
http://openlib.org/home/krichel
http://authorprofile.org/pkr1
skype: thomaskrichel
Re: [CODE4LIB] Net::OAI::Harvester
Westman, Stephen writes
> Thanks! I'll look into it.

I've put out a sample script using it at http://wotan.liu.edu/home/mamf/tmp/westman

In this work I used OpenDOAR to get repository sources. I am not maintaining this any more because I now use a source from BASE for feeding repository data into AuthorClaim. That source is documented at http://wotan.liu.edu/base

> Take care,

Enjoy,
Thomas Krichel
http://openlib.org/home/krichel
http://authorprofile.org/pkr1
skype: thomaskrichel
Re: [CODE4LIB] Net::OAI::Harvester
Hi Stephen,

> Does anyone know a way in which Net::OAI::Harvester can be used with
> oai_dc records in a way where multiple instances of a tag can be
> captured and then concatenated with the first one.

It should work to extract the tag in list context, i.e. assigned to an array:

    my $dc = $r->metadata();
    my @their_identifiers = $dc->identifier();

Further concatenation would be left to the application.

> That being said, I do have one other question: Is there a way within
> the Net::OAI::Harvester to output the actual metadata structure that's
> being harvested?

There is no mandatory method for any metadata handlers within the context of Net::OAI::Harvester to provide this. And for the default OAI_DC handler: It provides the asString method which at least shows the complete content.

hope this helps
Thomas Berger
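The "further concatenation" that is left to the application could look roughly like the following. This is only a sketch: the identifier values are invented stand-ins for whatever $dc->identifier() returns in list context, and join_values, first_url and the ' | ' separator are hypothetical names and choices, not part of Net::OAI::Harvester.

```perl
use strict;
use warnings;

# Invented stand-in for the list returned by $dc->identifier() in list
# context; a real harvest would supply these values.
my @identifiers = (
    'oai:example.org:1234',
    'http://repository.example.org/item/1234',
);

# Join all non-empty values into one string, e.g. for a single database
# column. The ' | ' separator is an arbitrary choice.
sub join_values {
    my @values = grep { defined($_) && length($_) } @_;
    return join ' | ', @values;
}

# Return the first value that looks like an http(s) URL, or undef if none does.
sub first_url {
    my ($url) = grep { defined($_) && m{^https?://}i } @_;
    return $url;
}

my $joined = join_values(@identifiers);
my $url    = first_url(@identifiers);

print "all identifiers: $joined\n";
print "item URL:        $url\n" if defined $url;
```

Keeping the URL lookup separate from the concatenation means the link is still found when a repository puts it in the second (or any later) identifier.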
Re: [CODE4LIB] Net::OAI::Harvester
Hey, Thomas - friend and fellow night owl.

This solution is great!! This just goes to show that sometimes it IS better to go to bed and get some sleep than to keep struggling in the wrong direction. I was so focused on trying to approach it the way that I would with an XSLT stylesheet - sequentially/iteratively - that I hadn't thought to try reading it all at once into an array.

Thanks a million!!
Stephen

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Thomas Berger [t...@gymel.com]
Sent: Thursday, August 11, 2011 3:02 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Net::OAI::Harvester

Hi Stephen,

> Does anyone know a way in which Net::OAI::Harvester can be used with
> oai_dc records in a way where multiple instances of a tag can be
> captured and then concatenated with the first one.

It should work to extract the tag in list context, i.e. assigned to an array:

    my $dc = $r->metadata();
    my @their_identifiers = $dc->identifier();

Further concatenation would be left to the application.

> That being said, I do have one other question: Is there a way within
> the Net::OAI::Harvester to output the actual metadata structure that's
> being harvested?

There is no mandatory method for any metadata handlers within the context of Net::OAI::Harvester to provide this. And for the default OAI_DC handler: It provides the asString method which at least shows the complete content.

hope this helps
Thomas Berger
[CODE4LIB] Net::OAI::Harvester
I'm working on a Perl-based OAI harvester and have run into a problem. The module that I'm using - Net::OAI::Harvester - does a great job of parsing out the different OAI tagged fields so that they can be put into a MySQL table of retrieved OAI records for searching.

Unfortunately, in using the University of Michigan OAI Toolkit, I have found that at least one repository has repeated tags. In particular, multiple identifier tags. This presents a problem in that it seems that Net::OAI::Harvester gets the first (and, as far as I know how to use it, only the first) instance of a tag. In addition to the loss of data (which is always bad), it is made worse here by the fact that the repository that I'm trying to harvest usually places the URL to connect to the repository item in the second identifier tag. That being the case, the URL does not get saved to the database and the harvest is less than useful to our users.

Does anyone know a way in which Net::OAI::Harvester can be used with oai_dc records in a way where multiple instances of a tag can be captured and then concatenated with the first one?

I have spent some time trying a number of different approaches, including trying different libraries (such as XML::LibXML and XML::SAX::Parser), but I can't seem to get it to work with the input I get inside the Net::OAI::Harvester module (which has been run through the Storable module). Unfortunately, the documentation that I have been able to find on the Web does not provide information on any methods that I could use.

Would it make more sense just to move to the University of Michigan Toolkit to harvest the XML records? I would prefer to continue with the Net::OAI::Harvester module if I can, in that it allows me to be flexible in what sorts of schemas I'm able to harvest, not just unqualified Dublin Core.

That being said, I do have one other question: Is there a way within the Net::OAI::Harvester to output the actual metadata structure that's being harvested?

Thanks in advance for any assistance that you can provide!

Stephen Westman
Re: [CODE4LIB] Net::OAI::Harvester
Westman, Stephen writes
> I'm working on a Perl-based OAI harvester and have run into a problem.
> The module that I'm using - Net::OAI::Harvester - does a great job of
> parsing out the different OAI tagged fields so that they can be put
> into a MySQL table of retrieved OAI records for searching.

I suggest you use HTTP::OAI instead.

Cheers,
Thomas Krichel
http://openlib.org/home/krichel
http://authorprofile.org/pkr1
skype: thomaskrichel
Re: [CODE4LIB] net::oai::harvester [resolved]
On Dec 13, 2007, at 10:33 AM, Eric Lease Morgan wrote:

> Put another way, if I want to use Net::OAI::Harvester to read
> repository data in a form other than DC will I need to write an
> additional module such as Net::OAI::Record::MARCXML?
>
> But I'm lazy, and even though it is not the best solution, I will
> explore another option. Specifically, I will use oai_dump (which comes
> with N::O::H), change the metadata scheme from oai_dc to marc21, run
> the script, and parse the resulting XML. If I'm lucky my parser will
> be able to be written as a SAX filter that can be added to the N::O::H
> distribution. In the meantime, at least I will have the data. Wish me
> luck.

After getting most of my MARCXML/SAX parser written, Ed Summers presented me with a couple of Perl modules allowing me to return MARC::Record objects from the harvest of OAI repositories supporting the marc21 metadata schema. This is originally what I wanted to do.

Using this technology I was able to harvest the metadata (MARC records) of 70,000 University of Michigan digitized books (MBooks). I then fed them to an indexer -- Zebra -- that reads raw MARC very well, and provided a rudimentary interface to the index via SRU:

http://infomotions.com/ii/

In the end the process was almost trivial and can easily be expanded to include other types of content. Thank you to all who helped along the way!

--
Eric Lease Morgan
University Libraries of Notre Dame
(574) 631-8604
Re: [CODE4LIB] net::oai::harvester [resolved]
fwiw, my proposed solution was to use MARC::File::XML from the marc-xml cpan module [1]

    use Net::OAI::Harvester;
    use MARC::File::SAX;

    my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
    my $harvester = Net::OAI::Harvester->new(baseURL => $url);
    my $response = $harvester->listRecords(
        metadataPrefix  => 'marc21',
        metadataHandler => 'MARC::File::SAX'
    );

    while (my $record = $response->next()) {
        # get the oai-pmh record as a MARC::Record object and print it out
        print $record->metadata()->record()->as_formatted();
    }

Note, this required some small adjustments to MARC::File::SAX. So you'd need the latest version from CVS [2], which will go out to CPAN some time I guess :-)

//Ed

[1] http://search.cpan.org/dist/MARC-XML/
[2] http://sourceforge.net/cvs/?group_id=1254
[CODE4LIB] net::oai::harvester
I am looking for some guidelines regarding the use of Net::OAI::Harvester.

Specifically, I'm hoping someone would outline one or two different techniques for using the Net::OAI::Harvester Perl modules to harvest and store OAI metadata other than oai_dc. Suppose the data provider supports MARCXML, or MODS. How can I iterate through a repository and then either:

  1) save the MARCXML or MODS data to my file system, or
  2) parse the MARCXML or MODS and do some processing on it within my application.

Put another way, if I want to use Net::OAI::Harvester to read repository data in a form other than DC will I need to write an additional module such as Net::OAI::Record::MARCXML?

--
Eric Lease Morgan
University Libraries of Notre Dame
Re: [CODE4LIB] net::oai::harvester
On Thu, December 13, 2007 8:18 am, Eric Lease Morgan wrote:

> ...
> Put another way, if I want to use Net::OAI::Harvester to read
> repository data in a form other than DC will I need to write an
> additional module such as Net::OAI::Record::MARCXML?

I don't know if this is the only way to do it, but that is how I use Net::OAI::Harvester to handle metadata in DIDL that I get out of a DSpace repository via a custom crosswalk plugin. The harvest script simply invokes the Harvester like this:

    my $rec = $harvester->getRecord(
        'identifier'      => $oaiid,
        'metadataPrefix'  => 'oai_didl',
        'metadataHandler' => 'DOC_DIDL',
        'set'             => $set
    );

DOC_DIDL includes the SAX event handlers to build a hash of metadata and some methods to retrieve that metadata used by the harvest script. Writing DOC_DIDL.pm was a little messy and specific to the way we encode structural metadata in DIDL, mostly because I was flattening a hierarchical schema into a hash. Obviously you'll need a completely different module, but if you want to see this one just to see what is involved I can send it to you off-list.

-Don
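For readers wondering what "flattening a hierarchical schema into a hash" can involve, here is a small, generic sketch. It is not Don's DOC_DIDL.pm; the flatten function, the slash-separated key format and the sample data are all assumptions made up for illustration.

```perl
use strict;
use warnings;

# Flatten nested hashes/arrays into a single-level hash whose keys are
# slash-separated paths (array positions become numeric path segments).
sub flatten {
    my ($node, $prefix, $flat) = @_;
    $flat = {} unless defined $flat;

    if (ref($node) eq 'HASH') {
        for my $key (sort keys %$node) {
            my $path = defined $prefix ? "$prefix/$key" : $key;
            flatten($node->{$key}, $path, $flat);
        }
    }
    elsif (ref($node) eq 'ARRAY') {
        for my $i (0 .. $#$node) {
            my $path = defined $prefix ? "$prefix/$i" : $i;
            flatten($node->[$i], $path, $flat);
        }
    }
    else {
        # Leaf value: store it under the path accumulated so far.
        $flat->{$prefix} = $node;
    }
    return $flat;
}

# Invented sample: one item with two components, loosely DIDL-like.
my %item = (
    title     => 'Annual report',
    component => [
        { url => 'http://example.org/report.pdf',  mime => 'application/pdf' },
        { url => 'http://example.org/report.html', mime => 'text/html' },
    ],
);

my $flat = flatten(\%item);
print "$_ = $flat->{$_}\n" for sort keys %$flat;
```

The messy part Don mentions is usually deciding what the keys should be; path-style keys like these keep repeated components apart at the cost of less readable names.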
Re: [CODE4LIB] net::oai::harvester
Hey Eric:

N::O::H uses XML::SAX for XML parsing, which provides a standard interface to multiple back end XML parsers, and also provides a facility known as XML Filters [1]. Net::OAI::Record::OAI_DC is an example of a SAX filter which receives SAX events for each metadata record in a response and builds up a representation of the record. Since oai_dc is standard in oai-pmh-land it's assumed as a default a lot of the time.

So if you want to retrieve another kind of metadata you have to write a SAX filter for it, and then reference it when you are calling getRecord(), listRecords() or listAllRecords(). So for example here's a test script for a MODSHandler detailed below:

    use Net::OAI::Harvester;
    use MODSHandler;

    my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
    my $harvester = Net::OAI::Harvester->new(baseURL => $url);
    my $records = $harvester->listRecords(
        metadataPrefix  => 'mods',
        metadataHandler => 'MODSHandler'
    );

    while (my $record = $records->next()) {
        print $record->metadata()->title(), "\n";
    }

And here's a barely functional MODSHandler that just pulls out the title:

    package MODSHandler;

    use XML::SAX::Base;
    use base qw(XML::SAX::Base);

    sub new {
        my $class = shift;
        return bless {inside => 0}, ref($class) || $class;
    }

    sub title {
        return shift->{title};
    }

    sub start_element {
        my ($self, $element) = @_;
        if ($element->{Name} eq 'title') { $self->{inside} = 1; }
    }

    sub end_element {
        my ($self, $element) = @_;
        if ($element->{Name} eq 'title') { $self->{inside} = 0; }
    }

    sub characters {
        my ($self, $chars) = @_;
        if ($self->{inside}) {
            $self->{title} .= $chars->{Data};
        }
    }

    1;

Kind of sad that there's that much code to just get at the contents of the title element. Perhaps there are some SAX Filters on CPAN that can build up a DOM-like object for you.

Interestingly, back in 2000 or whatever when this was written it felt like pretty state of the art to use filters in this way. But today it seems kind of overkill to have to write a state machine just to get at some XML. The ruby oai library [2] I worked on more recently kind of bucks the trend: it doesn't try to create fancy objects for records, hand-waves memory concerns (which never seemed to surface), and just returns what amounts to a DOM and lets the user figure out what they want.

Let me know if you run into any trouble.

//Ed

[1] http://www.xml.com/pub/a/2001/10/10/sax-filters.html
[2] http://oai.rubyforge.org
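The state-machine idea can be extended to keep every occurrence of a repeated element rather than a single value. The sketch below is an illustration only: RepeatCollector and values_for are hypothetical names, the class deliberately does not inherit from XML::SAX::Base so that it runs without CPAN modules, and the events are fed in by hand using the same Name and Data hash keys the handler above reads. It only collects leaf (text-only) elements.

```perl
use strict;
use warnings;

package RepeatCollector;

sub new {
    my $class = shift;
    return bless { current => undef, buffer => '', values => {} }, $class;
}

# Remember which element we are inside and start a fresh text buffer.
sub start_element {
    my ($self, $element) = @_;
    $self->{current} = $element->{Name};
    $self->{buffer}  = '';
}

# Text may arrive in several chunks, so append rather than assign.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data} if defined $self->{current};
}

# On the matching end tag, push the buffered text onto that element's list.
sub end_element {
    my ($self, $element) = @_;
    return unless defined($self->{current})
        && $self->{current} eq $element->{Name};
    push @{ $self->{values}{ $element->{Name} } }, $self->{buffer};
    $self->{current} = undef;
}

# All values seen for an element name, in document order.
sub values_for {
    my ($self, $name) = @_;
    return @{ $self->{values}{$name} || [] };
}

package main;

# Feed two hand-made dc:identifier elements (invented values).
my $handler = RepeatCollector->new;
for my $text ('oai:example.org:1', 'http://example.org/item/1') {
    $handler->start_element({ Name => 'dc:identifier' });
    $handler->characters({ Data => $text });
    $handler->end_element({ Name => 'dc:identifier' });
}

my @ids = $handler->values_for('dc:identifier');
print join("\n", @ids), "\n";
```

To use something like this as a real metadataHandler it would presumably need to inherit from XML::SAX::Base as MODSHandler does; that wiring is left out here.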
Re: [CODE4LIB] net::oai::harvester
On Dec 13, 2007, at 9:10 AM, Don Gourley wrote:

>> Put another way, if I want to use Net::OAI::Harvester to read
>> repository data in a form other than DC will I need to write an
>> additional module such as Net::OAI::Record::MARCXML?
>
> I don't know if this is the only way to do it, but that is how I use
> Net::OAI::Harvester to handle metadata in DIDL that I get out of a
> DSpace repository via a custom crosswalk plugin. The harvest script
> simply invokes the Harvester like this:
>
>     my $rec = $harvester->getRecord(
>         'identifier'      => $oaiid,
>         'metadataPrefix'  => 'oai_didl',
>         'metadataHandler' => 'DOC_DIDL',
>         'set'             => $set
>     );

On Dec 13, 2007, at 9:48 AM, Ed Summers wrote:

> Net::OAI::Record::OAI_DC is an example of a SAX filter which receives
> SAX events for each metadata record in a response and builds up a
> representation of the record. Since oai_dc is standard in
> oai-pmh-land it's assumed as a default a lot of the time.
>
> So if you want to retrieve another kind of metadata you have to write
> a SAX filter for it, and then reference it when you are calling
> getRecord(), listRecords() or listAllRecords()
>
> And here's a barely functional MODSHandler that just pulls out the
> title:
>
>     package MODSHandler;
>
>     use XML::SAX::Base;
>     use base qw(XML::SAX::Base);
>
>     sub new {
>         my $class = shift;
>         return bless {inside => 0}, ref($class) || $class;
>     }
>
>     sub title {
>         return shift->{title};
>     }
>
>     sub start_element {
>         my ($self, $element) = @_;
>         if ($element->{Name} eq 'title') { $self->{inside} = 1; }
>     }
>
>     sub end_element {
>         my ($self, $element) = @_;
>         if ($element->{Name} eq 'title') { $self->{inside} = 0; }
>     }
>
>     sub characters {
>         my ($self, $chars) = @_;
>         if ($self->{inside}) {
>             $self->{title} .= $chars->{Data};
>         }
>     }
>
>     1;

Thank you for the prompt replies, and y'all have confirmed what I believed. The best way to accomplish my goal is to write a SAX filter for the metadata schema I desire.

But I'm lazy, and even though it is not the best solution, I will explore another option. Specifically, I will use oai_dump (which comes with N::O::H), change the metadata scheme from oai_dc to marc21, run the script, and parse the resulting XML. If I'm lucky my parser will be able to be written as a SAX filter that can be added to the N::O::H distribution. In the meantime, at least I will have the data. Wish me luck.

--
Eric Lease Morgan
University Libraries of Notre Dame
Re: [CODE4LIB] net::oai::harvester
Another option if you are in Perl land would be to take a look at Tim Brody's HTTP::OAI library [1] which returns XML::DOM::Document objects for record metadata, which you can walk around in and use to evaluate xpaths:

    use HTTP::OAI;

    my $harvester = HTTP::OAI::Harvester->new(
        baseURL => 'http://memory.loc.gov/cgi-bin/oai2_0'
    );
    my $response = $harvester->ListRecords(metadataPrefix => 'mods');

    while (my $record = $response->next()) {
        print $record->metadata()->dom()->findvalue('//valid/xpath/here'), "\n";
    }

I left the xpath as an exercise for the reader since I couldn't figure out how to use the http://www.loc.gov/mods/v3 namespace properly. Ahh, namespaces :-)

//Ed

[1] http://search.cpan.org/dist/HTTP-OAI/
Re: [CODE4LIB] net::oai::harvester
On Dec 13, 2007, at 9:48 AM, Ed Summers wrote:

>     use Net::OAI::Harvester;
>     use MODSHandler;
>
>     my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
>     my $harvester = Net::OAI::Harvester->new(baseURL => $url);
>     my $records = $harvester->listRecords(
>         metadataPrefix  => 'mods',
>         metadataHandler => 'MODSHandler'
>     );
>
>     while (my $record = $records->next()) {
>         print $record->metadata()->title(), "\n";
>     }
>
> ...
>
> Interestingly, back in 2000 or whatever when this was written it felt
> like pretty state of the art to use filters in this way. But today it
> seems kind of overkill to have to write a state machine just to get
> at some XML. The ruby oai library [2] I worked on more recently kind
> of bucks the trend: it doesn't try to create fancy objects for
> records, hand-waves memory concerns (which never seemed to surface),
> and just returns what amounts to a DOM and lets the user figure out
> what they want.

What type(s) of data are the methods applied against the metadata method (above) expected to return? Only scalars? How about objects? How about other Perl data structures like a hash (of hashes)? Is there a predefined set of methods that can be called against the metadata method?

I suppose the aforementioned MODSHandler can be designed to support any number of methods returning different types of data. Correct? For example, the code above is designed to return a title. Additional methods might return authors, subjects, publishers, etc.

Spurred on by the availability of MBooks from the University of Michigan [1], I have written the beginnings of a SAX filter for MARCXML data. Currently it iterates over MARCXML, parses the data, and prints to STDOUT something looking like a MARC tagged display. Ironically, this was rather easy because MARCXML only has a limited number of elements: leader, controlfield, datafield, and subfield.

Using Ed's code as a model, I think I could create a method called MARC that returns a MARC::Record object, like this:

    use Net::OAI::Harvester;
    use MARCXML;

    my $url = 'http://memory.loc.gov/cgi-bin/oai2_0';
    my $harvester = Net::OAI::Harvester->new( baseURL => $url );
    my $records = $harvester->listRecords(
        metadataPrefix  => 'marc21',
        metadataHandler => 'MARCXML'
    );

    while ( my $record = $records->next ) {
        # call the MARC method returning a MARC::Record object
        my $marc = $record->metadata()->MARC;
        # apply cool MARC::Record methods against the object
        print $marc->title;
    }

Alternatively, I suppose I could create methods like this:

    $leader  = $record->metadata()->leader;
    $control = $record->metadata()->control;
    $title   = $record->metadata()->datafield( '245', 'a' );
    $author  = $record->metadata()->datafield( '100', 'a' );
    $url     = $record->metadata()->datafield( '856', 'u' );

Is this approach a good idea? On the other hand, maybe I should return the whole record in all of its MARC glory. Which approach is better? Maybe I should do both? Maybe I should return a DOM as Ed alludes to. Ah, the choices!

[1] http://lists.webjunction.org/wjlists/xml4lib/2007-December/005978.html

--
Eric Lease Morgan
University Libraries of Notre Dame
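The accessor-style alternative could be backed by a very small data structure. The following is a rough sketch under assumptions, not Eric's filter: MARCXMLSketch is a hypothetical class, the record is built by hand rather than from SAX events, control takes a tag argument here (unlike the call shown above), and the field values are invented.

```perl
use strict;
use warnings;

package MARCXMLSketch;

sub new {
    my ($class, %args) = @_;
    my $self = {
        leader     => $args{leader},
        control    => $args{control}    || {},  # tag => value
        datafields => $args{datafields} || [],  # { tag, subfields => [[code, value], ...] }
    };
    return bless $self, $class;
}

sub leader { return $_[0]{leader} }

sub control {
    my ($self, $tag) = @_;
    return $self->{control}{$tag};
}

# Return the matching subfield values: all of them in list context,
# only the first in scalar context (undef if there is none).
sub datafield {
    my ($self, $tag, $code) = @_;
    my @found;
    for my $field (grep { $_->{tag} eq $tag } @{ $self->{datafields} }) {
        push @found,
            map  { $_->[1] }
            grep { $_->[0] eq $code } @{ $field->{subfields} };
    }
    return wantarray ? @found : $found[0];
}

package main;

# Hand-built record with invented values; a SAX filter would fill this in
# from leader, controlfield, datafield and subfield events.
my $record = MARCXMLSketch->new(
    leader     => '00000nam a2200000 a 4500',
    control    => { '001' => 'example0001' },
    datafields => [
        { tag => '245', subfields => [ [ 'a', 'An example title' ] ] },
        { tag => '856', subfields => [ [ 'u', 'http://example.org/1' ] ] },
        { tag => '856', subfields => [ [ 'u', 'http://example.org/2' ] ] },
    ],
);

my $title = $record->datafield('245', 'a');
my @urls  = $record->datafield('856', 'u');

print "title: $title\n";
print "url:   $_\n" for @urls;
```

Returning a list in list context keeps repeated fields such as 856 from being silently dropped, which is the same first-value-only problem that comes up with repeated oai_dc identifiers.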