Dirk, Johan, Matti and others,

Just an update on the OAI harvester. Here is a rewritten harvester that should work well on archives of any size. Parameters need to be adjusted for the actual request (dates and the like; this one uses sets), but the logic with higher-order functions (the hof namespace in BaseX) works as it should. Instead of adding to a database, it writes results to a file. Consequently it does not require much memory.
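Since the harvester writes a single XML file rather than adding to a database, the result can still be imported into BaseX afterwards in one step. A minimal sketch, assuming the `file.xml` name used in the script below; the database name `oai` and target path `oai.xml` are illustrative assumptions:

```xquery
(: Sketch only: load the harvested file into a fresh database after the
   harvest has finished. 'file.xml' matches the $file variable in the
   harvester; the database name 'oai' is an assumption. :)
db:create('oai', 'file.xml', 'oai.xml')
```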
Best,
Lars

declare namespace oai = "http://www.openarchives.org/OAI/2.0/";

(: URL for the initial request - add suitable parameters for subquerying.
   This one uses a set; if sets are not used, just delete references to it. :)
declare variable $URL2 := "_OAI-URL_?verb=ListRecords&metadataPrefix=marc21&set=";

(: URL for resumption tokens: fill in a suitable OAI URL - this URL need not be changed. :)
declare variable $URL := "_OAI-URL_?verb=ListRecords&resumptionToken=";

(: BaseX HTTP options :)
declare variable $http-option := <http:request method='get'/>;

(: Function for fetching data using a resumption token; writes the result to file. :)
declare function local:getResumption($file, $token) {
  if (empty($token)) then ()
  else
    let $http-request := http:send-request($http-option, $URL || $token)
    let $result :=
      if ($http-request instance of node())
      then $http-request
      else <http-err>{$http-request}</http-err>
    return (
      file:append($file, $result//oai:metadata),
      $result//oai:resumptionToken/text()
    )
};

(: Define OAI set and file for storage :)
let $file := 'file.xml'
let $oai-set := "aset"

(: Get the first batch of data and retrieve the resumption token.
   If sets are not used, just remove the expression adding $oai-set.
   This is the place to build up a more complex OAI query if needed,
   by manipulating the variable $URL2 and the joining of parameters. :)
let $first := http:send-request($http-option, $URL2 || $oai-set)
let $init := $first//oai:resumptionToken/text()

(: Write data to disk and call hof:until(), quitting on an empty resumption token. :)
return (
  file:write-text($file, "<root>"),          (: insert start tag of root element :)
  file:append($file, $first//oai:metadata),  (: write initial sequence of elements :)
  hof:until(
    function($x) { empty($x) },
    function($y) { local:getResumption($file, $y) },
    $init
  ),
  file:append-text($file, "</root>")         (: insert end tag of root element :)
)

2016-05-12 17:30 GMT+02:00 Lars Johnsen <yoon...@gmail.com>:
>
> Thanks for the pointer!
>
> The code is rewritten using hof:until() and tested against a particular set at
> our national provider of library data.
>
> The script still accumulates data, so it will probably still run into
> memory trouble with larger datasets, but the stack overflow should be
> taken care of.
>
> For anyone interested, the code is attached below, using hof:until()
> as the higher-order function. To make it work, fill in the URLs for a chosen
> OAI endpoint, and maybe change some of the request parameters - this one
> fetches marc21 records and uses sets. Some error checking may also be
> implemented.
>
> Cheers,
> Lars
>
>
> declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
>
> (: URL for resumption tokens :)
> declare variable $URL := "oai-URL?verb=ListRecords&resumptionToken=";
>
> (: URL for the initial request :)
> declare variable $URL2 :=
>   "oai-URL?verb=ListRecords&metadataPrefix=marc21&set=";
>
> (: Variable for the OAI set - if not used, remove "set=" in $URL2 :)
> declare variable $oai-set := "aset";
>
> (: BaseX HTTP options :)
> declare variable $http-option := <http:request method='get'/>;
>
> (: ------
>
> Fetch data from the OAI endpoint using a start map containing the resumption token
> and the first set of data.
> The map has two keys, 'resume' and 'chunk', where 'chunk' is an
> accumulator holding data from the current and previous requests.
> hof:until() does not return an aggregated list of maps, so data must be
> collected somehow.
>
> ------:)
>
> declare function local:getResumption($startmap) {
>   let $token := map:get($startmap, 'resume')
>   return
>     if (empty($token)) then $startmap
>     else
>       let $http-request := http:send-request($http-option, $URL || $token)
>       let $result :=
>         if ($http-request instance of node())
>         then $http-request
>         else <http-err>{$http-request}</http-err>
>       return map {
>         'resume': $result//oai:resumptionToken/text(),
>         'chunk': (
>           map:get($startmap, 'chunk'),
>           $result//oai:metadata
>         )
>       }
> };
>
>
> (: Issue the initial request :)
> let $first := http:send-request($http-option, $URL2 || $oai-set)
>
> (: Create the start map :)
> let $init := map {
>   'chunk': $first//oai:metadata,
>   'resume': $first//oai:resumptionToken/text()
> }
>
> let $oai := hof:until(
>   function($x) { empty(map:get($x, 'resume')) },
>   function($y) { local:getResumption($y) },
>   $init
> )
>
> (: Amend with additional code like db:add() or file:write() here :)
>
> return element oai { map:get($oai, 'chunk') }
>
>
> 2016-05-12 15:07 GMT+02:00 Dirk Kirsten <d...@basex.org>:
>
>> Hello Lars,
>>
>> Just a thought (and really just a pointer; I am not a purely
>> functional guy, and I also feel like I am missing something obvious...):
>> maybe you could rewrite the recursive approach using higher-order
>> functions.
Consider a query like the following:
>>
>> hof:scan-left(1 to 100,
>>   map { "token": "starttoken" },
>>   function($result, $index) {
>>     let $req := http:send-request(<http:request method="get"/>,
>>       "http://google.com?q=" || $result("token"))
>>     return map {
>>       "result": $req,
>>       "token" : $req//http:header[@name = "Date"]/@value/data()
>>     }
>>   })
>>
>> It will issue 100 requests to Google and use some specific token from the
>> query before (in this case I used the date). This will output a sequence of
>> the map entries, and in a subsequent step you could return only the actual
>> result values.
>>
>> Best regards,
>> Dirk
>>
>> On 05/12/2016 12:55 PM, Lars Johnsen wrote:
>>
>> Thanks Johan and Matti for the useful suggestions.
>>
>> Cutting down on the chunks seems to be a viable alternative.
>>
>> It would have been nice, though, to have a robust harvester in XQuery
>> that could take on anything, although the recursive version works fine as
>> long as the dataset consists of a couple of thousand entries.
>>
>> Best,
>> Lars
>>
>> 2016-05-12 8:16 GMT+02:00 Lassila, Matti <matti.j.lass...@jyu.fi>:
>>
>>> Hello,
>>>
>>> If your case allows using external tools for harvesting, I can highly
>>> recommend metha (https://github.com/miku/metha), which is a fairly
>>> full-featured command-line OAI-PMH harvester.
>>>
>>> Best regards,
>>>
>>> Matti L.
>>>
>>> On 11/05/16 18:31, "basex-talk-boun...@mailman.uni-konstanz.de on behalf
>>> of Johan Mörén" <basex-talk-boun...@mailman.uni-konstanz.de on behalf of
>>> johan.mo...@gmail.com> wrote:
>>>
>>> > Maybe there is some other way to get the data over. I'll have a talk with
>>> > the guys providing the OAI-endpoint.
>>
>>
>> --
>> Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
>> |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
>> |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
>> |   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
>> `-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22
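Regarding the note above that "some error checking may also be implemented": one way to sketch this is to guard the HTTP call with XQuery 3.0 try/catch, so a transient network failure yields an error element instead of aborting the whole harvest. This is an illustrative sketch only; the helper name local:fetch is hypothetical, and the <http-err> wrapper mirrors the one already used in the harvester:

```xquery
(: Hypothetical helper: wrap http:send-request in try/catch so that a
   failed request returns an <http-err> element instead of raising an
   error and aborting the harvest. :)
declare function local:fetch($url as xs:string) {
  try {
    http:send-request(<http:request method='get'/>, $url)
  } catch * {
    <http-err code="{$err:code}">{$err:description}</http-err>
  }
};
```

The harvester could then test for the <http-err> element and decide whether to stop, skip, or retry that batch.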