Dirk, Johan, Matti and others

Just an update on the OAI-Harvester. Here is a rewritten harvester that
should work well on archives of any size. The parameters need to be adjusted
for the actual request (dates and the like; this one uses sets), but
the logic with higher-order functions (the hof namespace in BaseX) works as it
should. Instead of adding the results to a database, it writes them to a file.
Consequently, it does not require much memory.

Best, Lars


declare namespace oai = "http://www.openarchives.org/OAI/2.0/";

(:***  URL for the initial request - add suitable parameters for subquerying.
This one uses a set; if sets are not used, just delete the references to it :)
declare variable $URL2 := "_OAI-URL_?verb=ListRecords&amp;metadataPrefix=marc21&amp;set=";

(:***  URL for resumption tokens: fill in a suitable OAI-URL; the rest of
this URL need not be changed :)
declare variable $URL := "_OAI-URL_?verb=ListRecords&amp;resumptionToken=";


(: *** basex http :)
declare variable $http-option := <http:request method='get' />;


(:**********************

Function for fetching data using resumption token which writes result to
file

 **********************:)

declare function local:getResumption($file, $token) {
  if (empty($token)) then
    ()
  else
    let $http-request := http:send-request($http-option, $URL || $token)
    let $result :=
      if ($http-request instance of node()) then
        $http-request
      else
        <http-err>{$http-request}</http-err>
    return (
      (: append this batch to the file, then return the next token,
         which is empty when the harvest is complete :)
      file:append($file, $result//oai:metadata),
      $result//oai:resumptionToken/text()
    )
};


(:************

Define oai set and file for storage

 *************:)

let $file := 'file.xml'
let $oai-set :=  "aset"



(:*************

Get the first batch of data and retrieve the resumption token.
If sets are not used, just remove the expression adding $oai-set
This is the place for building up a more complex OAI-query if needed, by
manipulating the variable $URL2 and the joining of parameters.

***************:)

let $first := http:send-request($http-option, $URL2 || $oai-set)
let $init := $first//oai:resumptionToken/text()


(:**************

write data to disk and call hof:until(), quitting on empty resumption token

****************:)

return (
  file:write-text($file, "<root>"),  (: insert start tag of root element :)

  file:append($file, $first//oai:metadata),  (: write initial sequence of elements :)

  hof:until(              (:   call hof:until()   :)
    function($x) {
      empty($x)
    },

    function($y) {
      local:getResumption($file, $y)
    },
    $init
  ),

  file:append-text($file, "</root>")  (: insert end tag of root element :)
)
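
The script above wraps the response in <http-err> but does not check whether a request actually succeeded. A minimal sketch of such a check follows, assuming the endpoint answers with XML. The helper name local:fetch and the error names local:http and local:oai are hypothetical and not part of the script; the _OAI-URL_ placeholder is left unfilled as above.

```xquery
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";

(: Hypothetical helper, not part of the script above: fetch a URL and return
   the response body, raising an error on an HTTP or OAI-PMH failure.
   Assumes the body is XML. :)
declare function local:fetch($url as xs:string) as item()* {
  let $response := http:send-request(<http:request method='get'/>, $url)
  (: the first item is the http:response element, the rest is the body :)
  let $status := string($response[1]/@status)
  let $body := $response[position() > 1]
  let $oai-error := $body//oai:error
  return
    if ($status != '200') then
      error(xs:QName('local:http'),
        'HTTP status ' || $status || ' for ' || $url)
    else if (exists($oai-error)) then
      error(xs:QName('local:oai'),
        'OAI-PMH error: ' || string-join($oai-error/@code, ', '))
    else
      $body
};

local:fetch("_OAI-URL_?verb=Identify")
```

If something like this is used, the calls to http:send-request() in the harvester could go through it instead. Note that OAI-PMH reports an empty result as the error code noRecordsMatch, which a harvester may prefer to treat as an empty batch rather than a failure.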




2016-05-12 17:30 GMT+02:00 Lars Johnsen <yoon...@gmail.com>:

>
> Thanks for the pointer!
>
> The code is rewritten using hof:until() and tested against a particular set at
> our national provider of library data.
>
> The script still accumulates data, so it will probably still run into
> memory troubles with larger datasets, but the stack-overflow should be
> taken care of.
>
> For anyone interested, the code is attached below, using hof:until()
> as the higher-order function. To make it work, fill in the URLs for a chosen
> OAI endpoint, and maybe change some of the request parameters - this one
> fetches marc21 records and uses sets. Some error checking could also be
> added.
>
> Cheers,
> Lars
>
>
> declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
>
> (:URL for  resumption tokens :)
> declare variable $URL := "oai-URL?verb=ListRecords&amp;resumptionToken=";
>
> (:URL for initial request:)
> declare variable $URL2 :=
> "oai-URL?verb=ListRecords&amp;metadataPrefix=marc21&amp;set=";
>
> (: Variable for OAI-set - if not used, remove "set=" in URL2 :)
> declare variable $oai-set := "aset";
>
>
> (: basex http :)
> declare variable $http-option := <http:request method='get' />;
>
>
> (: ------
>
> Fetch data from OAI-endpoint using a start map containing resumption token
> and the first set of data.
> The map has two keys, 'resume' and 'chunk', where 'chunk' is an
> accumulator holding data from the current and previous requests.
> hof:until() does not return an aggregated list of maps, so data must be
> collected somehow
>
> ------:)
>
> declare function local:getResumption($startmap) {
>
>   let $token := map:get($startmap, 'resume')
>   return if (empty($token)) then
>     $startmap
>   else
>     let $http-request := http:send-request($http-option, $URL || $token)
>     let $result := if ($http-request instance of node()) then
>         $http-request
>       else
>         <http-err>{$http-request}</http-err>
>     return  map {
>       'resume':  $result//oai:resumptionToken/text(),
>       'chunk': (
>         map:get($startmap, 'chunk'),
>         $result//oai:metadata
>       )
>     }
>  };
>
>
> (: Issue initial request :)
>
> let $first := http:send-request($http-option, $URL2 || $oai-set)
>
> (:  Create startmap :)
>
> let $init := map {
>   'chunk': $first//oai:metadata,
>   'resume': $first//oai:resumptionToken/text()
> }
>
> let $oai :=  hof:until(
>
>   function($x) {
>     empty(map:get($x, 'resume'))
>   },
>
>   function($y) {
>     local:getResumption($y)
>  },
>  $init
> )
>
> (: Amend with additional code like db:add() or file:write() here :)
>
> return element oai {map:get($oai, 'chunk')}
>
>
> 2016-05-12 15:07 GMT+02:00 Dirk Kirsten <d...@basex.org>:
>
>> Hello Lars,
>>
>> just a thought (and really just a pointer; I am not a purely
>> functional guy, and I also feel like I am missing something obvious...):
>> Maybe you could rewrite the recursive approach using higher-order
>> functions. Consider a query like the following
>>
>> hof:scan-left(1 to 100,
>>   map { "token": "starttoken" },
>>   function($result, $index) {
>>     let $req := http:send-request(<http:request method="get"/>,
>>       "http://google.com?q=" || $result("token"))
>>     return map {
>>       "result": $req,
>>       "token" : $req//http:header[@name = "Date"]/@value/data()
>>     }
>> })
>>
>> It will issue 100 requests to google and use some specific token from the
>> query before (in this case I used the date). This will output a sequence of
>> the map entries and in a subsequent step you could return only the actual
>> result values.
>>
>> Best regards, Dirk
>>
>> On 05/12/2016 12:55 PM, Lars Johnsen wrote:
>>
>> Thanks Johan and Matti for useful suggestions.
>>
>> Cutting down on the chunks seems to be a viable alternative.
>>
>> It would have been nice, though, to have a robust harvester in XQuery
>> that could take on anything, although the recursive version works fine as
>> long as the dataset consists of a couple of thousand entries.
>>
>> Best,
>> Lars
>>
>> 2016-05-12 8:16 GMT+02:00 Lassila, Matti <matti.j.lass...@jyu.fi>:
>>
>>> Hello,
>>>
>>> If your case allows using external tools for harvesting, I can highly
>>> recommend metha (https://github.com/miku/metha) which is a fairly full
>>> featured command line OAI-PMH harvester.
>>>
>>> Best regards,
>>>
>>> Matti L.
>>>
>>> On 11/05/16 18:31 , "basex-talk-boun...@mailman.uni-konstanz.de on
>>> behalf
>>> of Johan Mörén" <basex-talk-boun...@mailman.uni-konstanz.de on behalf of
>>> johan.mo...@gmail.com> wrote:
>>>
>>> >Maybe there is some other way to get the data over. I'll have a talk
>>> with
>>> >the guys providing the OAI-endpoint.
>>>
>>>
>>
>> --
>> Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
>> |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
>> |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
>> |   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
>> `-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22
>>
>>
>
