Re: Manually completing an incomplete harvested site

Ferran Jorba Mon, 4 Apr 2011 09:27:27 +0200

Hello Jan,

> I have been working a bit with BibHarvest lately, and perhaps I can
> help answer some of your questions. See comments below.
>
> On Fri, 2011-04-01 at 14:00 +0200, Ferran Jorba wrote:
>> After applying it locally, now I've forced another harvest, but it
>> seems that it doesn't collect older-than-last-harvesting-time
>> records, even if those records do not exist in my site.  In a very
>> related way, we are not sure if older records of a
>> this-oaiset-that-I'm-checking-now are going to be collected next
>> harvesting session.
>
> The OAI harvester looks in the table oaiHARVEST column 'lastrun' to
> determine if lastrun + frequency > today. If true, no harvesting
> happens. In your case, perhaps a dirty fix could be to do an explicit
> UPDATE statement to change lastrun to a previous date, then run the
> harvesting job again. E.g.
>
> >>> run_sql("UPDATE oaiHARVEST SET lastrun = 'yyyy-mm-dd hh:mm:ss' WHERE id=SOMEID")


So I understand it better.  There is a *single* update time for a whole
repository, even if this repository has multiple oaisets to be
harvested.  Thus, if the procedure fails before the whole site is
completed, the lastrun time is not updated and I end up with duplicate
records.
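
As a way of checking my understanding, the rule described above (skip the
harvest while lastrun + frequency > today) could be sketched roughly like
this. It is only an illustration and not the actual BibHarvest code: the
function name, the assumption that the frequency is expressed in hours, and
the handling of a missing lastrun are all mine.

```python
from datetime import datetime, timedelta


def harvest_is_due(lastrun, frequency_hours, now):
    """Return True when a new harvest should run.

    Mirrors the quoted rule: if lastrun + frequency > now, nothing happens.
    Assumption: a lastrun of None (never harvested) always counts as due.
    Assumption: the frequency is given in hours.
    """
    if lastrun is None:
        return True
    return lastrun + timedelta(hours=frequency_hours) <= now


now = datetime(2011, 4, 4, 9, 0, 0)
recent = datetime(2011, 4, 3, 20, 0, 0)   # harvested last night
old = datetime(2011, 3, 1, 0, 0, 0)       # lastrun moved back by hand

print(harvest_is_due(recent, 24, now))
print(harvest_is_due(old, 24, now))
```

Setting lastrun back by hand, as in the UPDATE statement above, makes the
second case apply, so the next run would no longer be skipped.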

May I suggest keeping a more granular table for remote oaisets?  The
admin UI is fine, I'm happy to have a single entry for a single remote
site, but I think the lastrun value should be stored per oaiset, in a
more relational fashion, so to say.

This way, we have an additional benefit: if I add a new oaiset to be
harvested now, even if this oaiset has existed for some time, I have the
guarantee that all those records are going to be harvested.  Now it is
not the case.

And, as errors happen, if something goes wrong, it would be easier to
cherry-pick the missing oaisets to harvest them manually.
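
To make the suggestion concrete, here is a minimal sketch of what a
per-oaiset lastrun table could look like.  It uses an in-memory SQLite
database only so that it is self-contained; the table name oaiHARVEST_set,
its columns and the helper functions are hypothetical and do not exist in
Invenio.

```python
import sqlite3


def make_db():
    """Create a hypothetical per-set table: one lastrun per (source, set)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE oaiHARVEST_set ("
        " harvest_id INTEGER NOT NULL,"   # would point at oaiHARVEST.id
        " setspec TEXT NOT NULL,"         # the remote oaiset
        " lastrun TEXT,"                  # 'yyyy-mm-dd hh:mm:ss'
        " PRIMARY KEY (harvest_id, setspec))"
    )
    return conn


def from_date_for(conn, harvest_id, setspec):
    """Return the lastrun of one set, or None for a set never harvested.

    None would mean 'harvest everything', so a newly added set gets all
    of its older records too.
    """
    row = conn.execute(
        "SELECT lastrun FROM oaiHARVEST_set"
        " WHERE harvest_id = ? AND setspec = ?",
        (harvest_id, setspec),
    ).fetchone()
    return row[0] if row else None


def mark_harvested(conn, harvest_id, setspec, when):
    """Record a successful harvest of a single set."""
    conn.execute(
        "INSERT OR REPLACE INTO oaiHARVEST_set"
        " (harvest_id, setspec, lastrun) VALUES (?, ?, ?)",
        (harvest_id, setspec, when),
    )


conn = make_db()
mark_harvested(conn, 1, "physics", "2011-04-01 14:00:00")
print(from_date_for(conn, 1, "physics"))
print(from_date_for(conn, 1, "maths"))
```

With something like this, a failure halfway through a site would leave the
already completed sets marked as done, and only the remaining ones would
need to be retried.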

Do you think this is reasonable?  And can we expect it?

>> I can do a manual harvesting-converting-and-uploading (h-c-u) of the
>> records that I've identified, no problem.  But I'd like to know how
>> Invenio decides that a record has to be collected for the two related
>> scenarios that I've tried to explain in my previous paragraph.
>> 
>> Do I have to do any post-processing after doing my manual h-c-u action?
>> Or, is there a way that I can feed a known list of local records (or
>> remote identifiers) to oaiharvest?
>
> I believe if you have a simple post-process workflow i.e. h-c-u, you
> should be able to run single harvesting runs through OAI Harvest Admin
> Interface. Only caveat is that it only accepts single(!) identifiers.

I've counted 314 records not harvested from 32 oaisets.  I'll write some
script to process and load them.  However, this leads me to the second
part of my question: does the bibharvest procedure add some information
in some table that I have to update, or is it just this lastrun column
in the oaiHARVEST table?
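
The script I have in mind would start from something like the sketch
below: since the protocol only fetches one identifier at a time, it builds
one OAI-PMH GetRecord request per missing record.  The base URL and the
identifiers are placeholders, the function name is mine, and the download,
conversion and upload steps are left out.

```python
from urllib.parse import urlencode


def getrecord_urls(base_url, identifiers, metadata_prefix="oai_dc"):
    """Build one OAI-PMH GetRecord URL per identifier.

    GetRecord takes exactly one identifier, so a list of missing
    records has to be turned into a list of separate requests.
    """
    urls = []
    for identifier in identifiers:
        query = urlencode({
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        })
        urls.append(base_url + "?" + query)
    return urls


# Placeholder repository and identifiers, for illustration only.
missing = ["oai:example.org:1", "oai:example.org:2"]
for url in getrecord_urls("http://example.org/oai2d", missing):
    print(url)
```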

> Support for these sorts of identifier lists is something that would be
> nice to have, and hopefully something we can look into adding soon.
> Similar updates to the interface and harvester are already planned. See
> http://invenio-software.org/ticket/483 

That improvement would help, sure it would.

> (I see from the OAI v2.0 spec that harvesting lists of identifiers is
> not directly supported by the protocol either)
>
> Hope this helps.

Sure it does.

Have you taken a look at the patch I sent?  Anything that we do in order
to comply more with Postel's Law
(http://en.wikipedia.org/wiki/Robustness_principle) would be beneficial.

Thanks,

Ferran
