Hi Ferran,

On Mon, 2011-04-04 at 09:23 +0200, Ferran Jorba wrote: 
> So I understand it better.  There is a *single* update time for a whole
> repository, even if this repository has multiple oaisets to be
> harvested.  Thus, if the procedure fails before the whole site is
> completed, the lastrun time is not updated and I end up with duplicate
> records.

Yes. The separate "filtering" (f) post-process step can come into play
here to help avoid duplicate entries of this sort. BibUpload also
provides some basic duplicate checks, such as on the OAI ID, but these
may not be enough.

As a side note, duplicates can also arise when a record belongs to
several OAI sets and is therefore harvested several times in the same
session. Recent changes to BibHarvest help avoid this issue, though.
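
To illustrate the idea only (this is not the actual BibHarvest code, and
the function name and tuple layout below are made up for the example),
collapsing records that arrive through several sets in one session could
look roughly like this, keeping the first copy seen for each OAI ID:

```python
def dedupe_by_oai_id(harvested):
    """Keep one record per OAI identifier.

    `harvested` is a hypothetical iterable of (set_spec, oai_id, record)
    tuples collected during a single harvesting session.
    """
    seen = {}
    for set_spec, oai_id, record in harvested:
        # A record that belongs to several sets shows up once per set;
        # only the first occurrence is kept.
        if oai_id not in seen:
            seen[oai_id] = record
    return list(seen.values())
```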

> May I suggest to keep a more granular table for remote oaisets?  The
> admin UI is fine, I'm happy to have a single entry for a single remote
> site, but I understand that the table with the lastrun value should be
> set to the oaiset value, in a more relational fashion, so to say.
> 
> This way, we have an additional benefit: if I add a new oaiset to be
> harvested now, even if this oaiset has existed for some time, I have the
> guarantee that all those records are going to be harvested.  Now it is
> not the case.
> 
> And, as errors happen, if something goes wrong, it would be easier to
> cherry-pick the missing oaisets to harvest them manually.
> 
> Do you think is it reasonable?  And can we expect it?

This is a good point, and seems very reasonable indeed. I can add this
suggestion in a new Trac ticket at http://invenio-software.org to put it
in the pipeline for future updates.
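
Just to make sure we mean the same thing, here is a rough sketch of how
I read the proposal. It is an assumption about a possible design, not
existing Invenio code: `harvest_sets`, the `lastrun` dictionary and the
`harvest_one` callback are hypothetical names. Each oaiset keeps its own
lastrun value, a set without a value is harvested from the beginning,
and a failure in one set leaves only that set's timestamp untouched.

```python
from datetime import datetime


def harvest_sets(sets, lastrun, harvest_one, now):
    """Harvest each set separately and track lastrun per set.

    sets        -- list of set names to harvest
    lastrun     -- dict mapping set name to the datetime of its last
                   successful harvest (updated in place)
    harvest_one -- hypothetical callable(set_spec, since) that performs
                   the harvest and raises an exception on failure
    now         -- timestamp to store for the sets that succeed
    Returns the list of sets that failed.
    """
    failed = []
    for set_spec in sets:
        # A newly added set has no entry, so `since` is None and the
        # callback is expected to harvest everything.
        since = lastrun.get(set_spec)
        try:
            harvest_one(set_spec, since)
        except Exception:
            # Keep the old timestamp so the next run retries this set.
            failed.append(set_spec)
            continue
        lastrun[set_spec] = now
    return failed
```

The returned list would also give the admin the sets to cherry-pick for
a manual re-harvest.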

> I've counted 314 records not harvested from 32 oaisets.  I'll write some
> script to process and load them.  However, this leads me to the second
> part of my question: does the bibharvest procedure add some information
> in some table that I have to update, or it is just this lastrun column
> in oaiHARVEST table?

The harvesting procedure only updates the repository lastrun column,
yes. There are no other updates, apart from adding a log entry to the
harvest history upon successful bibupload.

> Have you taken a look at the patch I sent?  Anything that we do in order
> to comply more with the Postel Law
> (http://en.wikipedia.org/wiki/Robustness_principle) would be beneficial.

Yes, thank you. This makes total sense to me. We should be able to get
this in.

Thanks,
Jan

-- 
-------
Jan Age Lavik <[email protected]>
CERN Technical Student
Open-Access Group (GS-SIS-OA)

Phone: +41 22 767 9092
Office: 3-1-011

