Hi,
I have quite a mess with a site that I'm supposed to harvest
selectively. The remote site is a DSpace repository that holds
documents from all the Catalan universities, and we need to collect a
few dozen OAI sets, the ones that belong to our university.
We are facing a few problems and doubts. First, let me show part of a
report I've produced, which illustrates some of the typical problems:
oaiset = hdl_2072_16100 (Reports on Environmental Sciences)
identifiers = 9
not found = 0
duplicated = 0
oaiset = hdl_2072_1749 (Working papers)
identifiers = 401
not found = 11
oai:www.recercat.net:2072/41974
oai:www.recercat.net:2072/41975
oai:www.recercat.net:2072/41976
oai:www.recercat.net:2072/41977
oai:www.recercat.net:2072/41978
oai:www.recercat.net:2072/41979
oai:www.recercat.net:2072/41980
oai:www.recercat.net:2072/41983
oai:www.recercat.net:2072/41985
oai:www.recercat.net:2072/41984
oai:www.recercat.net:2072/41986
duplicated = 27
oai:www.recercat.net:2072/88000 = [63044, 67648]
oai:www.recercat.net:2072/87998 = [63042, 67646, 69419]
oai:www.recercat.net:2072/87999 = [63043, 67647]
oai:www.recercat.net:2072/87990 = [63034, 67639]
oai:www.recercat.net:2072/87991 = [63035, 67640]
[...]
I think the problem originates, at least in part, from some control
characters in the records of the remote site: the XSL transformation
fails with an error and, apparently, the rest of the records in the
batch are not converted and loaded into our local site. I've applied a
simple patch (attached) that seems to solve this. There may be better
ways to remove those problematic characters, so improvements are
welcome. Otherwise, Tibor, please consider applying it upstream.
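One possibly cleaner alternative would be to remove only the
characters that XML 1.0 actually forbids, and leave tabs and newlines
untouched. Just a sketch, not tested inside bibconvert:

import re

# XML 1.0 forbids most chars below 0x20, except tab (0x09), LF (0x0A)
# and CR (0x0D); remove only those instead of blanking the whole range.
_xml_invalid = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_invalid_xml_chars(text):
    """Drop control characters that make the XSLT engine choke."""
    return _xml_invalid.sub('', text)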
After applying it locally, I forced another harvest, but it seems that
it does not collect records older than the last harvesting time, even
if those records do not exist on my site. In a closely related way, we
are not sure whether the older records of the OAI set I'm checking now
will be collected in the next harvesting session.
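From what I understand of OAI-PMH itself (not of the Invenio code),
incremental harvesting is driven purely by datestamps: the harvester
only asks the remote side for records changed since the last run, so
records that are merely missing locally would never be re-sent.
Roughly, I imagine the request that goes out looks like this (endpoint
path and date are made up):

import urllib

base = 'http://www.recercat.net/oai/request'   # assumed OAI endpoint
params = {'verb': 'ListRecords',
          'metadataPrefix': 'oai_dc',
          'set': 'hdl_2072_1749',
          'from': '2010-01-01'}                # changed since this date
print base + '?' + urllib.urlencode(params)

If that is the case, re-harvesting the older records probably needs
the harvest date to be reset or the records to be fetched explicitly.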
I can do a manual harvest-convert-upload (h-c-u) of the records that
I've identified, no problem. But I'd like to know how Invenio decides
that a record has to be collected in the two related scenarios I've
tried to explain above.
Do I have to do any post-processing after my manual h-c-u action? Or
is there a way to feed a known list of local records (or remote
identifiers) to oaiharvest?
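If there is no built-in way, my fallback idea is to fetch the known
identifiers one by one with GetRecord and then run the usual
convert-and-upload steps on the result; roughly (endpoint path
assumed, untested):

import urllib
import urllib2

base = 'http://www.recercat.net/oai/request'   # assumed OAI endpoint
missing = ['oai:www.recercat.net:2072/41974',
           'oai:www.recercat.net:2072/41975']  # ... and the rest

for oai_id in missing:
    params = urllib.urlencode({'verb': 'GetRecord',
                               'metadataPrefix': 'oai_dc',
                               'identifier': oai_id})
    xml = urllib2.urlopen(base + '?' + params).read()
    # then pass xml through bibconvert and bibupload, as in a normal
    # h-c-u run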
Thanks,
Ferran
bibconvert: remove invalid (control) chars
* Translate lower ASCII chars to spaces so the xsl translation doesn't fail
---
bin/bibconvert | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/bin/bibconvert b/bin/bibconvert
index 494c3f8..daba5e9 100755
--- a/bin/bibconvert
+++ b/bin/bibconvert
@@ -24,6 +24,8 @@
__revision__ = "$Id: bibconvert.in,v 1.41 2008/06/06 09:53:46 jerome Exp $"
+validchars = ''.join([' ' for i in range(32)] + [chr(i) for i in range(32,256)])
+
try:
import fileinput
import string
@@ -249,6 +251,7 @@ if opt_value.endswith('.'+
CFG_BIBFORMAT_BFX_FORMAT_TEMPLATE_EXTENSION):
# BibConvert for bfx
source_xml = sys.stdin.read()
+ source_xml = source_xml.translate(validchars)
try:
print bibconvert_bfx_engine.convert(source_xml, extract_tpl)
except NameError:
@@ -263,6 +266,7 @@ if opt_value.endswith('.'+
elif opt_value.endswith('.xsl'):
# BibConvert for XSLT
source_xml = sys.stdin.read()
+ source_xml = source_xml.translate(validchars)
try:
res = bibconvert_xslt_engine.convert(source_xml, extract_tpl)
if res is not None: